Tuesday, April 15, 2008

Unicode 5.1 release and Indic changes

Unicode 5.1 was released earlier this month, on 4th April. Here I have put up a diff of the Unicode 5.1 character database against that of Unicode 5.0. My buddy Parag also did a nice job of summarizing the Indic-specific changes, which I am restating here.
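For anyone who wants to reproduce a diff like this, here is a minimal Python sketch. It assumes the contents of the two UnicodeData.txt versions are already loaded as strings; the two samples below are hand-typed excerpts, and the function names are only my illustration, not an official tool. It looks at the first two fields only and does not expand the "First"/"Last" range entries used for large blocks such as CJK.

```python
def parse_unicode_data(text):
    """Map code point -> character name from UnicodeData.txt-style text."""
    names = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        # Field 0 is the code point in hex, field 1 is the character name.
        names[int(fields[0], 16)] = fields[1]
    return names


def added_characters(old_text, new_text):
    """Return sorted (code point, name) pairs that exist only in the newer data."""
    old = parse_unicode_data(old_text)
    new = parse_unicode_data(new_text)
    return sorted((cp, name) for cp, name in new.items() if cp not in old)


# Hand-typed excerpts standing in for the real 5.0 and 5.1 files (assumption).
OLD_SAMPLE = """\
0970;DEVANAGARI ABBREVIATION SIGN;Po;0;L;;;;;N;;;;;
097B;DEVANAGARI LETTER GGA;Lo;0;L;;;;;N;;;;;
"""
NEW_SAMPLE = """\
0970;DEVANAGARI ABBREVIATION SIGN;Po;0;L;;;;;N;;;;;
0971;DEVANAGARI SIGN HIGH SPACING DOT;Lm;0;L;;;;;N;;;;;
0972;DEVANAGARI LETTER CANDRA A;Lo;0;L;;;;;N;;;;;
097B;DEVANAGARI LETTER GGA;Lo;0;L;;;;;N;;;;;
"""

for cp, name in added_characters(OLD_SAMPLE, NEW_SAMPLE):
    print(f"{cp:04X}; {name}")
```

With the real files, the same two functions can be fed the full text of each version read from disk.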

So, here are the updates to the UCD for Indian scripts:

A. New Indic Scripts Added to Unicode:

1. LEPCHA:

Lepcha is a language spoken by the Lepcha people in Sikkim in India, and in parts of Nepal and Bhutan. The Lepcha script (also known as "róng") is a syllabic script which has a lot of special marks and requires ligatures. Its genealogy is unclear. Early Lepcha manuscripts were written vertically, a sign of Chinese influence. Lepcha is considered to be one of the aboriginal languages of the area in which it is spoken.

The total number of speakers is close to 50,000.

Unicode Range => U+1C00 to U+1C4F

Chart URL => http://www.unicode.org/charts/PDF/U1C00.pdf

2. OL CHIKI:

The Ol Chiki script, also known as Ol Cemetʼ ("language of writing"), Ol Ciki, Ol, and sometimes the Santali alphabet, was created in 1925 by Pandit Raghunath Murmu for the Santali language. Santali, spoken by the Santals, is a language in the Munda subfamily of Austro-Asiatic, related to Ho and Mundari. It has about six million speakers in India, Bangladesh, Nepal, and Bhutan, most of them in the Indian states of Jharkhand, Assam, Bihar, Orissa, Tripura, and West Bengal. Literacy among its speakers is very low, between 10 and 30%.

Unicode Range => U+1C50 to U+1C7F

Chart URL => http://www.unicode.org/charts/PDF/U1C50.pdf

3. SAURASHTRA:

Saurashtra (more correctly Sauraṣṭri, Sauraṣṭram, or Sourashtra, and also known as Palkar, Sowrashtra, or Saurashtram) is an Indo-Aryan language spoken in parts of the southern Indian state of Tamil Nadu. The Saurashtra community is referred to by the same name, or sometimes by the Tamil name, Pattunoolkaarar. The Ethnologue puts the number of speakers at 510,000 (1997 IMA), although the actual number could be double this figure or even more.

Unicode Range => U+A880 to U+A8D9 (the block itself is allocated through U+A8DF)

Chart URL => http://www.unicode.org/charts/PDF/UA880.pdf
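As a quick illustration of how these ranges can be used, here is a small Python sketch that tells whether a character belongs to one of the three new scripts. The table simply repeats the ranges listed above; the function name is made up for this example.

```python
# Ranges as listed in this post for the three scripts new in Unicode 5.1.
NEW_SCRIPT_RANGES = [
    ("Lepcha", 0x1C00, 0x1C4F),
    ("Ol Chiki", 0x1C50, 0x1C7F),
    ("Saurashtra", 0xA880, 0xA8D9),
]


def new_indic_script(ch):
    """Return the script name if ch falls in one of the new ranges, else None."""
    cp = ord(ch)
    for name, first, last in NEW_SCRIPT_RANGES:
        if first <= cp <= last:
            return name
    return None


for sample in ("\u1C00", "\u1C5A", "\uA882", "A"):
    print(f"U+{ord(sample):04X} -> {new_indic_script(sample)}")
```

A range check like this does not need an updated Unicode database on the machine, which is handy while libraries are still catching up with 5.1.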


B. Updates to Existing Scripts in Unicode:

1. DEVANAGARI (2 New Characters):

0971; SIGN HIGH SPACING DOT
0972; LETTER CANDRA A


2. GURMUKHI (2 New Characters):

0A51; SIGN UDAAT
0A75; SIGN YAKASH


3. ORIYA (3 New Characters):

0B44; VOWEL SIGN VOCALIC RR
0B62; VOWEL SIGN VOCALIC L
0B63; VOWEL SIGN VOCALIC LL

4. TAMIL (1 New Character):

0BD0; OM

5. TELUGU (13 New Characters):

0C3D; SIGN AVAGRAHA
0C58; LETTER TSA
0C59; LETTER DZA
0C62; VOWEL SIGN VOCALIC L
0C63; VOWEL SIGN VOCALIC LL
0C78; FRACTION DIGIT ZERO FOR ODD POWERS OF FOUR
0C79; FRACTION DIGIT ONE FOR ODD POWERS OF FOUR
0C7A; FRACTION DIGIT TWO FOR ODD POWERS OF FOUR
0C7B; FRACTION DIGIT THREE FOR ODD POWERS OF FOUR
0C7C; FRACTION DIGIT ONE FOR EVEN POWERS OF FOUR
0C7D; FRACTION DIGIT TWO FOR EVEN POWERS OF FOUR
0C7E; FRACTION DIGIT THREE FOR EVEN POWERS OF FOUR
0C7F; SIGN TUUMU

6. MALAYALAM (17 New Characters):

0D3D; SIGN AVAGRAHA
0D44; VOWEL SIGN VOCALIC RR
0D62; VOWEL SIGN VOCALIC L
0D63; VOWEL SIGN VOCALIC LL
0D70; NUMBER TEN
0D71; NUMBER ONE HUNDRED
0D72; NUMBER ONE THOUSAND
0D73; FRACTION ONE QUARTER
0D74; FRACTION ONE HALF
0D75; FRACTION THREE QUARTERS
0D79; DATE MARK
0D7A; LETTER CHILLU NN
0D7B; LETTER CHILLU N
0D7C; LETTER CHILLU RR
0D7D; LETTER CHILLU L
0D7E; LETTER CHILLU LL
0D7F; LETTER CHILLU K
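
To check whether a given system already knows about the additions listed in this section, something like the following Python sketch may help. It uses the standard unicodedata module; the code points are the ones listed above, and the helper name is just for illustration. On an installation whose Unicode database predates 5.1, the lookups fall back to the placeholder text.

```python
import unicodedata

# Code points added to existing Indic scripts, as listed in this post.
NEW_CODE_POINTS = {
    "Devanagari": [0x0971, 0x0972],
    "Gurmukhi": [0x0A51, 0x0A75],
    "Oriya": [0x0B44, 0x0B62, 0x0B63],
    "Tamil": [0x0BD0],
    "Telugu": [0x0C3D, 0x0C58, 0x0C59, 0x0C62, 0x0C63] + list(range(0x0C78, 0x0C80)),
    "Malayalam": [0x0D3D, 0x0D44, 0x0D62, 0x0D63]
    + list(range(0x0D70, 0x0D76))
    + list(range(0x0D79, 0x0D80)),
}


def describe(cp):
    """Return the character name, or a placeholder if this database lacks it."""
    return unicodedata.name(chr(cp), "<unknown to this Unicode database>")


print("Unicode database version:", unicodedata.unidata_version)
for script, code_points in NEW_CODE_POINTS.items():
    print(f"{script} ({len(code_points)} new):")
    for cp in code_points:
        print(f"  {cp:04X}; {describe(cp)}")
```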

All the New Unicode Charts can now be found here:

http://www.unicode.org/charts/


The changes to Tamil and Malayalam involve a lot more than just additional characters. On one side, I think the Tamil community will be happy that Unicode has provided Tamil named character sequences to simplify script processing; on the other side, the Malayalam community is not so happy about the atomic chillu characters. Here is their opposition.
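
To make the chillu issue a bit more concrete, here is a rough Python sketch. As far as I understand the Unicode 5.1 documentation, chillus were earlier written as consonant + virama + ZWJ, and the new atomic characters are now an alternative encoding for the same letters; please treat the mapping table below as my reading of the standard and verify it before relying on it. The function name is hypothetical.

```python
import unicodedata

VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Older sequence -> atomic chillu (assumed mapping, to be verified).
CHILLU_SEQUENCES = {
    "\u0D23" + VIRAMA + ZWJ: "\u0D7A",  # CHILLU NN
    "\u0D28" + VIRAMA + ZWJ: "\u0D7B",  # CHILLU N
    "\u0D30" + VIRAMA + ZWJ: "\u0D7C",  # CHILLU RR
    "\u0D32" + VIRAMA + ZWJ: "\u0D7D",  # CHILLU L
    "\u0D33" + VIRAMA + ZWJ: "\u0D7E",  # CHILLU LL
}


def to_atomic_chillus(text):
    """Replace each older chillu sequence in text with its atomic character."""
    for sequence, atomic in CHILLU_SEQUENCES.items():
        text = text.replace(sequence, atomic)
    return text


old_form = "\u0D28" + VIRAMA + ZWJ
new_form = to_atomic_chillus(old_form)
print([f"U+{ord(c):04X}" for c in old_form], "->", [f"U+{ord(c):04X}" for c in new_form])

# Normalization leaves the sequence alone, so the two spellings stay distinct.
print(unicodedata.normalize("NFC", old_form) == new_form)
```

Since normalization does not unify the two forms, searching, sorting and input methods have to cope with both spellings of the same word.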

I am personally very happy that 0972 (LETTER CANDRA A) has been added to Devanagari. This will help fix the spelling of words like 'Apple' and 'Anaconda' in Marathi. The inclusion of the Ol Chiki script is also a very good initiative.

There is actually a lot of work to be done around all these changes, across fonts, rendering, keymaps, locales, and so on. I will have to come up with the details of all that very soon.