3H ITek Studio: UTF16 code points

Tuesday, November 27, 2012

UTF16 code points

UTF-16: 0x00-0000 -- 0x10-FFFF, 21 bits, 1,112,064 encodable code points (1,114,112 values minus the 2,048 surrogates)
UCS-2: 0x0000 -- 0xFFFF, the older standard, 16 bits
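
The figures above can be checked with a little arithmetic. This is only a sketch; the constant names are made up for illustration and do not come from any library.

```python
# Hypothetical constant names, chosen for this note only.
UNICODE_MAX = 0x10FFFF
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF

total_values = UNICODE_MAX + 1                          # 0x110000 = 1,114,112
surrogate_count = SURROGATE_LAST - SURROGATE_FIRST + 1  # 2,048
encodable = total_values - surrogate_count              # 1,112,064
ucs2_values = 0xFFFF + 1                                # 65,536 (16 bits)
bits_needed = UNICODE_MAX.bit_length()                  # 21
```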

The leading 5 bits (0x00 -- 0x10) divide the code space into 17 planes of 65,536 code points each.
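
A minimal sketch of the plane split: dropping the low 16 bits of a code point leaves the plane number. The helper name `plane_of` is hypothetical.

```python
def plane_of(cp: int) -> int:
    """Return the plane number (0..16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    # The low 16 bits index into the plane; the remaining high bits select it.
    return cp >> 16
```

For example, `plane_of(0x0041)` is 0 (the BMP), `plane_of(0x1F600)` is 1, and `plane_of(0x10FFFF)` is 16.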

//=== Code points U+0000 to U+D7FF and U+E000 to U+FFFF
This range is the BMP (Basic Multilingual Plane), excluding the surrogate block U+D800 to U+DFFF.
Most commonly used characters are located here.

//=== Code points U+010000 to U+10FFFF
A code point in this range is encoded as two 16-bit numbers:
a lead (leading) surrogate and a trail (trailing) surrogate.

Let cp = code point
lead surrogate = (((cp - 0x010000) >> 10) & 0x03FF) + 0xD800; // the top 10 bits
trail surrogate = ((cp - 0x010000) & 0x03FF) + 0xDC00; // the lower 10 bits

lead surrogate range: [ 0xD800 , 0xDBFF ]
trail surrogate range: [ 0xDC00 , 0xDFFF ]
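
A worked example of the surrogate split, as a sketch. The function name `to_surrogate_pair` is hypothetical. The parentheses matter here: in C-like languages and in Python, `+` binds tighter than `&`.

```python
def to_surrogate_pair(cp: int) -> tuple:
    """Split a supplementary code point into (lead, trail) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+010000..U+10FFFF need a surrogate pair")
    offset = cp - 0x10000                      # 20-bit value
    lead = 0xD800 + ((offset >> 10) & 0x3FF)   # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)          # lower 10 bits
    return lead, trail
```

For U+1F600 the offset is 0xF600, so the top 10 bits are 0x3D and the lower 10 bits are 0x200, which gives the pair (0xD83D, 0xDE00). The two ends of the range map to (0xD800, 0xDC00) and (0xDBFF, 0xDFFF).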

//=== Code points U+D800 to U+DFFF

Reserved for surrogates; no characters are assigned to these code points.

So UTF-16 is a transform
that maps [ 0x00-0000 , 0x10-FFFF ] - [ 0xD800 , 0xDFFF ]
to
one 16-bit number (2-byte char, identity transform) or two 16-bit numbers (4-byte char, surrogate pair transform).
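
The forward transform described above could be sketched as follows. `utf16_code_units` is a hypothetical helper written for this note, not a library function.

```python
from typing import List

def utf16_code_units(text: str) -> List[int]:
    """Map each code point of text to one or two 16-bit numbers."""
    units = []
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            # The surrogate range is excluded from the domain of the transform.
            raise ValueError("surrogate code point cannot be encoded")
        if cp <= 0xFFFF:
            units.append(cp)                           # identity transform
        else:
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))      # lead surrogate
            units.append(0xDC00 + (offset & 0x3FF))    # trail surrogate
    return units
```

As a cross-check, writing each unit out as two big-endian bytes should match Python's built-in `str.encode("utf-16-be")`.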

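The inverse transform (decoding) for UTF-16
pseudo code snippet

int c = read2bytes();
if (c < 0xD800 || c > 0xDFFF)
    return c; // BMP code point, stored as-is
else {
    // c is a lead surrogate; the next 16-bit number must be the trail surrogate
    int c2 = read2bytes();
    // assume c2 is correctly read out and lies in [0xDC00, 0xDFFF]
    // the parentheses are required: + binds tighter than <<
    return 0x010000 + ((c - 0xD800) << 10) + (c2 - 0xDC00);
}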
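
A runnable sketch of the same inverse transform, with the checks that the pseudo code leaves as assumptions. The name `decode_utf16_be` is hypothetical, and big-endian byte order is an assumption too: the note does not say how read2bytes() orders its bytes.

```python
from typing import List

def decode_utf16_be(data: bytes) -> List[int]:
    """Decode big-endian UTF-16 bytes into a list of code points."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # Assumption: each 16-bit number is stored high byte first.
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    code_points = []
    i = 0
    while i < len(units):
        c = units[i]
        i += 1
        if c < 0xD800 or c > 0xDFFF:
            code_points.append(c)              # BMP code point, stored as-is
        elif c <= 0xDBFF and i < len(units) and 0xDC00 <= units[i] <= 0xDFFF:
            c2 = units[i]                      # valid trail surrogate
            i += 1
            code_points.append(0x10000 + ((c - 0xD800) << 10) + (c2 - 0xDC00))
        else:
            raise ValueError("unpaired surrogate 0x%04X" % c)
    return code_points
```

For example, the bytes 00 41 D8 3D DE 00 decode to U+0041 followed by U+1F600, while a lone D8 3D or DC 00 is rejected.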

[ref]
http://en.wikipedia.org/wiki/UTF-16
