UTF-16, 0x00-0000 -- 0x10-FFFF, 21 bits, 1,112,064 code points (0x110000 values minus the 2,048 surrogates)
UCS-2, 0x0000 -- 0xFFFF, the older standard, 16 bits
the leading 5 bits (0x00 -- 0x10) divide the code space into 17 planes of 65,536 code points each
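As a small illustration of the plane split, the plane number is simply the code point shifted right by 16 bits. The helper name unicode_plane is hypothetical; this is a minimal sketch in C, not any library's API.

```c
#include <stdint.h>

/* Hypothetical helper: returns the plane number (0..16) of a code point,
   or -1 if cp lies outside the code space 0x000000..0x10FFFF. */
int unicode_plane(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return -1;
    return (int)(cp >> 16);   /* the bits above the low 16 select the plane */
}
```

For example, unicode_plane(0x0041) gives 0, unicode_plane(0x1F600) gives 1, and unicode_plane(0x10FFFF) gives 16.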
//=== Code points U+0000 to U+D7FF and U+E000 to U+FFFF
These code points lie in the BMP (Basic Multilingual Plane, plane 0).
Most commonly used characters are located here.
//=== Code points U+010000 to U+10FFFF
A code point in this range is encoded as two 16-bit numbers:
a lead / leading surrogate and a trail / trailing surrogate.
Let cp = code point
lead surrogate = (((cp - 0x010000) >> 10) & 0x03FF) + 0xD800; //the top 10 bits
trail surrogate = ((cp - 0x010000) & 0x03FF) + 0xDC00; //the lower 10 bits
(in C, + binds tighter than &, so the mask must be parenthesized)
lead surrogate range [ 0xD800 , 0xDBFF ]
trail surrogate range [ 0xDC00 , 0xDFFF ]
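The two formulas above can be sketched as a C function. The name to_surrogate_pair and its return convention are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical helper: splits a supplementary code point
   (0x010000..0x10FFFF) into a surrogate pair.
   Returns 0 on success, -1 if cp is outside that range. */
int to_surrogate_pair(uint32_t cp, uint16_t *lead, uint16_t *trail)
{
    if (cp < 0x010000 || cp > 0x10FFFF)
        return -1;
    uint32_t v = cp - 0x010000;                          /* 20-bit value */
    *lead  = (uint16_t)(0xD800 + ((v >> 10) & 0x03FF));  /* top 10 bits */
    *trail = (uint16_t)(0xDC00 + (v & 0x03FF));          /* lower 10 bits */
    return 0;
}
```

Worked example: for cp = 0x1F600, cp - 0x010000 = 0xF600; the top 10 bits are 0x03D and the lower 10 bits are 0x200, so the pair is 0xD83D, 0xDE00.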
//=== Code points U+D800 to U+DFFF
reserved for surrogates; no characters are assigned to these code points
So UTF-16 is a transform
that maps [ 0x00-0000 , 0x10-FFFF ] - [ 0xD800 , 0xDFFF ]
to
one 16-bit number (2-byte char, identity transform) or two 16-bit numbers (4-byte char, surrogate pair transform)
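To make the domain of this transform concrete, here is a rough sketch that classifies a value by how many 16-bit numbers UTF-16 needs for it. The function name utf16_units_needed and the use of 0 for "not encodable" are assumptions of this sketch.

```c
#include <stdint.h>

/* Hypothetical helper: number of 16-bit numbers UTF-16 uses for cp.
   Returns 0 for values outside the domain of the transform. */
int utf16_units_needed(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;          /* surrogate range: excluded from the domain */
    if (cp <= 0xFFFF)
        return 1;          /* BMP: identity transform, 2 bytes */
    if (cp <= 0x10FFFF)
        return 2;          /* supplementary: surrogate pair, 4 bytes */
    return 0;              /* beyond 0x10FFFF */
}
```

So utf16_units_needed(0x4E2D) is 1, utf16_units_needed(0x1F600) is 2, and utf16_units_needed(0xD800) is 0.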
the inverse transform for UTF-16 encoding
pseudo code snippet
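int c = read2bytes();            // first 16-bit number
if (c < 0xD800 || c > 0xDFFF)
    return c;                    // BMP code point: identity transform
else {
    int c2 = read2bytes();       // trail surrogate
    // assume c is a lead surrogate and c2 is correctly read out
    // + binds tighter than <<, so the shift must be parenthesized
    return 0x010000 + ((c - 0xD800) << 10) + (c2 - 0xDC00);
}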
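The snippet above assumes well-formed input. A slightly more defensive version might look like the following sketch, which reads from an array instead of read2bytes() and rejects a stray trail surrogate or a truncated pair. The name utf16_decode_one and its return convention are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one code point from units[0..len).
   Stores it in *cp and returns how many 16-bit numbers were consumed
   (1 or 2), or 0 if the input is empty or ill-formed. */
size_t utf16_decode_one(const uint16_t *units, size_t len, uint32_t *cp)
{
    if (len == 0)
        return 0;
    uint32_t c = units[0];
    if (c < 0xD800 || c > 0xDFFF) {
        *cp = c;                       /* BMP: identity transform */
        return 1;
    }
    if (c > 0xDBFF || len < 2)
        return 0;                      /* stray trail, or truncated pair */
    uint32_t c2 = units[1];
    if (c2 < 0xDC00 || c2 > 0xDFFF)
        return 0;                      /* lead not followed by a trail */
    *cp = 0x010000 + ((c - 0xD800) << 10) + (c2 - 0xDC00);
    return 2;
}
```

Given the two numbers 0xD83D, 0xDE00, this returns 2 and stores 0x1F600; given only 0xD83D it returns 0.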
[ref]
http://en.wikipedia.org/wiki/UTF-16