2005-08-22

Unicode is big enough

People tend to be skeptical that the 17 * 65536 = 1,114,112 character codes provided by Unicode will be big enough. After all, we have moved from 8-bit to 64-bit computers, both in word size and in address size; in general, most finite limits have been repeatedly shown to be insufficient. The maximum normal memory on MS-DOS-based PCs was 640K, ten times as big as the 64K limit on the 8-bit systems that preceded them: after all, as Bill Gates supposedly said back in 1981, 640K of memory ought to be enough for anybody!

In fact, though, there just aren't any huge and complicated writing systems hiding in some remote ravine. We have a pretty good map of all the writing systems on the planet; a few may have been overlooked by accident, but none of them are going to be huge. The biggest remaining ones are Egyptian hieroglyphics and ancient Chinese characters, and neither of them will require anything like a million character codes.

There are other ceilings in computing that aren't likely to be broken through either. Consider the number of different machine opcodes. Does anyone foresee computer chips with 65,536 different opcodes? How about 4,294,967,296 distinct opcodes? I don't think so.

Or consider IP version 6 network addresses. There are 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 of them. They won't be assigned densely, according to current plans, but they could be, and that would be enough IP addresses to have a few billion addresses for every soil bacterium in every square centimeter of soil on the planet. Does anybody really believe we are going to "break through" that?
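To get a feel for that number, here is a small Python sketch that recomputes the size of the address space and spreads it over the planet. The surface-area figure (about 510 million square kilometers, oceans included) is a rough assumption of mine, not something from the text above, and the variable names are purely illustrative.

```python
import ipaddress

# An IPv6 address is 128 bits wide, so there are 2**128 of them.
ipv6_addresses = 2 ** 128
print(f"{ipv6_addresses:,}")

# Cross-check against the standard library: the network ::/0 covers
# the entire IPv6 address space.
whole_space = ipaddress.ip_network("::/0").num_addresses

# Assumption: Earth's total surface is roughly 5.1e8 km^2.
# 1 km^2 = 1e10 cm^2, so that is about 5.1e18 cm^2.
earth_surface_cm2 = 510_000_000 * 10 ** 10

# Integer division keeps the arithmetic exact.
addresses_per_cm2 = ipv6_addresses // earth_surface_cm2
print(f"about {addresses_per_cm2:.1e} addresses per square centimeter")
```

Even spread over the whole surface rather than just the soil, that leaves on the order of 10^19 addresses for each square centimeter.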

1 comment:

John Cowan said...

Aristotle: thanks for the heads-up.

Anton: originally the design of Unicode only allowed for 65536 characters; when that was seen to be insufficient, 2048 codepoints were reserved to be used in pairs, thus allowing an additional 1024 * 1024 = 16 * 65536 potential characters.
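The pairing arithmetic can be sketched in a few lines of Python. The reserved ranges D800-DBFF and DC00-DFFF are the ones UTF-16 uses; the function and constant names are just illustrative choices for this sketch.

```python
HIGH_START, HIGH_END = 0xD800, 0xDBFF  # 1024 "high" (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF    # 1024 "low" (trailing) surrogates

high_count = HIGH_END - HIGH_START + 1
low_count = LOW_END - LOW_START + 1
reserved = high_count + low_count        # 2048 reserved codepoints
extra = high_count * low_count           # 1024 * 1024 pairs
total = 0x10000 + extra                  # original 65536 plus the pairs


def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a codepoint above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    offset = cp - 0x10000                # 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the codepoint it stands for."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


print(reserved, extra, total)
print([hex(u) for u in to_surrogates(0x1F600)])
```

The pairs give exactly 16 * 65536 extra codes, and adding the original 65536 yields the 17 * 65536 = 1,114,112 figure in the post.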

Each group of 65536 characters is called a "plane", and the planes are numbered from 0 to 16. Plane 0 contains most of the characters in actual living use. Plane 1 is used mostly for archaic writing systems, Plane 2 for rare Chinese characters (and Plane 3 can be used if Plane 2 fills up). Plane 14 is for special formatting and control characters, and Planes 15 and 16 are reserved for private use. The other planes will almost certainly never be used.
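Since each plane holds 65536 codepoints, the plane number is simply the codepoint divided by 65536, counting from zero. A minimal Python sketch, with plane_of as a made-up helper name:

```python
def plane_of(code_point: int) -> int:
    """Return the zero-based plane number of a Unicode codepoint."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    return code_point >> 16  # same as code_point // 65536


# A Latin letter, an emoji, a rare Chinese character, and the last codepoint.
for cp in (0x0041, 0x1F600, 0x20000, 0x10FFFF):
    print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```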