2005-06-17

Why we all hate normalization checking

Posted to the XML Core Working Group mailing list before XML 1.1 became a W3C Recommendation:

There are two kinds of people who didn't want to make XML 1.1 require normalization checking: the Lazy Document Generators and the Lazy Parser Programmers.

Lazy Document Generators want to be able to spew their random Unicode cruft straight into XML documents without worrying about what semantics it might have, and recreate said cruft at the receiver exactly as sent. The fact that the document might contain one million consecutive COMBINING CIRCUMFLEX ACCENTs bothers them not in the least. It's someone else's problem.

Lazy Parser Programmers don't want to bother to put together the necessary few lines of code, according to a well-documented algorithm, to check that documents do not contain gratuitous decompositions like LATIN SMALL LETTER A followed by COMBINING CIRCUMFLEX ACCENT, when obviously LATIN SMALL LETTER A WITH CIRCUMFLEX is what everyone has in their Latin-1 fonts and keyboards, and so is likely to expect. What do they care if their users go blind poring over hex dumps of their documents, trying to figure out where the discrepancy comes from? It's someone else's problem.
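For the curious, here is roughly what the "necessary few lines" might look like when a library already supplies the normalization tables. This is a minimal sketch in Python, not anything taken from the XML 1.1 spec or from a real parser: the function names `is_nfc` and `first_nfc_discrepancy` are made up for illustration, and the check covers only plain Unicode NFC, which is weaker than the "fully normalized" condition that XML 1.1 defines.

```python
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Unicode Normalization Form C."""
    # Hypothetical helper: normalizing an NFC string leaves it unchanged.
    return unicodedata.normalize("NFC", text) == text


def first_nfc_discrepancy(text: str) -> int:
    """Return the index of the first code point that differs from the
    NFC form of text, or -1 if text is already in NFC."""
    normalized = unicodedata.normalize("NFC", text)
    if normalized == text:
        return -1
    for i, (original, composed) in enumerate(zip(text, normalized)):
        if original != composed:
            return i
    # One string is a prefix of the other; they diverge where the shorter ends.
    return min(len(text), len(normalized))


decomposed = "a\u0302"   # LATIN SMALL LETTER A + COMBINING CIRCUMFLEX ACCENT
precomposed = "\u00e2"   # LATIN SMALL LETTER A WITH CIRCUMFLEX

print(is_nfc(decomposed))                   # the gratuitous decomposition
print(is_nfc(precomposed))                  # what everyone expects
print(first_nfc_discrepancy("ba\u0302"))    # where the trouble starts
```

A checker along these lines only reports the discrepancy; it deliberately does not rewrite the document, which is the distinction between checking normalization and normalizing.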

LDGs don't want it to be the case that "it is an error" (not necessarily detected) for a document to violate normalization. LPPs don't want to require parsers to check normalization at user option, since then they have to write the code even if it is not used much of the time. The Core WG will have to decide whether to p*ss off one group, both, or neither.

Of course, there are also XML 1.0 Forever types, who sit on xml-dev and chant "No Change! No Change!". X1Fs demand to be paid only in paper money.

This discharges my action.

The upshot was that XML 1.1 parsers SHOULD offer normalization checking as a user option, but are not required to.
