Sam is talking about using HTTP. He references Tim Bray's WS-Pagecount article on complexity. The assumption is that HTTP is simple. It's not. There are lots of things to watch out for.

Starting with identity: when is "a" an "a"? When the Unicode matches? Everyone pretty much agrees that "A" is x41. That's fine as far as it goes, but it is an "attractive nuisance": some things look the same but are different. A and Alpha look the same but are not the same (they are different Unicode code points). There are four code points for "i" in Unicode, with different semantics.
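To make the A versus Alpha point concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `describe` is made up for illustration, and reading the four "i" characters as the Latin and Turkish dotted and dotless forms is an assumption about what the talk meant.

```python
import unicodedata

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

latin_a = "A"            # U+0041
greek_alpha = "\u0391"   # renders like "A" in most fonts

print(describe(latin_a))
print(describe(greek_alpha))
print(latin_a == greek_alpha)  # False: same look, different code points

# Assumption: the four "i" characters are the Latin/Turkish dotted and
# dotless forms in lower and upper case.
for ch in ("i", "I", "\u0131", "\u0130"):
    print(describe(ch))
```

The comparison is on code points, so two strings that render identically can still compare unequal.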

If you allow Unicode in domain names, people can't distinguish them visually. Someone could use the Cyrillic "a", for example, to register the name "" and buy a certificate for it. Two glyphs may look the same and be the same code point or different code points. Each code point has multiple encodings.
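A rough Python sketch of both halves of that paragraph: a look-alike host name, and one code point with several byte encodings. The host `example.com` and its spoofed twin are hypothetical stand-ins, not the name used in the talk.

```python
# Hypothetical host names: the second uses Cyrillic "а" (U+0430) in place
# of Latin "a" (U+0061). In most fonts they look identical.
real_host = "example.com"
spoof_host = "ex\u0430mple.com"

print(real_host == spoof_host)  # False

# On the wire the spoofed label travels in its ASCII ("xn--") form,
# produced here by Python's built-in idna codec.
print(spoof_host.encode("idna"))

# One code point, several byte encodings.
cyrillic_a = "\u0430"
for codec in ("utf-8", "utf-16-be", "koi8-r", "cp1251"):
    print(codec, cyrillic_a.encode(codec).hex())
```

So there are two layers of ambiguity: which code point a glyph stands for, and which bytes stand for that code point.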

So, how do HTML, XML, and MS encode things by default? HTML: iso-8859-1, XML: utf-8, and MS: win-1252. This can create real garbage. You can't, for example, cut and paste URLs from HTML into RSS. Converters can't work well (as opposed to merely don't work well). Most Web clients are on Windows. Most Web servers don't indicate encodings.
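The "real garbage" is easy to reproduce. This small Python sketch decodes bytes with the wrong one of these three encodings. The sample strings are my own, not from the talk.

```python
# UTF-8 bytes read as iso-8859-1: each non-ASCII character turns into two.
utf8_bytes = "\u00e9".encode("utf-8")       # "é" becomes b"\xc3\xa9"
garbled = utf8_bytes.decode("iso-8859-1")   # "Ã©"
print(garbled)

# The same byte means different things in win-1252 and iso-8859-1.
smart_quote_byte = b"\x93"
as_cp1252 = smart_quote_byte.decode("cp1252")       # left curly quote
as_latin1 = smart_quote_byte.decode("iso-8859-1")   # invisible C1 control
print(repr(as_cp1252), repr(as_latin1))
```

A converter handed unlabeled bytes has no reliable way to know which of these readings was intended.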

So, here's some more. Is case important in domain names? Do relative paths make a difference? Null fragments and queries? Unicode encodings? The specs are silent. The CLR (C#) returns true for all of these (it treats them as equal). Java doesn't return true for all of them. XML Namespaces requires each of these to be considered distinct.
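To illustrate the two camps, here is a hedged Python sketch that compares the same pair of URIs character by character (the XML Namespaces view) and after a loose normalization. The names `normalize` and `remove_dot_segments` and the sample URIs are hypothetical, and the normalization is a simplification, not the CLR's or Java's actual algorithm.

```python
from urllib.parse import urlsplit, urlunsplit

def remove_dot_segments(path):
    """Collapse "." and ".." segments (simplified; ignores edge cases)."""
    out = []
    for seg in path.split("/"):
        if seg == ".":
            continue
        if seg == "..":
            if len(out) > 1:
                out.pop()
            continue
        out.append(seg)
    return "/".join(out)

def normalize(uri):
    """Lowercase scheme and authority, resolve dots, drop empty ?/#."""
    parts = urlsplit(uri)
    # Simplification: lowercases the whole authority, including userinfo.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       remove_dot_segments(parts.path),
                       parts.query, parts.fragment))

a = "http://Example.com/a/../b?#"
b = "http://example.com/b"

print(a == b)                        # False: character-by-character
print(normalize(a) == normalize(b))  # True: after loose normalization
```

Percent-encoding differences are deliberately left out here; they would add yet another way for two "equal" URIs to differ.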

Yahoo! Search Web Services, for example, was accepting iso-8859-1 as input and outputting utf-8.

Ruby's postulate: The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata. But the HTTP spec doesn't agree. According to the spec, the further away the encoding specification is from the data, the more you should consider it correct. That is, the encoding the web server specifies overrides the encoding specified in the document itself.
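The precedence rule described above can be sketched in a few lines of Python. The function name `effective_encoding`, the regular expression, and the utf-8 fallback are illustrative assumptions, not text from any spec.

```python
import re
from email.message import Message

# Matches an encoding pseudo-attribute in an XML declaration.
XML_DECL = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']')

def effective_encoding(content_type, body):
    """Pick the encoding, letting the HTTP header win over the document."""
    msg = Message()
    msg["Content-Type"] = content_type
    header_charset = msg.get_content_charset()  # None if no charset given
    if header_charset:
        return header_charset
    match = XML_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"  # assumed fallback for this sketch

doc = b'<?xml version="1.0" encoding="windows-1252"?><feed/>'
print(effective_encoding("application/xml; charset=iso-8859-1", doc))
print(effective_encoding("application/xml", doc))
```

The first call ignores what the document says about itself, which is exactly the behavior Ruby's postulate argues against.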

Escaping in XML is broken. You can't look at a random string and tell if it's escaped. This trips up seasoned professionals every day.
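A short Python sketch of why a string alone can't tell you its escaping level, using `escape` and `unescape` from the standard `xml.sax.saxutils` module. The sample string is my own.

```python
from xml.sax.saxutils import escape, unescape

original = "<b>"
once = escape(original)   # "&lt;b&gt;"
twice = escape(once)      # "&amp;lt;b&amp;gt;"
print(once, twice)

# Given only the string below, a consumer cannot know which case applies:
#   - it is escaped markup, and the real content is "<b>", or
#   - it is already the real content (someone writing about escaping).
received = "&lt;b&gt;"
print(unescape(received))  # "<b>" if it was escaped
print(received)            # "&lt;b&gt;" if it was not
```

Unescaping one time too many or one time too few both produce plausible-looking output, which is why the mistake is so easy to miss.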

Comparing characters and URIs is surprisingly more difficult and more important than you might imagine (e.g., security holes). The specs create the confusion rather than solving it. Here are Sam's slides.
