Beware of XHTML

If you're a web developer, you've probably worked a lot with XHTML, the markup language developed in 1999 to implement HTML as an XML format. Most people who use and promote XHTML do so because they think it's the “next version” of HTML, and they may have heard of some benefits here and there. But there is a lot more to it than you may realize, and if you're using it on your website, even if it validates, you are probably using it incorrectly.

I believe that XHTML has many good potential applications, and I hope it continues to thrive as a standard. This is precisely why I have written this article. The state of XHTML on the Web today is more broken than the state of HTML, and most people don't realize it because the major browsers are using classic HTML parsers that hide the problems. Even among the few sites that know how to trigger the XML parser, the authors tend to overlook some important issues. If you really hope for the XHTML standard to succeed, you should read this article carefully.

Table of Contents

  1. What is XHTML?
  2. Myths of XHTML
  3. Benefits of XML
  4. Content type is everything
  5. HTML compatibility guidelines
  6. Internet Explorer incompatibility
  7. Content negotiation
  8. Null End Tags (NET)
  9. Firefox and other problems
  10. Conclusion
  11. Quotes
  12. List of standards-related sites that break as XHTML
  13. List of standards-related sites that stick with HTML
  14. Related sites
  15. See also

What is XHTML?

XHTML is a markup language that was originally expected to replace HTML on the Web someday. For the most part, an XHTML 1.0 document differs from an HTML 4.01 document only in the lexical and syntactic rules: HTML is written in its own unique syntax defined by SGML, while XHTML is written in a different SGML-defined syntax called XML. The syntaxes differ in some of the characters that delimit tags and other constructs, whether or not certain types of shorthand markup may be used, and whether or not tag names or character entities are case sensitive, among other small differences.

The Document Type Definition (DTD, which is referenced by the doctype declaration) then defines which elements, attributes, and character entities exist in the language and where those elements may be placed. The DTDs of XHTML 1.0 and HTML 4.01 are nearly identical, meaning that as far as things like elements and attributes go, XHTML 1.0 and HTML 4.01 are basically the same language. The only added benefit of XHTML is that it's written in XML and shares the benefits XML has over HTML's syntax. I'll explain those benefits later in this article, but first I'd like to debunk some of the false benefits you may have heard.

Myths of XHTML

There are many false benefits of XHTML promoted on the Web. Let's clear up some of them at a glance (with details and other pitfalls provided later):

Benefits of XML

XML does have a number of improvements over HTML's syntax:

Content type is everything

When your website sends a document to the visitor's browser, it adds on a special content type header that lets the browser know what kind of document it's dealing with. For example, a PNG image has the content type image/png and a CSS file has the content type text/css. HTML documents have the content type text/html. Web servers typically send this content type whenever the file extension is .html, and server-side scripting languages like PHP also typically send documents as text/html by default.

XHTML does not have the same content type as HTML. The proper content type for XHTML is application/xhtml+xml. Currently, many web servers don't have this content type reserved for any file extension, so you would need to modify the server configuration files or use a server-side scripting language to send the header manually. Simply specifying the content type in a meta element will not work over HTTP.
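
As a rough illustration of sending the header from a server-side script, here is a minimal sketch using Python's standard WSGI tools. The page content and the names `PAGE`, `app`, and `call_app` are assumptions made up for this example; your own server setup will differ.

```python
from wsgiref.util import setup_testing_defaults

XHTML_TYPE = "application/xhtml+xml"

# Hypothetical page used only for illustration.
PAGE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Example</title></head>"
    "<body><p>Hello</p></body></html>"
).encode("utf-8")


def app(environ, start_response):
    """WSGI application that sends the page with the XHTML content type."""
    headers = [
        ("Content-Type", XHTML_TYPE + "; charset=utf-8"),
        ("Content-Length", str(len(PAGE))),
    ]
    start_response("200 OK", headers)
    return [PAGE]


def call_app():
    """Invoke the app once, without a network socket, and return what it sent."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body
```

To try it against a real browser, the same `app` could be passed to `wsgiref.simple_server.make_server("", 8000, app)` and served with `serve_forever()`. The header is set by the server in the HTTP response, which is why a meta element inside the document cannot substitute for it.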

When a web browser sees the text/html content type, regardless of what the doctype says, it automatically assumes that it's dealing with plain old HTML. Therefore, rather than using the XML parsing engine, it treats the document like tag soup, expecting HTML content. Because HTML 4.01 and simple XHTML 1.0 are often very similar, the browser can still understand the page fairly well. Most major browsers consider things like the self-closing portion of a tag (as in <br />) as a simple HTML error and strip it out, usually ending up with the HTML equivalent of what the author intended.

However, when the document is treated like HTML, you get none of the benefits XHTML offers. The browser won't understand other XML formats like MathML and SVG that are included in the document, and it won't do the automatic well-formedness checking that XML parsers do. In order for the document to be treated properly, the server would need to send the application/xhtml+xml content type.
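
The difference in strictness can be sketched with Python's standard library parsers. This is only an approximation: `html.parser` is not a browser's tag-soup parser, and the fragment and helper names below are hypothetical. It does show the general pattern, where an XML parser rejects a document that is not well-formed while an HTML parser carries on.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment with an unclosed <b> element.
BROKEN = "<p>Unclosed <b>bold text</p>"


class TagCollector(HTMLParser):
    """Records start tags; the HTML parser does not reject malformed input."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(source):
    """Return True if the source is well-formed XML, False otherwise."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


def tags_seen_as_html(source):
    """Return the start tags a lenient HTML parser reports for the source."""
    collector = TagCollector()
    collector.feed(source)
    collector.close()
    return collector.tags
```

With these helpers, `parses_as_xml(BROKEN)` is false because the XML parser stops at the mismatched end tag, while `tags_seen_as_html(BROKEN)` still reports both the `p` and `b` elements.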

The problems go deeper. Comment markers are sometimes handled differently depending on the content type, and when you enclose the contents of a script or style element with basic SGML-style comments, it will cause your script and style information to be completely ignored when the document is treated like XML. Also, any special markup characters used in the inline contents of a style or script element will be parsed as markup instead of being treated as character data like in HTML. To solve these problems, you must use an elaborate escape sequence described in the article Escaping Style and Script Data, and even then there are situations in which it won't work.
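
Both script problems can be reproduced with a generic XML parser. The sketch below uses Python's `xml.etree.ElementTree` as a stand-in for a browser's XML parser, which is an assumption; the two sample documents and the helper names are invented for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical document: script contents hidden inside an SGML-style comment.
COMMENTED = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title>'
    "<script><!--\nvar ready = true;\n//--></script>"
    "</head><body/></html>"
)

# Hypothetical document: unescaped markup characters inside a script.
UNESCAPED = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title>'
    "<script>if (a < b && c) { run(); }</script>"
    "</head><body/></html>"
)


def script_text(source):
    """Return the character data the XML parser sees in the first script element."""
    root = ET.fromstring(source)
    script = root.find(".//" + XHTML_NS + "script")
    # Comments are not character data, so a fully commented script is empty.
    return (script.text or "").strip()


def is_well_formed(source):
    """Return True if the XML parser accepts the source."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True
```

Here `script_text(COMMENTED)` is an empty string, because the whole script sits inside a real XML comment, and `is_well_formed(UNESCAPED)` is false, because the `<` and `&` characters are read as markup.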

Furthermore, the CSS and DOM specifications have special provisions for HTML that don't apply to XHTML when it's treated as XML, so your page may look and behave in unexpected ways. The most common problem is a white gap around your page if you have a background on the body, no background on the html element, and any kind of spacing between the elements, such as a margin, padding, or a body height under 100% (browsers typically have some combination of these by default). In scripting, tag names are returned differently and document.write() doesn't work in XHTML treated as XML. Table structure in the DOM is different between the two parsing modes. These are only a select few of the many differences.

The following are some examples of differing behavior between XHTML treated as HTML and XHTML treated as XML. The anticipated results are based on the way Internet Explorer, Firefox, and Opera treat XHTML served as HTML. Some other browsers are known to behave differently. Also note that Internet Explorer doesn't recognize the application/xhtml+xml content type (see below for an explanation), so it will not be able to view the examples in the second column.

Differences in XHTML handling

text/html     application/xhtml+xml
Example 1     Example 1
Example 2     Example 2
Example 3     Example 3
Example 4     Example 4
Example 5     Example 5
Example 6     Example 6
Example 7     Example 7
Example 8     Example 8
Example 9     Example 9

HTML compatibility guidelines

When the XHTML 1.0 specification was first written, there were provisions that allowed an XHTML document to be sent as text/html as long as certain compatibility guidelines were followed. The idea was to ease migration to the new format without breaking old user agents. However, these provisions are now viewed by many as a mistake. The whole point of XHTML is to be an XML alternative to HTML, yet due to the allowance of XHTML documents to be sent as text/html, most so-called XHTML documents on the Web today would break if they were treated like XML (see the real-world examples below). This even includes many valid XHTML documents. Several prominent members of the W3C are now challenging the wisdom of the text/html provisions and advocating that this content type should never be allowed for XHTML.

Many authors incorrectly believe that following the HTML compatibility guidelines and validating the document will guarantee that the document is compatible with both the HTML and XHTML specifications. In reality, if you use even a single self-closing tag in the document (which includes any link, img, or br tag), you are already creating incompatibilities between the two specifications. The reason for this particular issue is explained below. In this article, I have already explained a number of other factors not covered in XHTML 1.0 Appendix C that will also cause the document to run into incompatibilities. The truth is that the HTML compatibility guidelines do not actually provide true compatibility between HTML and XHTML; they merely attempt to minimize the damage of using text/html for XHTML documents, and that damage control is very limited in effectiveness.

XHTML 1.x already makes no provision for the use of text/html when taking advantage of any XHTML features not present in HTML, and the current draft of XHTML 2 expressly forbids it.

Internet Explorer incompatibility

Internet Explorer does not support XHTML. Like other web browsers, when a document is sent as text/html, it treats the document as if it were a poorly constructed HTML document. However, when the document is sent as application/xhtml+xml, Internet Explorer won't recognize it as a webpage; instead, it will simply present the user with a download dialog. This issue still exists in Internet Explorer 7.

Although all other major web browsers, including Firefox, Opera, Safari, and Konqueror, support XHTML, the lack of support in Internet Explorer, as well as in major search engines and web applications, strongly discourages its use.

Content negotiation

Content negotiation is the idea of sending different content depending on what the user agent supports. Many sites attempt to send XHTML as application/xhtml+xml to those who support it, and either XHTML as text/html or real HTML to those who don't.

Two methods are generally used to determine what the user agent supports, both based on the Accept HTTP header. Most often, sites use the incorrect method of simply looking for the string “application/xhtml+xml” in the header value. Some sites use the correct method of actually parsing the header value, supporting wildcards and ordering by q value.
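
The two approaches can be sketched as follows. This is a simplified illustration under stated assumptions, not a complete HTTP implementation: the function names are hypothetical, media-range parameters other than q are ignored, and a tie in q value is resolved in favor of XHTML, which is a choice made for this example only.

```python
def naive_accepts_xhtml(accept):
    """The substring check: looks only for the literal type name."""
    return "application/xhtml+xml" in accept


def parse_accept(accept):
    """Parse an Accept header value into a list of (media_range, q) pairs."""
    ranges = []
    for part in accept.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when it is not given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat an unparsable q as "not acceptable"
        ranges.append((media_range, q))
    return ranges


def quality(media_type, ranges):
    """Return the q value of the most specific range matching media_type."""
    main_type = media_type.split("/")[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in ranges:
        if media_range == media_type:
            specificity = 2
        elif media_range == main_type + "/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q


def prefers_xhtml(accept):
    """True if XHTML is acceptable and rated at least as high as text/html."""
    ranges = parse_accept(accept)
    xhtml_q = quality("application/xhtml+xml", ranges)
    html_q = quality("text/html", ranges)
    return xhtml_q > 0 and xhtml_q >= html_q
```

For a header consisting only of the wildcard `*/*`, the substring check answers no while the parsed check answers yes, since the wildcard matches both types equally. For an illustrative value such as `text/html, application/xhtml+xml;q=0.5`, the substring check answers yes even though the q values say the user agent prefers text/html.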

Unfortunately, neither of these methods works reliably.

The first method doesn't work because not all XHTML-supporting user agents actually have the text “application/xhtml+xml” in the Accept header. Safari and Konqueror are two such browsers. The application/xhtml+xml content type is implied by a wildcard value instead. Meanwhile, not all HTML-supporting user agents have “text/html” in the header. Internet Explorer, for example, doesn't mention this content type. Like Safari and Konqueror, it implies this support by using a wildcard. Even among those user agents that support XHTML and mention application/xhtml+xml in the header, it may have a lower q value than text/html (or a matching wildcard), which implies that the user agent actually prefers text/html (in other words, its XHTML support may be experimental or broken).

The second method (the correct, 100% standards-compliant one) doesn't work because most major browsers have inaccurate Accept headers:

As disappointing as it may be, content negotiation simply isn't a reliable approach to this problem.

Null End Tags (NET)

In XHTML, all elements are required to be closed, either by an end tag or by adding a slash to the start tag to make it self-closing. Since giving empty elements like img or br an end tag would confuse browsers treating the page like HTML, self-closing tags tend to be promoted. However, XML self-closing tags directly conflict with a little-known and poorly supported HTML/SGML feature: Null End Tags.

A Null End Tag is a special shorthand form of a tag that allows you to save a few characters in the document. Instead of writing <title>My page</title>, you could simply write <title/My page/ to accomplish the same thing. Due to the rules of Null End Tags, a single slash in an empty element's start tag would close the tag right then and there, meaning <br/ is a complete and valid tag in HTML. As a result, if you have <br/> or <br />, a browser supporting Null End Tags would see that as a br element immediately followed by a simple > character. Therefore, an XHTML page treated as HTML could be littered with unwanted > characters.
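
To make the effect concrete, here is a toy simulation of how a user agent supporting Null End Tags would read XML-style empty tags. It is a regular-expression sketch, not an SGML parser: it only handles the empty elements of HTML 4.01, and the names `EMPTY_ELEMENTS`, `SELF_CLOSING`, and `as_seen_with_net` are made up for this example.

```python
import re

# Elements declared EMPTY in the HTML 4.01 DTD.
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}

# Matches an XML-style self-closing tag such as <br /> or <img src="x" />.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)([^<>]*?)\s*/>")


def as_seen_with_net(markup):
    """Rewrite XML-style empty tags the way a NET-aware parser would read them.

    For an empty element, the slash ends the tag, so the following ">" is
    left over as character data (written here as the entity &gt;).
    """
    def replace(match):
        name, attrs = match.group(1), match.group(2)
        if name.lower() not in EMPTY_ELEMENTS:
            return match.group(0)  # non-empty elements are outside this sketch
        return "<" + name + attrs + ">" + "&gt;"

    return SELF_CLOSING.sub(replace, markup)
```

Calling `as_seen_with_net("one<br />two")` gives `one<br>&gt;two`: the br element is followed by a literal > character in the text.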

This problem is often overlooked because most popular browsers today lack support for Null End Tags, as well as some other SGML shorthand features. However, there are still some smaller user agents that properly support Null End Tags. One of the more well-known user agents that supports them is the W3C Validator. If you send it a page that uses XHTML self-closing tags, but force it to parse the page as HTML/SGML like most user agents do for text/html pages, you can see the results in the outline: immediately after each of the self-closing elements, there is an unwanted > character that will be displayed on the page itself.

(It should be noted that the W3C Validator is unusual in that it generally determines the parsing mode from the doctype, rather than from the content type as most other user agents do. Therefore, an HTML doctype was used in the above example just so the validator would attempt to parse the page using the HTML syntax as all major browsers will for text/html pages regardless of the doctype. The Null End Tag rules are actually set in the SGML syntax definition, not the DTD, so this example is accurate to what you should expect in a fully compliant SGML user agent even with an XHTML doctype.)

Technically, a restricted and altered form of Null End Tags exists in XML and is frequently used: the self-closing portion of the start tag. While Null End Tags are defined as / ... / in HTML's syntax, they are specially defined as / ... > in XML with the added restriction that it must close immediately after it is opened, meaning the element must have no content. This was designed to look similar to a regular start tag for web developers who are unfamiliar with typical Null End Tags. However, in the process it creates inherent incompatibility with HTML's syntax for all empty elements.

In summary, although this issue doesn't show in most popular web browsers, a user agent that more fully supports SGML would see unwanted > characters all over XHTML pages that are sent with the text/html content type. If the goal of using XHTML is to help promote standards, then it's quite counterproductive to cause unnecessary problems for user agents that more correctly comply with the SGML standard.

Firefox and other problems

Although Firefox supports the parsing of XHTML documents as XML when sent with the application/xhtml+xml content type, its performance in versions 2.0 and below is actually worse than with HTML. When parsing a page as HTML, Firefox will begin displaying the page while the content is being downloaded. This is called incremental rendering. However, when it's parsing XML content, Firefox 2.0 and below will wait until the entire page is downloaded and checked for well-formedness before any of the content is displayed. This means that, although in theory XML is supposed to be faster to parse than HTML, in reality these versions of Firefox usually display HTML content to the user much faster than XHTML/XML content. Thankfully, this issue is expected to be resolved in Firefox 3.0.

However, there are also issues in other browsers, such as certain HTML-specific provisions in the CSS and DOM standards being mistakenly applied to XHTML content parsed as XML. For example, if there is a background set on the body element and none on the html element, Opera will apply the background to the html element as it would in HTML. So even when dealing exclusively with XHTML parsed as XML, you still run into a number of the same problems that you do when trying to serve XHTML either way.

All in all, true XHTML support in major user agents is still very weak. Because a key user agent — namely, Internet Explorer — has made no visible effort to support XHTML, other major user agents have continued to see it as a relatively low priority and so these bugs have lingered. HTML is recommended over XHTML by both Mozilla and Safari and is generally better supported than XHTML by all major browsers.

Conclusion

XHTML is a very good thing, and I certainly hope to see it gain widespread acceptance in the future. However, it simply isn't widely supported in its proper form. XHTML is an XML format, and to force a web browser to treat it like HTML is going against the whole purpose of XHTML and also inevitably causes other complications. Assuming you don't want to dramatically limit access to your information, XHTML can only be used incorrectly, be interpreted as invalid markup by most user agents, cause unwanted results in others, and offer no added benefit over HTML. HTML 4.01 Strict is still what most user agents and search engines are most accustomed to, and there's absolutely nothing wrong with using it if you don't need the added benefits of XML. HTML 4.01 is still a W3C Recommendation, and the W3C has even announced plans to further develop HTML alongside XHTML in the future.

Quotes

List of standards-related sites that break as XHTML

The following are just a few of the countless sites that use an XHTML doctype but, as of this writing, completely fail to load or otherwise work improperly when parsed as XML, thus missing the whole point of XHTML. The authors of most of these sites are quite prominent in the web standards community (many are involved in the Web Standards Project, WaSP), yet they have still fallen victim to the pitfalls of current use of XHTML. In fact, I have found that nearly all XHTML websites owned by WaSP members have problems when parsed as XML.

You could consider this a “shame list” of sorts. These are the same people who are supposed to be teaching others how to use web standards properly, yet they have written markup that basically depends on browsers treating it incorrectly. But the main point of this list isn't to pick on individuals; it's to reinforce the fact that even so-called experts at web standards have trouble juggling the different ways XHTML will inevitably be handled on the Web. And what benefit does it bring? None of the following sites make use of anything XHTML offers over HTML.

The following “View as application/xhtml+xml” links allow you to see how the pages would look when sent with the proper XHTML content type. This script adds a base element so that relative URLs work properly, but no other modifications are made to the markup. Alternatively, you can test the original unaltered page's XHTML rendering in Firefox using the Force Content-type extension and setting the new content-type to application/xhtml+xml.

These links were last checked 2007-09-23.

Accessify - WaSP Steering Committee, Accessibility Task Force
    Displayed as generic XML, not interpreted as XHTML. The XML namespace was omitted.
    View as application/xhtml+xml

A List Apart - various WaSP members
    Brown gap below A List Apart logo (site relies on “Almost Standards Mode”, a mode in between Quirks Mode and Standards Mode that only exists for Transitional documents parsed as HTML). Ad and “Job Board” section don't appear, because they rely on an HTML-specific DOM method.
    View as application/xhtml+xml

all in the <head> - WaSP Steering Committee
    Page doesn't load. Not well-formed. (Note: this page is valid according to the XHTML DTD and XML's SGML-defined syntax, but XML has additional well-formedness rules that this page breaks, observed in the Textpattern and the Technorati Link Count Widget post. A similar test case is available.)
    View as application/xhtml+xml

CSS Zen Garden - WaSP
    Top background doesn't display. The page relies on HTML-specific background behavior. Numerous designs have errors with a similar cause.
    View as application/xhtml+xml

dean.edwards.name/weblog/ - WaSP DOM Scripting Task Force, Microsoft Task Force
    For browsers that support behavior binding (including Firefox) for the dynamic syntax highlighting of the code snippets, most of the code boxes fail to load the contents, resulting in many empty boxes where code snippets should be, or the code appears without syntax highlighting.
    View as application/xhtml+xml

dog or higher
    Page doesn't load. Not well-formed.
    View as application/xhtml+xml

Elly Thompson's Weblog
    Page doesn't load. Not well-formed.
    View as application/xhtml+xml

holly marie - WaSP Steering Committee
    Thick white gap at top and bottom of the page. This page relies on HTML-specific background behavior.
    View as application/xhtml+xml

Jeffrey Veen - WaSP emeritus
    Page doesn't load. Not well-formed.
    View as application/xhtml+xml

Meriblog
    Page doesn't load. Not well-formed.
    View as application/xhtml+xml

mezzoblue - WaSP
    Displayed as generic XML, not interpreted as XHTML. The XML namespace was omitted. Also, individual post pages don't load. Not well-formed.
    View as application/xhtml+xml

molly.com - WaSP Group Lead
    Page doesn't load. Not well-formed. This error is in a post that's currently on the front page, so the front page will be viewable again once that post drops off. Aside from that error, Flickr and “Elsewhere” sections fail to load because the script contents are commented out or rely on HTML-specific DOM methods.
    View as application/xhtml+xml

Off the Top - WaSP Steering Committee
    Page doesn't load. Not well-formed.
    View as application/xhtml+xml

Position Is Everything
    Page doesn't load. Not well-formed.
    View as application/xhtml+xml
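
Two of the failure types in the list above, pages that are not well-formed and pages that omit the XML namespace, can be detected without a browser. The following is a minimal sketch using Python's `xml.etree.ElementTree`; the function name `classify` and its return strings are assumptions for illustration. Note that this parser does not load the XHTML DTD, so it would also report named entities such as `&nbsp;` as errors, which a browser's XML parser may handle differently.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def classify(source):
    """Classify a document roughly as an XML user agent would have to treat it."""
    try:
        root = ET.fromstring(source)
    except ET.ParseError as error:
        return "not well-formed: " + str(error)
    # ElementTree writes namespaced tags as "{namespace}localname".
    if root.tag != "{" + XHTML_NS + "}html":
        return "generic XML (root element is not in the XHTML namespace)"
    return "XHTML"
```

A document whose root is `<html xmlns="http://www.w3.org/1999/xhtml">` is classified as XHTML, the same document with a bare `<html>` root is classified as generic XML, and a document with a mismatched end tag is reported as not well-formed.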

List of standards-related sites that stick with HTML

The following are some significant sites relevant to web standards that continue to use HTML rather than XHTML.

See also
