Validity and well-formedness

There is a lot of confusion among web developers regarding the terms “valid” and “well-formed”. Just because an XHTML document is well-formed doesn't mean it's valid, and despite popular assumption, just because an XHTML document is valid (according to the classic definition used by the current W3C Validator) doesn't mean it's well-formed. An XHTML document which is valid may still have critical unnoticed markup errors.

Update: Since the initial publication of this article, the W3C has developed a new version of the validator which uses an XML parser for XHTML and other XML documents. Thus, if it encounters a well-formedness error, it now fails validation. This article still holds true for generic SGML parsers.

Introduction to validity
Introduction to well-formedness
User agent requirements with XML
Regarding XHTML
Validity is not well-formedness
XML's definition of “validity”
See also

Introduction to validity

SGML documents, including XML documents, should be associated with a DTD, usually through a doctype declaration at the top of the document. The DTD tells the user agent which elements and attributes may appear in the document and where they may appear. If an element occurs where the DTD does not allow it, or carries an attribute the DTD does not declare, the result is a “validation error”. A “valid” document is one which conforms to the rules specified in the DTD, as well as the basic SGML parsing rules.

When you run a web page through the W3C HTML Validator, the validator is checking for validity.

Introduction to well-formedness

XML was designed to have an extra set of rules called “well-formedness” rules. Well-formedness has nothing to do with which elements and attributes appear in the document. Instead, it is the basic syntax which all XML documents must follow. It deals with the individual characters which delimit tags, attributes, processing instructions, marked sections (CDATA sections, in XML terms), character data, etc.

User agent requirements with XML

When a user agent (such as a web browser) parses an XML document, it is supposed to check for well-formedness. If it comes across any well-formedness error, the user agent immediately quits trying to parse the page, and it will sometimes display a parse error message instead.

Although user agents are supposed to check for well-formedness, they are not required to check for validity, and web browsers usually don't.
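
The fatal-error behaviour described in this section can be observed with any non-validating XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name check_well_formed and the two sample strings are made up for illustration and stand in for whatever a real user agent does internally.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text):
    """Return (True, None) if text parses as XML, else (False, (line, column))."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The parser stops at the first well-formedness error it meets;
        # err.position is the (line, column) where it gave up.
        return False, err.position
    return True, None

good = "<p>Hello, <em>world</em>.</p>"
bad = "<p>Hello, <em>world</p>"  # <em> is never closed

print(check_well_formed(good))
print(check_well_formed(bad))
```

Note that the parser makes no attempt to recover from the error in the second string, and that it never consults a DTD: it only answers the well-formedness question.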

Regarding XHTML

XHTML was designed to be an XML version of HTML. However, for various reasons discussed in the Beware of XHTML article, most XHTML pages on the Web are not parsed as XML by most popular web browsers. Instead, they usually treat the page as if it were simply HTML with some odd unrecognized / characters and attributes here and there. This means that browsers won't check the XHTML page for well-formedness and, as the author, you can't expect the browser to give any indication of whether or not the page is well-formed.

Now, just because most popular browsers usually won't treat the page as XML, that doesn't mean nothing will. XHTML is supposed to be parsable as XML, and you should expect user agents to try to parse it as such (and many will). That means you have to make sure the document is well-formed.
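
The difference between the two parsing modes can be sketched with Python's standard library, using html.parser.HTMLParser as a rough stand-in for a forgiving HTML parser and xml.etree.ElementTree as a strict XML parser. Neither is what a browser actually uses, and the TagCollector class, the parses_as_xml helper and the sample page are assumptions made for this example only.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class TagCollector(HTMLParser):
    """Records every start tag it sees and never reports an error."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def parses_as_xml(text):
    """Return True only if text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Hypothetical page fragment: a bare ampersand and two unclosed <p> elements.
page = '<div><p>Fish & chips<br /><p>Second paragraph</div>'

collector = TagCollector()
collector.feed(page)
collector.close()

print(collector.tags)       # the HTML-style parser carries on regardless
print(parses_as_xml(page))  # the XML parser rejects the same text
```

The HTML-style parser gives the author no hint that anything is wrong, which is why a page can look fine in a browser for years and still fail the first time something parses it as XML.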

Validity is not well-formedness

Many people assume that they may simply run their XHTML page through the W3C HTML Validator and determine whether or not it is properly written. This is a false assumption. The W3C HTML Validator checks for validity from an SGML point of view, but there are some well-formedness rules which it does not check.

With a basic understanding of validity and well-formedness, it is easy to see why a document may be well-formed yet not valid. If, for example, you write an XHTML document with a misspelled tag name, that document may still be parsable as XML (in other words, it may still be well-formed), but it will obviously be invalid because your spelling of the tag name isn't mentioned in the DTD.
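
As a small illustration of the misspelled-tag case, the sketch below parses a document containing a misspelled element name with Python's xml.etree.ElementTree. The parser raises no error because the document is well-formed. Python's standard library does not validate against a DTD, so the KNOWN_ELEMENTS set is a hypothetical and very crude stand-in for one: a real validator also checks where elements may appear and which attributes they may carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: just a handful of XHTML element names.
KNOWN_ELEMENTS = {"html", "head", "title", "body", "p", "em", "strong"}

doc = """<html>
  <head><title>Demo</title></head>
  <body><p>This is <stronng>important</stronng>.</p></body>
</html>"""

# No exception is raised here: every tag is closed and properly nested,
# so the document is well-formed even though <stronng> is not a real element.
root = ET.fromstring(doc)

# Collect element names that the stand-in "DTD" does not know about.
unknown = sorted({element.tag for element in root.iter()} - KNOWN_ELEMENTS)
print(unknown)
```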

But what a lot of people don't realize is that it's possible for a document to be valid yet not well-formed. This issue comes up a lot more often than you may think, and is often completely overlooked until someone stumbles upon the fringe cases where something tries to parse the page as XML.

The following are examples of XHTML documents which are perfectly valid from an SGML point of view but not well-formed. Even though the W3C HTML Validator gives these pages a green light, they will completely fail to load in any XML parser.

Notes: Most XHTML pages on the Web today are parsed as HTML most of the time, as explained in the Beware of XHTML article. Also, because Internet Explorer doesn't yet support XHTML parsed as XML, you'll have to use a different browser to see the problems.

Parsed as: HTML, XML (validate)
Parsed as: HTML, XML (validate)
Parsed as: HTML, XML (validate)
Parsed as: HTML, XML (validate)
Parsed as: HTML, XML (validate)
Parsed as: HTML, XML (validate)
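
To give a feel for this category of error, the sketch below feeds a few hypothetical fragments to Python's xml.etree.ElementTree parser. These are not the linked example pages. Each of the first three contains a bare “&” or “<”, which SGML-style parsing is generally understood to treat as ordinary data when no name follows; whether the SGML-based validator accepted each exact fragment is an assumption here. What the code demonstrates is only the XML side: the parser rejects all three and accepts the escaped version.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments, not the article's original example pages.
fragments = {
    "bare ampersand": "<p>Fish & chips</p>",
    "bare less-than in text": "<p>1 < 2</p>",
    "less-than in attribute": '<p title="a < b">text</p>',
    "escaped equivalents": '<p title="a &lt; b">Fish &amp; chips, 1 &lt; 2</p>',
}

def xml_error(text):
    """Return None if text is well-formed XML, else the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

results = {name: xml_error(text) for name, text in fragments.items()}
for name, message in results.items():
    print(f"{name}: {'well-formed' if message is None else message}")
```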

XML's definition of “validity”

Since XML is defined as a restricted form of SGML, it inherits SGML's definition of “validity”, which is what the W3C Validator checks for. However, the XML specification uses some wording which may cause a bit of confusion on this issue. The relevant line is as follows:

Definition: A data object is an XML document if it is well-formed, as defined in this specification. In addition, the XML document is valid if it meets certain further constraints.

Those “further constraints” involve two main checkpoints: compliance with the associated DTD, and compliance with the rules in the XML specification which are explicitly marked as “validity constraint”. In other words, the definition says that a well-formed document which meets the “further constraints” will inherently be valid, and such a document is also valid according to SGML's definition of the term.

However, this wording does not claim that a document must be well-formed in order to be considered valid. A document which is not well-formed would be considered a malformed XML document (by this definition, it is not a “true” XML document, and it cannot be parsed by an XML-specific parser), but that malformed document doesn't necessarily violate any validity requirement in either the XML specification or the SGML standard in general. That is why the W3C Validator gives the result it gives.