Web Devout tidings

Archive for the 'Web design theory' Category

Validity and well-formedness

Tuesday, February 20th, 2007

I’ve just published a new web development article called Validity and Well-Formedness, which explains the distinctions between valid and well-formed XHTML.

If the W3C HTML Validator says your XHTML page is valid, that means it’s also well-formed, right? Wrong! This article has several examples of XHTML documents which are perfectly valid but are malformed and won’t even load in an XML parser.
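As a rough illustration of what "won't even load in an XML parser" means in practice, here is a minimal sketch of a well-formedness check using Python's standard library. The two fragments are hypothetical and are not taken from the article, and this check says nothing about validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(source: str) -> bool:
    """Return True if an XML parser can load the text without a syntax error."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


# Hypothetical fragments for illustration only.
good = "<p>Some <em>emphasised</em> text.</p>"
bad = "<p>Some <em>emphasised</p></em>"  # overlapping tags

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```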

Bulletproof HTML: 37 Steps to Perfect Markup

Wednesday, November 1st, 2006

Bulletproof HTML: 37 Steps to Perfect Markup is a well-researched article by Tommy Olsson answering 37 general questions about HTML and XHTML that every web developer should know but too few do. Even if you’re an advanced web developer, chances are there are some bits of knowledge in here that you didn’t know.

There is no solution for the Q element

Tuesday, September 26th, 2006

Stacey Cordoni recently posted an article on A List Apart entitled "Long Live the Q Tag". The article discusses the problems with the q element stemming from Internet Explorer’s continuing lack of support and talks about some alternative solutions. The solution she settles on is to use q elements with the default quotation marks removed via CSS styling and manual quotation mark characters added directly in the HTML source outside the element.

This is not an adequate solution. It completely ignores user agents that don’t support CSS or have it disabled. Text browsers that correctly support HTML but not CSS would render such a quotation with two pairs of quotation marks. Lynx is one such browser.

I have found that there is no true solution for the problem with the q element. Unfortunately, the problem isn’t exclusive to Internet Explorer either, as there are other user agents that fail to handle the q element correctly. ELinks behaves like Internet Explorer in this respect.

The reality is that the q element simply won’t behave consistently in all major browsers no matter what you do, and so its use should be avoided.


Monday, September 11th, 2006

Errors reported by the HTML Validator often mention SHORTTAG and OMITTAG in the description. This has caused some confusion, so I will explain what these two features are and where they come from.

Every SGML language, including HTML and XML, has something called an SGML declaration that defines the lexical rules of the language. It defines which characters are used to delimit tags and other constructs, which character ranges can be used for element names, what kinds of constructs may exist in the document, and other things. The SGML declaration usually isn’t included in the document itself (in fact, most user agents won’t support it if it is), but is buried somewhere in the language’s specifications (see the HTML SGML declaration and the XML SGML declaration). Although browsers don’t usually parse actual SGML declarations, they typically choose which parsing rules to follow based on the HTTP Content-Type header. The HTML Validator is unusual in that it actually selects the parsing mode based on the doctype, so it parses a document with an XHTML doctype as XML even if it’s sent with Content-Type: text/html.
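The difference between the two selection strategies can be sketched roughly as follows. This is a simplified model, not how any particular browser or the Validator is implemented: the function names, the list of XML media types, and the doctype check are all assumptions made for illustration.

```python
import re

# Assumed set of media types treated as XML; real user agents have their own rules.
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}

# Captures the public identifier from a doctype declaration.
DOCTYPE = re.compile(r'<!DOCTYPE\s+\w+\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def browser_mode(content_type: str) -> str:
    """Choose parsing rules from the Content-Type header."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in XML_TYPES or media_type.endswith("+xml"):
        return "xml"
    return "html"


def validator_mode(document: str) -> str:
    """Choose parsing rules from the doctype's public identifier."""
    match = DOCTYPE.search(document)
    if match and "XHTML" in match.group(1).upper():
        return "xml"
    return "html"


page = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html></html>'
)
print(browser_mode("text/html; charset=utf-8"))  # html
print(validator_mode(page))                      # xml
```

The same page therefore ends up under different parsing rules depending on which signal is consulted.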

The SGML declaration defines a hierarchy of settings. One of the main categories is FEATURES, whose first subcategory is MINIMIZE. This is where you will find the SHORTTAG and OMITTAG feature settings.

OMITTAG defines whether or not start or end tags may ever be omitted. If YES, elements may define in the DTD whether start or end tags may be omitted. If NO, regardless of what the DTD says, they may never be omitted. OMITTAG is YES in HTML, but NO in XML and thus XHTML.

SHORTTAG then defines whether or not general shorthand features may be used. The format for this is different between the HTML SGML declaration and the XML SGML declaration. XML uses an extended format that toggles a number of features individually, while HTML (and classic SGML declarations) uses a single boolean value for all of the features. SHORTTAG consists of three main categories: STARTTAG, ENDTAG, and ATTRIB.

STARTTAG deals with start tags and contains three features: EMPTY, UNCLOSED, and NETENABL.

EMPTY defines whether or not the contents of the tag may be omitted. This is not the same as whether or not the contents of the element may be omitted. An empty start tag may look like this: <>. Instead of specifying the element name, it is assumed to be the same kind of element as the previous sibling (the element that most recently closed). This is legal (YES) in HTML, although no major browser supports it. It is illegal (NO) in XML and thus in XHTML.

UNCLOSED defines whether or not the start tag needs to be closed. Again, this is not the same as whether or not the element needs to be closed. Here is an application of an unclosed start tag: <div<p>This is a P inside a DIV.</p></div>. The end of the start tag is assumed by the beginning of the next tag. This is legal (YES) in HTML, although it is poorly supported. It is illegal (NO) in XML and thus XHTML.

NETENABL defines whether or not the start tag may use Null End Tag (NET) notation. This replaces the start tag’s closing delimiter and the end tag with special single-character delimiters. Here is an example of an element using a Null End Tag: <title/This is the title of the page/. The value for this feature may be NO, ALL (which is implied if SHORTTAG is simply YES), or IMMEDNET. Null End Tags are always legal (ALL) in HTML, although, as you might have guessed, no major browser supports them. In XML, the value is IMMEDNET, meaning that Null End Tags are supported, but only when the closing delimiter comes immediately after the opening delimiter, which in turn means that the element must have no contents. XML also uses a different character for the closing Null End Tag delimiter: “>”. Therefore, a Null End Tag in XML looks like this: <br/>, which people familiar with XML should recognize.
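To make the HTML form of the notation concrete, here is a small sketch that rewrites a Null End Tag element into the equivalent long form. It is a regular-expression approximation, not an SGML parser: the function name is made up, and it assumes no "/" appears inside attribute values or element content.

```python
import re

# "<name attrs/" opens the element, the next "/" acts as its end tag.
NET_PATTERN = re.compile(r"<([A-Za-z][A-Za-z0-9]*)((?:\s+[^<>/]*)?)/([^/<]*)/")


def expand_net(markup: str) -> str:
    """Rewrite <name/content/ as <name>content</name>."""
    return NET_PATTERN.sub(r"<\1\2>\3</\1>", markup)


print(expand_net("<title/This is the title of the page/"))
# <title>This is the title of the page</title>
```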

ENDTAG deals with end tags and contains two features: EMPTY and UNCLOSED.

This EMPTY is similar to the one in STARTTAG, but it applies to end tags and it is assumed to close the most recent element that is open. For example, if you have <div>Foo <span>bar</span> baz</>, the empty end tag closes the div element. This is legal (YES) in HTML, but illegal (NO) in XML and thus XHTML.
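The "closes the most recent element that is open" rule amounts to keeping a stack of open elements. The sketch below expands empty end tags on that basis; it is only an illustration with a hypothetical function name, and it assumes tags without attributes, correct nesting, and no omitted tags.

```python
import re

# Matches <name>, </name>, <> and </> (attributes are not handled).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)?>")


def expand_empty_end_tags(markup: str) -> str:
    open_elements: list[str] = []
    pieces: list[str] = []
    position = 0
    for match in TAG.finditer(markup):
        pieces.append(markup[position:match.start()])
        position = match.end()
        closing, name = match.group(1) == "/", match.group(2)
        if closing:
            if not open_elements:
                raise ValueError("end tag with no open element")
            # An empty end tag closes the most recently opened element.
            current = open_elements.pop()
            pieces.append(f"</{name or current}>")
        else:
            if name is not None:
                open_elements.append(name)
            # Empty start tags (<>) are passed through untouched.
            pieces.append(match.group(0))
    pieces.append(markup[position:])
    return "".join(pieces)


print(expand_empty_end_tags("<div>Foo <span>bar</span> baz</>"))
# <div>Foo <span>bar</span> baz</div>
```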

This UNCLOSED is also similar to the one in STARTTAG, and applies to end tags. The end of the end tag is assumed by the beginning of the next tag. For example, <div><div>Foo</div<p>Bar</p</div>. This is legal (YES) in HTML, but illegal (NO) in XML and thus XHTML.
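Both UNCLOSED features follow the same rule: a tag that runs straight into the next "<" is treated as if it had been closed there. A minimal sketch of that rule, assuming tags without attributes and using a made-up function name:

```python
import re

# A start or end tag name that is immediately followed by another "<".
UNCLOSED = re.compile(r"(</?[A-Za-z][A-Za-z0-9]*)(?=<)")


def close_unclosed_tags(markup: str) -> str:
    """Insert the missing ">" wherever a tag is cut short by the next tag."""
    return UNCLOSED.sub(r"\1>", markup)


print(close_unclosed_tags("<div<p>This is a P inside a DIV.</p></div>"))
# <div><p>This is a P inside a DIV.</p></div>
print(close_unclosed_tags("<div><div>Foo</div<p>Bar</p</div>"))
# <div><div>Foo</div><p>Bar</p></div>
```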

ATTRIB deals with attributes and contains three features: DEFAULT, OMITNAME, and VALUE.

DEFAULT defines whether or not attributes may have default values that are defined in the DTD. This is enabled (YES) in both HTML and XML and thus XHTML.

OMITNAME defines whether or not attribute names may be omitted. In such a case, the given attribute value will be used for both the attribute name and attribute value. For example, <input type="checkbox" checked> is equivalent to <input type="checkbox" checked="checked">. This is legal (YES) in HTML, although several major browsers don’t treat it literally in some areas like CSS attribute selectors. It is illegal (NO) in XML and thus XHTML.
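The equivalence can be sketched with Python's standard html.parser module, which reports a minimized attribute with a value of None. The class and function names here are illustrative, and the expansion step is my own reading of the rule above rather than something the parser does for you.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects the attributes of the last start tag seen."""

    def __init__(self):
        super().__init__()
        self.attributes = []

    def handle_starttag(self, tag, attrs):
        # A minimized attribute arrives as (name, None); reuse the name as the value.
        self.attributes = [
            (name, name if value is None else value) for name, value in attrs
        ]


def expand_minimized(tag_source: str):
    collector = AttributeCollector()
    collector.feed(tag_source)
    collector.close()
    return collector.attributes


print(expand_minimized('<input type="checkbox" checked>'))
# [('type', 'checkbox'), ('checked', 'checked')]
```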

VALUE defines whether or not attribute values may be specified without delimiting quotation marks if the value uses certain ranges of characters. This is legal (YES) in HTML, but it is illegal (NO) in XML and thus XHTML.

So here’s the summary: HTML has a simple YES for both OMITTAG and SHORTTAG, meaning all of the above features are allowed. XML has NO for OMITTAG and has a feature breakdown for SHORTTAG, amounting to YES for ATTRIB DEFAULT, IMMEDNET for NETENABL, and NO for everything else.
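The same summary can be written down as data. This is just a restatement of the settings described above in a made-up dictionary layout, not the syntax of an actual SGML declaration.

```python
SHORTTAG_FEATURES = (
    "STARTTAG EMPTY", "STARTTAG UNCLOSED", "STARTTAG NETENABL",
    "ENDTAG EMPTY", "ENDTAG UNCLOSED",
    "ATTRIB DEFAULT", "ATTRIB OMITNAME", "ATTRIB VALUE",
)


def html_profile() -> dict:
    # A plain SHORTTAG YES enables everything; NETENABL is then implied to be ALL.
    settings = {name: "YES" for name in SHORTTAG_FEATURES}
    settings["STARTTAG NETENABL"] = "ALL"
    settings["OMITTAG"] = "YES"
    return settings


def xml_profile() -> dict:
    # XML switches everything off except default attributes and immediate NETs.
    settings = {name: "NO" for name in SHORTTAG_FEATURES}
    settings["STARTTAG NETENABL"] = "IMMEDNET"
    settings["ATTRIB DEFAULT"] = "YES"
    settings["OMITTAG"] = "NO"
    return settings


def differences(a: dict, b: dict) -> dict:
    """Return the settings on which two profiles disagree."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


for feature, (html, xml) in differences(html_profile(), xml_profile()).items():
    print(f"{feature}: HTML={html}, XML={xml}")
```

ATTRIB DEFAULT is the only setting the two profiles share, so it is the only one missing from the printed list.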

Although it is technically legal to write your own SGML declaration right into an HTML document, extremely few user agents will even recognize it, let alone support it correctly. It is strictly illegal to write your own SGML declaration into an XML document. SHORTTAG and OMITTAG aren’t options you can toggle to please the browser; they are inherent traits of HTML and XML, and valid documents must conform to those rules.

Null end tags in XHTML

Wednesday, August 9th, 2006

I have mentioned here and there that XHTML (and XML in general) wasn’t designed to support SGML null end tags. This isn’t completely true. XML supports a restricted and altered form of null end tags, and in fact they are used all the time.

Null end tags are a way to abbreviate an end tag to a single character. They are not supported by most common HTML user agents, but they do exist in HTML’s profile of SGML and there are HTML user agents that support them. In HTML and the default SGML profile, null end tags look like this:

<title/This is the title of the page/

For fully compliant HTML user agents, the above is equivalent to the following:

<title>This is the title of the page</title>

As you can see, the contents of the element are surrounded by “/” characters, which are used more or less like the quotes used for attribute values. If an SGML profile and DTD require a certain element’s end tag to be omitted, only one slash is relevant for the element (any further slashes will be treated as character data). For example, the following tag is valid in HTML:

<img src="image.png" alt="An image"/

Although it doesn’t save any characters and isn’t widely supported, it is perfectly legal according to the standard. This is the case I had in mind when I discussed problems with XHTML. The above isn’t legal in XHTML, but the following is the closest equivalent:

<img src="image.png" alt="An image"/>

An XHTML user agent would see the above as a single img tag, but a fully compliant HTML user agent would see it as a shortened null end tag like the previous example but with a “>” character after it. The “>” character would be seen as regular character data and would display on the page itself. Despite common practice, a space before the “/” character wouldn’t change this.

I have said that this issue is due to XHTML/XML not supporting null end tags. However, it’s more accurate to say that XML doesn’t support null end tags in the same way as HTML. Rather than the contents being surrounded by two slashes, they are surrounded by a slash and a greater-than sign (/ ... >) with the additional constraint that they may only be used when the contents are empty. So the second img example is actually XML’s version of null end tags: the start tag ends with the “/” character and the end tag is represented by the “>” character. Because end tags may not be omitted in XML, the “>” character is always required, and because the null end tag rule in XML is defined as “IMMEDNET” (explained below), it must close immediately after it is started, so there may be no actual content.
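The two readings of the second img example can be put side by side in a short sketch. The XML reading uses Python's standard XML parser; the strict HTML reading is only a hand-written model of the behaviour described above (no Python library implements it), and both function names are made up.

```python
import xml.etree.ElementTree as ET


def xml_view(source: str):
    """What an XML parser sees: one empty element with its attributes."""
    element = ET.fromstring(source)
    return element.tag, dict(element.attrib), element.text


def strict_html_view(source: str):
    """Model of the SGML reading: "/" finishes the tag and ">" is left as text.

    Only handles a single tag that ends in "/>".
    """
    if not source.endswith("/>"):
        raise ValueError("expected a tag ending in '/>'")
    return source[:-2].rstrip() + ">", ">"


tag = '<img src="image.png" alt="An image"/>'
print(xml_view(tag))
# ('img', {'src': 'image.png', 'alt': 'An image'}, None)
print(strict_html_view(tag))
# ('<img src="image.png" alt="An image">', '>')
```

The leftover ">" in the second result is the character that a fully compliant HTML user agent would display on the page.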

Although the specifications don’t clearly discuss these issues, they are a result of the respective standards’ SGML declarations that define the profile of SGML used. See the HTML SGML declaration and the XML SGML declaration. These declarations are automatically assumed by the browser when it is given a hint to treat the page as HTML or XML (such as via the content type). The SGML declaration defines the most basic level of how the SGML document is written. It defines which characters delimit tags, marked sections, character references, processing instructions, etc., what kinds of shorthand features may be used, possibly some default character entities that are available regardless of the DTD, and other lexical aspects of the document. The SGML declaration is applied on top of a default profile, called the “reference concrete syntax”, that is defined in the SGML standard itself.

The HTML SGML declaration isn’t very big because it mostly uses SGML’s defaults. The defaults include “/” for syntax.delim.net, meaning that null end tag contents are delimited by “/” characters. XML uses “>” for syntax.delim.net, plus “/” for syntax.delim.nestc. Nestc is an extension to the original SGML standard that provides a different value for the null end tag delimiter that finishes the start tag. XML uses other extensions, such as more specific options in the “features” section. HTML enables features.minimize.shorttag, which allows shorthand constructs like null end tags, while XML specifically has features.minimize.shorttag.starttag.netenabl set to “IMMEDNET” which, as mentioned above, enables null end tags with the restriction that they must close immediately after opening.

The reason XML was designed to support this form of null end tags was to reduce the potential clutter caused by a large number of empty elements. The null end tag delimiters were altered so that the null end tags don’t look too alien for people who are used to the widely supported parts of the HTML standard. They were designed to look like regular start tags with a simple slash before the end, which reminds people of the function of end tags. In this way, they managed to design XML to be strict, efficient, intuitive, and compatible with modern SGML user agents that know where to find the SGML declaration.