Web Devout tidings


HTML good practice checker

Sunday, July 1st, 2007

Do you like clean markup? Do you use HTML and still prefer to quote all your attribute values, use lower-case tag names, and generally follow good clean markup practices? Do you wish you could force the HTML Validator to be even more strict so you could quickly identify stray XHTML-style self-closing tags in your HTML and other issues that it usually ignores?

If so, then you may find my new HTML good practice checker useful. It sets up a custom SGML declaration (for markup parsing rules) and a custom DTD (for document structure rules) that together instruct the W3C HTML Validator to check your document more strictly.

Here is a partial list of the new rules enforced:

  • All tag and attribute names must be lower-case.
  • All attribute values must be quoted.
  • Declarations are case-sensitive like in XML.
  • SGML Null End Tags (NET) are not allowed. This means that the validator will recognize that a <br /> in an HTML document is a problem.
  • End tags must be used on all non-empty elements. Note: If an end tag is forbidden in normal HTML, it’s still forbidden here.
  • Start tags must be used on all elements.
  • You may not write <tr> tags directly inside the table contents; you must include them in a tbody. In fact, in terms of document structure, tr was never truly allowed as a child of table in HTML. Rows written that way were implicitly placed in a tbody element whose start and end tags had been omitted, so this rule is just a natural consequence of the two preceding rules (required end tags and required start tags). Note that HTML differs here from XHTML, where tr actually is allowed as a child of table; the good practice of writing an explicit tbody element improves consistency between HTML and XHTML.
  • Nested tables are not allowed.
  • Unclosed tags and empty tags (obscure and poorly-supported SGML shorthand rules) are no longer allowed.
  • Attributes may no longer use minimized form (for example, the disabled attribute must be written disabled="disabled").
  • Hexadecimal character references must use a lower-case “x” like in XML.
  • The following presentational elements may not be used: tt, i, b, big, small.
  • The q element may not be used, due to major unresolvable compatibility issues.
  • The width and height attributes are required on img elements.
  • The name attribute has been removed on the a element. You should use id instead.
  • The following attributes were removed from the table element: width, border, frame, rules, cellspacing, cellpadding, datapagesize (a reserved attribute).
  • The following attributes were removed from all other table-related elements: width, align, char, charoff, valign.
  • On the script element, the reserved event and for attributes have been removed.
  • In order to avoid issues when user agents confuse UTF-8 and ISO-8859-1, characters above &#126; are no longer allowed to be written directly in the document. You should use character references for them.
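
As a rough illustration of several of the rules above, here is a small document written in the stricter style. The page content, file name, id, and form fields are made up for the example. The DOCTYPE is left out because the checker's custom declaration is not reproduced in this post, and the markup has not been run through the checker itself, so treat it as a sketch rather than a guaranteed pass.

```html
<!-- DOCTYPE omitted: the checker's custom declaration is not shown in this post. -->
<html lang="en">
  <head>
    <!-- The e-acute is written as a hexadecimal reference with a lower-case x. -->
    <title>Caf&#xe9; menu</title>
  </head>
  <body>
    <!-- Explicit tbody, and no width, border, or cellpadding attributes on the table. -->
    <table>
      <tbody>
        <tr>
          <td>Espresso</td>
          <td>Small</td>
        </tr>
      </tbody>
    </table>
    <!-- br has no end tag and no XHTML-style slash; img carries width and height. -->
    <p id="logo">Our logo:<br><img src="logo.png" alt="Logo" width="120" height="40"></p>
    <!-- The boolean attribute is written in full rather than in minimized form. -->
    <form action="/order" method="post">
      <p><input type="text" name="item" disabled="disabled"></p>
    </form>
  </body>
</html>
```

Every element here has an explicit start tag, every non-empty element has an end tag, all names are lower-case, and all attribute values are quoted.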

I’m always open to feedback. For now, the checks this system can perform are mostly limited to rules that can be expressed in the SGML declaration and DTD. Keep in mind that the system is new and may have bugs. If you come across any, please let me know.