5.2.2 Supported HTML subset and HTML cleaning
This cleanup process can also be triggered manually by pressing the "Cleanup edited HTML" button. This is useful when you have pasted content copied from an external application and want to see what it will look like after cleaning. The cleanup is also performed when switching from WYSIWYG to source view.
5.2.2.1 Supported HTML subset
These are the supported tags (or "elements") and attributes:
- <html> and <body>
- <p> with optional attributes align and class. The class attribute is only kept if it has one of the following values: note, warn, fixme
- <pre> with optional class attribute. The class attribute is only kept if it has one of the following values: query, include, query-and-include
- <h1>, <h2>, <h3>, <h4>, <h5>
- <a> with required attribute href. If the href attribute is missing, the <a> will be dropped.
- <strong>, <em>, <sup>, <sub>, <tt>, <del>
- <ul>, <ol>, <li>
- <img> with attributes src and align (optional)
- <table> with optional attribute class, <tbody>, <tr>, <td>, <th>. <td> and <th> can have the attributes colspan, rowspan and valign
All tags not listed above will be removed (but their character content will remain). On the block-type elements and images, the id attribute is supported. For the most accurate list of elements and attributes, have a look at the htmlcleaner.xml file (see below).
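The filtering rules above can be illustrated with a minimal sketch. It is not the actual cleaner implementation: the `ALLOWED` table is a hypothetical, abbreviated whitelist, and the class name `WhitelistFilter` and function `clean` are invented for this example. It only shows the principle of dropping unsupported tags while keeping their text, filtering attribute values, and dropping an <a> that has no href.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, abbreviated whitelist: tag -> {attribute: allowed values, or None for "any value"}.
ALLOWED = {
    "p": {"align": None, "class": {"note", "warn", "fixme"}},
    "pre": {"class": {"query", "include", "query-and-include"}},
    "a": {"href": None},
    "strong": {},
    "em": {},
}


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_links = []  # one flag per open <a>: True if it was kept

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # unsupported tag is dropped; its text still arrives via handle_data
        rules = ALLOWED[tag]
        kept = [
            (name, value)
            for name, value in attrs
            if name in rules
            and value is not None
            and (rules[name] is None or value in rules[name])
        ]
        if tag == "a":
            has_href = any(name == "href" for name, _ in kept)
            self.open_links.append(has_href)
            if not has_href:
                return  # <a> without href is dropped
        attr_text = "".join(f' {name}="{escape(value)}"' for name, value in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag not in ALLOWED:
            return
        if tag == "a" and not (self.open_links and self.open_links.pop()):
            return  # matching start tag was dropped, so drop the end tag too
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def clean(html):
    parser = WhitelistFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(clean('<p class="big">Hi <font color="red">there</font> <a>x</a></p>'))
```

In this sketch the unsupported class value, the <font> tag and the href-less <a> all disappear, while the text they contained is preserved.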
The supported tags can have any content model as allowed by the HTML DTD, but of course limited to the supported tags. If an element occurs in a location where it is not supported, an ancestor is searched where it is allowed and the containing element(s) are ended, the element inserted, and the containing elements reopened. This happens for example when a <table> occurs inside a <p>.
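The close, insert and reopen behaviour can be sketched as follows. This is only an illustration under stated assumptions: `ALLOWED_CHILDREN` is a hypothetical and heavily simplified content model (not the HTML DTD), and `insert_element` is a name invented for this example.

```python
# Hypothetical, heavily simplified content model: parent -> children it may contain.
ALLOWED_CHILDREN = {
    "body": {"p", "table"},
    "p": {"strong", "em"},
    "strong": {"em"},
    "em": {"strong"},
}


def insert_element(open_stack, tag):
    """Return the markup needed to place `tag`, given the currently open elements.

    open_stack lists the open elements from outermost to innermost,
    for example ["body", "p", "strong"].
    """
    # Walk from the innermost open element outwards until one accepts the tag.
    for depth in range(len(open_stack) - 1, -1, -1):
        if tag in ALLOWED_CHILDREN.get(open_stack[depth], set()):
            to_close = open_stack[depth + 1:]
            events = [f"</{name}>" for name in reversed(to_close)]  # end containers
            events.append(f"<{tag}>...</{tag}>")                    # insert the element
            events += [f"<{name}>" for name in to_close]            # reopen containers
            return events
    return []  # no ancestor accepts the element


print(insert_element(["body", "p", "strong"], "table"))
```

For a <table> encountered inside <p><strong>, the sketch ends both containers, emits the table at <body> level, and then reopens <p> and <strong>.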
<b> and <i> are translated to <strong> and <em> respectively, as are <span> tags with font-weight/font-style specifications.
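As a rough sketch of this translation, the function below maps a presentational tag to its replacement. The function name `semantic_tag` is hypothetical, and the assumption that only `font-weight: bold` and `font-style: italic` trigger the translation is ours; the values the cleaner actually recognizes may differ.

```python
def semantic_tag(tag, style=""):
    """Return the replacement tag for a presentational tag, or None if there is none."""
    tag = tag.lower()
    if tag == "b":
        return "strong"
    if tag == "i":
        return "em"
    if tag == "span":
        # Parse "prop: value; prop: value" into a dictionary of declarations.
        declarations = {}
        for part in style.split(";"):
            if ":" in part:
                name, value = part.split(":", 1)
                declarations[name.strip().lower()] = value.strip().lower()
        # Assumed trigger values; the real cleaner may accept others.
        if declarations.get("font-weight") == "bold":
            return "strong"
        if declarations.get("font-style") == "italic":
            return "em"
    return None


print(semantic_tag("span", "font-weight: bold"))
```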
If two or more <br> tags appear directly after one another, they are translated to a paragraph split. The meaningless <br> tags that the Mozilla editor tends to leave everywhere are removed. Text that appears directly in the <body> is wrapped inside <p> elements.
<br> tags inside <pre> are translated to newline characters.
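The two <br> rules can be sketched with regular expressions. This is a simplification that works on the inner markup of a single element as a string, whereas a real cleaner would operate on parsed elements; the function names are invented for this example.

```python
import re

BR = r"<br\s*/?>"  # matches <br>, <br/> and <br />


def split_paragraph(inner):
    """Split the inner HTML of one <p> wherever two or more <br> tags follow each other."""
    parts = re.split(rf"(?:{BR}\s*){{2,}}", inner, flags=re.IGNORECASE)
    return "".join(f"<p>{part.strip()}</p>" for part in parts if part.strip())


def pre_breaks_to_newlines(inner):
    """Replace every <br> in the inner HTML of a <pre> with a newline character."""
    return re.sub(BR, "\n", inner, flags=re.IGNORECASE)


print(split_paragraph("one<br><br>two<br/>still two"))
print(repr(pre_breaks_to_newlines("a<br>b<BR/>c")))
```

Note that a single <br> is left alone by `split_paragraph`; only a run of two or more causes a split.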
The result is serialized as a UTF-8 encoded HTML document that is well-formed XML (but not XHTML). Lines are split at 80 characters where possible, and meaningless whitespace is removed.
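The "split at 80 characters if possible" behaviour can be approximated with the Python standard library. This sketch handles only the plain text of one paragraph and makes no claim about how the cleaner itself wraps markup; `wrap_text` is a hypothetical helper.

```python
import textwrap


def wrap_text(text, width=80):
    """Collapse whitespace runs and wrap at word boundaries, never inside a word."""
    collapsed = " ".join(text.split())
    return textwrap.fill(
        collapsed,
        width=width,
        break_long_words=False,   # a word longer than the width stays on one line
        break_on_hyphens=False,
    )


print(wrap_text("word " * 40))
```

With `break_long_words=False`, a single token longer than 80 characters is left intact on its own line, which corresponds to the "if possible" qualification.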
All this should also ensure that the resulting HTML is (mostly) the same whether it is edited using Mozilla or Internet Explorer.
The supported tags, attributes and classes for <p> are not hardcoded but can be configured in a file (htmlcleaner.xml). However, making arbitrary adjustments to this file is not supported, because the html-cleaner code expects certain tags to be there. Adding new tags or attributes should generally not be a problem, but they will not have the necessary GUI editing support unless you implement that as well.