MiniML

From CommerceNet Wiki

What's the least you have to do to stuff XML into XHTML and still squeeze it back out?

Goals

  • Convert XML into XHTML

Because, every once in a while, you still have to present a pure, beautiful gem of XML in a lousy browser...

  • Convert XHTML into XML

Because it's not very CLASSy to force XML-heads to parse all this to find John Doe's email address:

<script><!--
 function qs(el) {if (window.RegExp && window.encodeURIComponent
 ) {var ue=el.href;var qe=encodeURIComponent(document.f.q.value);
 if(ue.indexOf("q=")!=-1){el.href=ue.replace(new RegExp("q=[^&$]*"),
 "q="+qe);}else{el.href=ue+"&q="+qe;}}return 1;}
 // -->
 </script><table border=0 cellspacing=0 cellpadding=4><tr><td
 nowrap class='vcard fn n'><font size=-1 class=email
 ><a class="work pref" href= mailto:john.doe@tempuri.org><b>
 <span class=Given-Name>John</span> <font size=+12 class=
 Family-Name>Doe</font></b></a>
 

... after all, what's really so terrible about this?

 <vcard>
  <fn>John Doe</fn>
  <n>
   <Family-Name>Doe</Family-Name>
   <Given-Name>John</Given-Name>
  </n>
  <email>
   john.doe@tempuri.org <work/> <pref/>
  </email>
 </vcard>
 

There's one tool chain that works with HTML -- browsers, DOM scripting, search engines -- and an almost-entirely-different tool chain for "XML" -- web services, parsers, query languages. It would be nice to switch back and forth between representations that are well-tuned for native speakers of each tongue. Microformats are a significant step in this direction. The constraints are:

  • Pretty - keeping the generated versions of the fragments compact and 'reasonable'
  • Arbitrary - staying domain-independent and avoiding special cases
  • Equivalent - converting as much information as possible in either direction

Rules

  1. Element names map to class names: tag -> .--tag
  2. Attributes map to definition lists: @attr=value ... -> dl.-attrs where each dt contains attr and the corresponding dd contains value; pairs must be two-level sorted in Unicode order with attr major and value minor.

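Why should this work? '-' is forbidden as the first character of an XML element name, but is legitimate as the first character of a word in an XHTML class attribute, so a class word beginning with '-' can never collide with a converted element name.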
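The two rules can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name to_miniml is hypothetical, and the sketch assumes a namespace-free input, ignores XSD data types and miniml:element, and chooses div whenever the XHTML element will contain child elements (including the dl.-attrs list) and span otherwise.

```python
import xml.etree.ElementTree as ET


def to_miniml(elem: ET.Element) -> ET.Element:
    """Convert one XML element (and its subtree) to MiniML-style XHTML.

    Hypothetical helper: handles only Rule 1 and Rule 2, no namespaces,
    no data types, no miniml:element overrides.
    """
    # Rule 1: the element name becomes the class word "--name".
    # div if the XHTML element will have child elements, span otherwise.
    will_have_children = len(elem) > 0 or bool(elem.attrib)
    out = ET.Element("div" if will_have_children else "span",
                     {"class": "--" + elem.tag})

    if elem.attrib:
        # Rule 2: attributes become a dl.-attrs of dt/dd pairs, sorted by
        # code point with the attribute name major and the value minor.
        dl = ET.SubElement(out, "dl", {"class": "-attrs"})
        for name, value in sorted(elem.attrib.items()):
            ET.SubElement(dl, "dt").text = name
            ET.SubElement(dl, "dd").text = value
        # The element's leading text now follows the attribute list.
        dl.tail = elem.text
    else:
        out.text = elem.text

    for child in elem:
        converted = to_miniml(child)
        converted.tail = child.tail  # keep text that follows the child
        out.append(converted)
    return out


source = ET.fromstring('<A>foo<b/><C F="F" D="E"> bar</C></A>')
print(ET.tostring(to_miniml(source), encoding="unicode"))
```

Sorting the (name, value) tuples relies on Python comparing strings by code point, which is what "Unicode order" is taken to mean here.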

Caveats

  • for Rule 1, the element you put the class .--tag on is either a div or span depending on whether it will have any child elements in XHTML.
    • non-default XHTML elements have an additional @miniml:element=xhtml:tag in XML
    • see below for a discussion of case-sensitivity in older versions of HTML
  • for Rule 2, there are exceptions around known XSD data types &c.

In general, we want the mapping to be bidirectionally roundtrippable while maintaining XML document equivalence. That implies that comments, CDATA sections, entities, DTDs and PIs may be lost in translation...
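The reverse direction, squeezing the XML back out, might look like the following sketch. The name from_miniml is hypothetical, and the code assumes every XHTML element in the fragment carries a "--tag" class word and that any dl.-attrs list is the first child of its parent; native XHTML elements, data types, and namespaces are not handled.

```python
import xml.etree.ElementTree as ET


def from_miniml(node: ET.Element) -> ET.Element:
    """Recover an XML element from MiniML-style XHTML (illustrative only)."""
    # Rule 1 in reverse: the class word starting with "--" holds the name.
    words = node.get("class", "").split()
    tag = next(w[2:] for w in words if w.startswith("--"))
    out = ET.Element(tag)

    children = list(node)
    first = children[0] if children else None
    if first is not None and first.tag == "dl" and first.get("class") == "-attrs":
        # Rule 2 in reverse: pair each dt with the dd that follows it.
        dl = children.pop(0)
        names = [dt.text or "" for dt in dl.findall("dt")]
        values = [dd.text or "" for dd in dl.findall("dd")]
        out.attrib.update(zip(names, values))
        # Text after the attribute list was the element's leading text.
        out.text = dl.tail
    else:
        out.text = node.text

    for child in children:
        converted = from_miniml(child)
        converted.tail = child.tail
        out.append(converted)
    return out


xhtml = ('<div class="--A">foo<span class="--b"/>'
         '<div class="--C"><dl class="-attrs">'
         '<dt>D</dt><dd>E</dd><dt>F</dt><dd>F</dd></dl> bar</div></div>')
print(ET.tostring(from_miniml(ET.fromstring(xhtml)), encoding="unicode"))
```

Comments and processing instructions never reach the output, since ElementTree's default parser discards them; that matches the losses listed above.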

Example

XML:

 <A
  xmlns:xhtml="http://www.w3.org/1999/xhtml"
  xmlns:miniml="..."
  xmlns:xsd="...">
  foo
  <b/><Temp xsi:type="xsd:float">98.7</Temp>
  <C F="F" D="E"> bar</C>
  <xhtml:abbr
   xhtml:title="the day before yesterday">
      the other day
  </xhtml:abbr>
  <qq
   miniml:content="
      International
      Qu&lt;b&gt;ag&lt;/b&gt;ga
      Day
      "
   miniml:element="xhtml:abbr"
   xsd:type="iso-8601">
     20050810
  </qq>
  <copyright miniml:element="xhtml:small">
     Copyleft Ↄ⃝ 2003, Me. All Rights Reversed.
  </copyright>
 </A>

MiniML XHTML:

 <div class="--A"
  xmlns="http://www.w3.org/1999/xhtml"> foo
  <span class="--b"/>
  <span class="--Temp msd-float">98.7</span>
  <div class="--C"><dl class="-attrs">
     <dt>D</dt> <dd>E</dd>
     <dt>F</dt> <dd>F</dd>
  </dl> bar</div>
  <abbr title="the day before yesterday">
     the other day
  </abbr>
  <abbr class="--qq msd-iso-8601" title="20050810">
     International Qu<b>ag</b>ga Day
  </abbr>
  <small class="--copyright">
     Copyleft Ↄ⃝ 2003, Me. All Rights Reversed.
  </small>
 </div>
 
XML:

 <A> foo
  <b/><Temp xsi:type="xsd:float">98.7</Temp>
  <C F="F" D="E"> bar</C>
 </A>

MiniML XHTML:

<div class="--A"> foo
 <span class="--b"/>
 <span class="--Temp msd-float">98.7</span>
 <div class="--C">
  <dl class="-attrs">
   <dt>D</dt> <dd>E</dd>
   <dt>F</dt> <dd>F</dd>
  </dl> bar</div>
</div>
 

The examples are not normative, and in particular are incorrect in that whitespace has been added to aid readability.

Appendix: Case Study

In XHTML and modern HTML (4.01), the contents of the CLASS attribute are case-sensitive. This means that roundtripping through the HTML 3.2-era case-insensitive version of CLASS (see http://www.w3.org/TR/WD-style-970324 for details) will lose information.

However, if we cared about HTML 3.x or earlier, we could kludge a casemap into the class name, giving .-casemap-tag. The casemap is a little-endian octal representation, with trailing zeros removed, of a bit vector with one bit position for each pair of consecutive boundaries in the "character string" (used here in the XML sense) tag; each bit is set if and only if the substring between the boundaries is not invariant under conversion to lower case, that is, if lowercasing would change it. This would be good enough to allow full data recovery after locale-independent case-smashing in almost all cases, but there are corner cases in Unicode where this is still insufficient information for full data recovery. Using this notation, an entirely-lowercase tag would map to .--tag, but other cases look a bit odd (and are much harder to hand-edit correctly):

 XML element name    XHTML class attribute
 example             --example
 Example             -1-Example
 EXAMPLE             -771-EXAMPLE
 ExAmPlE             -521-ExAmPlE
 AnotherExample      -102-AnotherExample
 A-Final:Example     -504-A-Final:Example

Unfortunately, this would require every MiniML implementation to have significant knowledge of Unicode combining characters and conjoining character blocks, UTF-16 surrogate pairs (in UTF-16-based text processing environments), and cases where there are multiple possible uppercase or lowercase equivalents to a particular character. It would also introduce a fragile octal component. A more robust solution might use a base32 or hexstring representation of the UTF-8 encoding, but that is completely unreadable and doesn't convey anything useful to the XHTML consumer -- good luck writing a getElementsByClassName()!
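For the simple cases in the table above, the casemap can be sketched as follows. The function names casemap_class and restore_case are hypothetical, and the sketch deliberately sidesteps the Unicode problems just described: it assumes one bit per code point and that str.lower() and str.upper() are exact inverses for every character in the tag, which holds for ASCII names but not in general.

```python
def casemap_class(tag: str) -> str:
    """Build the ".-casemap-tag" class word for an XML element name."""
    # One bit per character: 1 if lowercasing would change the character.
    bits = [1 if ch.lower() != ch else 0 for ch in tag]

    # Pack three bits per octal digit, little-endian: the earliest
    # character in each group is the least significant bit.
    digits = []
    for start in range(0, len(bits), 3):
        group = bits[start:start + 3]
        digits.append(sum(bit << pos for pos, bit in enumerate(group)))

    # Remove trailing zeros, so an all-lowercase name has an empty casemap.
    while digits and digits[-1] == 0:
        digits.pop()

    return "-" + "".join(str(d) for d in digits) + "-" + tag


def restore_case(class_word: str) -> str:
    """Recover the original name from a (possibly lowercased) class word."""
    # The casemap ends at the second "-"; the tag itself may contain "-".
    casemap, tag = class_word[1:].split("-", 1)
    chars = []
    for i, ch in enumerate(tag):
        digit = int(casemap[i // 3]) if i // 3 < len(casemap) else 0
        chars.append(ch.upper() if (digit >> (i % 3)) & 1 else ch)
    return "".join(chars)


for name in ["example", "Example", "EXAMPLE", "ExAmPlE",
             "AnotherExample", "A-Final:Example"]:
    word = casemap_class(name)
    # Simulate case-smashing by an HTML 3.2-era tool, then recover.
    assert restore_case(word.lower()) == name
    print(name, word)
```

The loop at the end simulates a case-insensitive tool by lowercasing each class word and checks that the original name comes back.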

Future notes

  • minimlist
  • minimldict
  • alternate content and abbr @title/@miniml:content
  • Integration with microformats and with "data structures" printed as AXIS-SOAP, MS-SOAP, XML-RPC, YAML (?) etc.
  • Integration with SVG and other XML vocabularies embedded in XHTML fragments.