Friday, August 22, 2008

Some sample templates for use with LexEv

If your XML has been parsed using LexEv, here are some sample templates for handling the LexEv markup.

To output an entity reference:



<xsl:template match="lexev:entity">
<xsl:value-of disable-output-escaping="yes" select="concat('&amp;', @name, ';')"/>
</xsl:template>



To process a CDATA section as markup:


<xsl:template match="lexev:cdata">
<xsl:apply-templates/>
</xsl:template>


To output a DOCTYPE from the processing instructions:

In XSLT 1.0 the doctype-public and doctype-system attributes on xsl:output are static and need to be known at compile time, which means I'm afraid you have to do this:


<xsl:template match="/">
<xsl:value-of disable-output-escaping="yes"
select="concat('&lt;!DOCTYPE ', name(/*), '&#xa; PUBLIC &quot;',
processing-instruction('doctype-public'), '&quot; &quot;',
processing-instruction('doctype-system'), '&quot;&gt;')"/>
<xsl:apply-templates/>
</xsl:template>


In XSLT 2.0 you can use xsl:result-document, where doctype-public and doctype-system are AVTs, which means their values can be determined at runtime:


<xsl:template match="/">
<xsl:result-document
doctype-public="{processing-instruction('doctype-public')}"
doctype-system="{processing-instruction('doctype-system')}">
<xsl:apply-templates/>
</xsl:result-document>
</xsl:template>
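
If the transform is driven from Java rather than the command line, another option (not one of the templates above, and not part of LexEv) is to set the DOCTYPE through JAXP output properties at runtime. This is only a minimal sketch: it assumes the public and system identifiers have already been read from the doctype-public and doctype-system processing instructions, the class and method names are made up for illustration, and it uses the identity transformer where you would normally pass your own stylesheet.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, for illustration only.
class RuntimeDoctype {

    // Copies the input unchanged, adding a DOCTYPE whose identifiers
    // are supplied at runtime rather than fixed in the stylesheet.
    static String copyWithDoctype(String xml, String publicId, String systemId) {
        try {
            // newTransformer() with no arguments gives the identity transform
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.DOCTYPE_PUBLIC, publicId);
            t.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, systemId);
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Both properties are set together here because a public identifier on its own is not enough to produce a legal DOCTYPE declaration.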

Thursday, August 21, 2008

LexEv XMLReader - converts lexical events into markup

It's often a requirement to preserve entity references through to the output (which are usually lost during parsing) or to process the contents of CDATA sections as markup. The Lexical Event XMLReader wraps the standard XMLReader to convert lexical events into markup so that they can be processed. Typical uses are:

  • Converting CDATA sections into markup:


    <![CDATA[ <p> a para </p> ]]>

    to:

    <lexev:cdata> <p> a para </p> </lexev:cdata>



  • Preserving entity references:


    hello&mdash;world

    is converted to:

    hello<lexev:entity name="mdash">—</lexev:entity>world


  • Preserving the doctype declaration:


    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

    is converted to processing instructions:

    <?doctype-public -//W3C//DTD XHTML 1.0 Transitional//EN?>
    <?doctype-system http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd?>


  • Marking up comments:


    <!-- a comment -->

    is converted to:

    <lexev:comment> a comment </lexev:comment>


To use LexEvXMLReader with Saxon:


java -cp saxon9.jar;LexEvXMLReader.jar net.sf.saxon.Transform -x:com.andrewjwelch.lexev.LexEvXMLReader input.xml stylesheet.xslt


Make sure LexEvXMLReader.jar is on the classpath, and then tell Saxon to use it with the -x switch (copy and paste this line -x:com.andrewjwelch.lexev.LexEvXMLReader)


To use LexEvXMLReader from Java:

XMLReader xmlReader = new LexEvXMLReader();
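
Once you have the reader, the standard JAXP way to make a transform use it is to wrap it in a SAXSource. The following is a rough sketch rather than anything shipped with LexEv: the class and method names are invented for the example, and defaultReader() only exists so the code runs without LexEvXMLReader.jar on the classpath. In real use you would pass new LexEvXMLReader() instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper, for illustration only.
class ReaderTransform {

    // Runs the stylesheet over the XML, parsing the XML with the given reader.
    static String transform(XMLReader reader, String xml, String xslt) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            // SAXSource is what ties a specific XMLReader to the input
            SAXSource source = new SAXSource(reader, new InputSource(new StringReader(xml)));
            StringWriter out = new StringWriter();
            t.transform(source, new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Stand-in for "new LexEvXMLReader()" so the sketch is self-contained.
    static XMLReader defaultReader() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```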


You can control the following features of LexEv:


  • enable/disable the marking up of entity references

  • enable/disable the marking up of CDATA sections

  • set the default namespace for the CDATA section markup

  • enable/disable the reporting of the DOCTYPE

  • enable/disable the marking up of comments


You can set these through the API (if you are including LexEv in an application), or from the command line using the following system properties:


  • com.andrewjwelch.lexev.inline-entities

  • com.andrewjwelch.lexev.cdata

  • com.andrewjwelch.doctype.cdataNamespace

  • com.andrewjwelch.lexev.doctype

  • com.andrewjwelch.lexev.comments


For example to set a system property from the command line you would use: -Dcom.andrewjwelch.lexev.comments=false


For support, suggestions and licensing, email lexev@andrewjwelch.com

Friday, July 18, 2008

Kernow 1.6.1

Kernow 1.6.1 (beta) is now available both as a download and via web start.

Notable things in this release:

- Line numbers on the editor panes in the sandboxes (thanks to a new version of Bounce). You might not think so, but getting line numbers down the side of the editor pane is really involved. It's like block indenting (pressing tab or shift-tab when a block of text is selected) in that it's very low level and requires a lot of coding. Why it's not an integral part of the editor pane I don't know...

- Improved the syntax-checking-as-you-type and highlighting, and added the ability to disable it.

- The output area is now also a JEditorPane using Bounce so it supports tag highlighting. This might slow things down because now it's an HTML document where every addition is inserted at the end of the document, instead of just appending to a JTextArea... if this proves to be A Bad Thing I'll revert it back to a plain old text area with plain text.

- You can now select which tabs are visible (in options -> tabs) so if you never use certain tabs (like Batch or Schematron) you can remove them.

- If you have Saxon SA you can use XML Schema 1.1 (options -> validation)

- Improved the parameters dialog to make it less fiddly to enter params

- Slight graphical tweaks and likely other things that I've forgotten...

Thursday, July 17, 2008

The Nimbus Look and Feel

This is "Nimbus" - the new look and feel that comes with Java 6 Update 10. This is a cross platform l&f which means it should look the same on all platforms. Kernow currently uses the "platform default" look and feel so it should look like a native app on the platform it's run on, but it's hard to make sure it looks right - often what looks ok on Windows will have obscured buttons on Linux... something I should've fixed but never did.

Anyway, what do you think?

Friday, July 11, 2008

Validating co-constraints in XML Schema 1.1 using xs:alternative

Rather than mess around with loads of assertions to check your co-constraints, XML Schema 1.1 introduces the xs:alternative instruction which allows you to change the type used to validate the element based on some condition. Instead of defining one type and then adding assertions to check the variations, just define one type per variation, then assign that type based on the condition.

To do this you first have to define a default type, then define types for each variation by restricting that type. To choose between them, use xs:alternative as a child of xs:element. Here's an example of a co-constraint - this and that are allowed based on the value of the type attribute of node - and how to validate it:
<root>
<node type="A">
<this/>
</node>
<node type="B">
<that/>
</node>
</root>

Here's the schema:
<xs:schema 
xmlns:xs="http://www.w3.org/2001/XMLSchema"
elementFormDefault="qualified">

<xs:element name="root" type="root"/>
<xs:element name="node" type="node">
<xs:alternative type="node-type-A" test="@type = 'A'"/>
<xs:alternative type="node-type-B" test="@type = 'B'"/>
</xs:element>

<xs:element name="this"/>
<xs:element name="that"/>

<xs:complexType name="root">
<xs:sequence>
<xs:element ref="node" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>

<!-- Base type -->
<xs:complexType name="node">
<xs:sequence>
<xs:any/>
</xs:sequence>
<xs:attribute name="type" type="allowed-node-types"/>
</xs:complexType>

<xs:simpleType name="allowed-node-types">
<xs:restriction base="xs:string">
<xs:enumeration value="A"/>
<xs:enumeration value="B"/>
</xs:restriction>
</xs:simpleType>

<!-- Type A -->
<xs:complexType name="node-type-A">
<xs:complexContent>
<xs:restriction base="node">
<xs:sequence>
<xs:element ref="this"/>
</xs:sequence>
</xs:restriction>
</xs:complexContent>
</xs:complexType>

<!-- Type B -->
<xs:complexType name="node-type-B">
<xs:complexContent>
<xs:restriction base="node">
<xs:sequence>
<xs:element ref="that"/>
</xs:sequence>
</xs:restriction>
</xs:complexContent>
</xs:complexType>
</xs:schema>

I really like this... schema 1.1 will be a joy to use.

Monday, June 16, 2008

XML Schema co-occurrence constraint workaround

Here's a potential workaround for the co-constraint problem in XML Schema.

Given some XML:

<elem type="typeA">
<typeA/>
</elem>

<elem type="typeB">
<typeB/>
</elem>

...the problem is you can't constrain the contents of <elem> based on the value of the type attribute.

You can do it though, if you add an xsi:type attribute to it to explicitly set its type:

<elem type="typeA" xsi:type="elem_typeA">
<typeA/>
</elem>

<elem type="typeB" xsi:type="elem_typeB">
<typeB/>
</elem>

with suitable type definitions in the schema:

<xs:complexType name="elem_typeA">
<xs:sequence>
<xs:element ref="typeA"/>
...

<xs:complexType name="elem_typeB">
<xs:sequence>
<xs:element ref="typeB"/>
...

...and when the XML is validated the relevant definition will be used.

This technique is far from ideal as it involves modifying the source, but only in a way which doesn't break it for anyone else. Given the various options for validating co-constraints, this could well be the most straightforward way (at least until 1.1 comes along).
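
To see the xsi:type switch working with an ordinary XML Schema 1.0 validator, here is a small sketch using the JDK's javax.xml.validation API. Several things in it are assumptions that fill in what the fragments above leave out: the wrapping root element, declaring elem as xs:anyType so that any named type may be substituted through xsi:type, the type attribute declarations, and the class and method names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class, for illustration only.
class XsiTypeCheck {

    // Assumed schema: elem is declared as xs:anyType and narrowed via xsi:type.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='root'><xs:complexType><xs:sequence>"
        + "<xs:element name='elem' type='xs:anyType' maxOccurs='unbounded'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "<xs:element name='typeA'/>"
        + "<xs:element name='typeB'/>"
        + "<xs:complexType name='elem_typeA'>"
        + "<xs:sequence><xs:element ref='typeA'/></xs:sequence>"
        + "<xs:attribute name='type' type='xs:string'/>"
        + "</xs:complexType>"
        + "<xs:complexType name='elem_typeB'>"
        + "<xs:sequence><xs:element ref='typeB'/></xs:sequence>"
        + "<xs:attribute name='type' type='xs:string'/>"
        + "</xs:complexType>"
        + "</xs:schema>";

    // Returns true if the instance validates, false on a validation error.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            // A broken schema is a bug in the sketch, not an invalid instance
            throw new IllegalStateException(e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

With this schema an elem marked xsi:type="elem_typeA" is accepted when it contains typeA and rejected when it contains typeB, which is the co-constraint the post is after.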

Friday, April 25, 2008

The Scrabble Reference

The Scrabble Reference is an ebook I've created which allows Scrabble players to easily check if words are legal, to suggest longer words or sub-anagrams given a word and to show what words can be made given some letters (up to 15 letters).

There are two versions, TWL and SOWPODS (TWL is used in the USA, Canada, Thailand and Israel, and SOWPODS in the rest of the world).

The ebook is in the Mobipocket format, which deals with running ebooks on all devices (PDAs, mobile phones, Blackberries etc), so you just need to transform the input to suitable Mobipocket markup, compile the ebook, and their client does the rest.

I must admit to not being interested in Scrabble, but of course I am interested in XSLT, and given the list of words allowed in Scrabble I thought I should turn them into a product - one that you can run on your phone seems perfect.

Generating the ebook was very straightforward - list of strings in, markup out... XSLT 2.0 is ideal for the task.

Relative paths and the document() function

A nice gotcha cropped up today on xsl-list...

Relative paths passed to the document() function are resolved against either the XML or the stylesheet depending on what is passed in: a node from the XML will mean the path is resolved against the XML, a string will mean it's resolved against the stylesheet.

The gotcha is this - if you modify this:

document(@path)

to this:

<xsl:variable name="path" select="@path" as="xs:string"/>
...
document($path)


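and @path contains a relative path, the value passed to document() is now a string rather than a node, so it's resolved against the stylesheet instead of the XML. You could get a document not found error or, worse, if your XML and XSLT are in the same directory, you won't notice...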
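
The effect is just ordinary relative URI resolution with a different base. A tiny sketch using java.net.URI shows the same relative path landing in two different places; the example.com URIs and the class name are made up for the illustration.

```java
import java.net.URI;

// Hypothetical demo class, for illustration only.
class DocumentBaseUri {

    // Resolves a relative path against a base URI, as document() does
    // against the base URI of either the source node or the stylesheet.
    static String resolve(String baseUri, String relativePath) {
        return URI.create(baseUri).resolve(relativePath).toString();
    }
}
```

Resolving lookup.xml against http://example.com/xml/input.xml gives http://example.com/xml/lookup.xml, while resolving it against http://example.com/xslt/style.xslt gives http://example.com/xslt/lookup.xml.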

Friday, February 29, 2008

XML Schema - element with text and attributes

For some reason I always forget how to define an element that contains only text but also has attributes. Perhaps it's because it's so verbose, or so non-intuitive for something so simple, who knows. Either way it's something that needs to be committed to memory...

So the element:

<foo bar="bar" baz="baz">some text</foo>

is described using:

<xs:complexType name="foo">
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="bar" type="xs:string"/>
<xs:attribute name="baz" type="xs:string"/>
</xs:extension>
</xs:simpleContent>
</xs:complexType>

nice!
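
A quick way to convince yourself the type does what it says is to validate a couple of instances against it. This is a sketch using the JDK's javax.xml.validation API; the global foo element declaration and the class and method names are additions for the example, not part of the type definition above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class, for illustration only.
class TextWithAttributesCheck {

    // The type from the post, plus an assumed element declaration that uses it.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='foo' type='foo'/>"
        + "<xs:complexType name='foo'>"
        + "<xs:simpleContent>"
        + "<xs:extension base='xs:string'>"
        + "<xs:attribute name='bar' type='xs:string'/>"
        + "<xs:attribute name='baz' type='xs:string'/>"
        + "</xs:extension>"
        + "</xs:simpleContent>"
        + "</xs:complexType>"
        + "</xs:schema>";

    // Returns true if the instance validates, false on a validation error.
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            // A broken schema is a bug in the sketch, not an invalid instance
            throw new IllegalStateException(e);
        }
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Text content with attributes passes, and so does an empty foo (an empty string is still an xs:string), but a child element inside foo is rejected.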

Wednesday, February 27, 2008

schema-aware.com

I've created a new website, schema-aware.com, which is intended to contain lots of examples of schema-aware XSLT and XQuery. I've started it off with half a dozen or so and hope to add more as time goes on.

I also intend to add a few articles about schema-aware transforms - how to run them from the command line, from Java, the various flags involved, how to write schemas to allow you to use the types in your XSLT etc... My intentions are good, we'll have to see how much I actually do.

Thursday, January 31, 2008

Kernow 1.6 beta

I've uploaded a new version of Kernow (1.6) which contains the rather nice "XSLT Sandbox" tab. This tab has the XML pane on the left, the XSLT pane on the right and a transform button... and that's it. It's intended for anyone who wants to quickly try something out without the hassle of files, the command line or starting up a proper IDE. It does error checking as you type and highlights any problems.

It's available as the usual download from Sourceforge, or through Java Web Start. If you already run the JWS version it should automatically update itself (any problems just re-install it). I've finally figured out the temperamental errors with the JWS version - it turns out the ant jars included with Kernow were already signed by a previous version and so weren't being signed again, but because they were marked as "lazy" in the jnlp the JWS version would start anyway. (You can tell if a jar has been signed by looking for *.SF and *.DSA in the META-INF directory.)

The other improvement I'm pleased to have sorted out is that kernow.config (where all of the settings and combobox history are saved) is now stored in a directory called .kernow in your user.home (which is one up from My Documents in XP). Previously it would've been stored on the desktop for the JWS version which is really annoying - sorry about that. As usual it was a 10 minute job, but just took a while to get around to.

I've also separated out the SOAP and eXist extension functions into a separate package, so there's no longer the need for the largish eXist.jar, xmldb.jar and log4j.jar jars to be part of the download.

I'll release a non-beta version in a few weeks if no bugs are reported, and I've got around to updating all of the documentation.

Parsing XML into Java 5 Enums

Often when parsing XML into pojos I just resort to writing my own SAX based parser. It can be long-winded, but I think it gives you the greatest flexibility and control over how you get from the XML to objects you can process.

One example is with Java 5's Enums, which are great. Given the kind of XML fragment where currencies are represented using internal codes:

<currency refid="001"/> <!-- 001 is Sterling -->
<currency refid="002"/> <!-- 002 is Euros -->
<currency refid="003"/> <!-- 003 is United States Dollars -->


You can represent each currency element with an Enum, which contains extra fields for the additional information:

public enum Currency {

GBP ("001", "GBP", "Sterling"),
EUR ("002", "EUR", "Euros"),
USD ("003", "USD", "United States Dollar");

private final String refId;
private final String code;
private final String desc;

Currency(String refId, String code, String desc) {
this.refId = refId;
this.code = code;
this.desc = desc;
}

public String refId() {
return refId;
}

public String code() {
return code;
}

public String desc() {
return desc;
}

// Returns the enum based on its property rather than its name.
// (This loop could possibly be replaced with a static map, but be aware
// that static member variables are initialized *after* the enum constants and
// therefore aren't available to the constructor, so you'd need a static block.)
public static Currency getTypeByRefId(String refId) {
for (Currency type : Currency.values()) {
if (type.refId().equals(refId)) {
return type;
}
}

throw new IllegalArgumentException("Don't have enum for: " + refId);
}
}


Notice how each enum calls its own constructor with the 3 parameters - the refId, the code, and the description.

You parse the XML into the enum by calling Currency.getTypeByRefId(String refId) passing in the @refid from the XML. The benefit of using the Enum is that you can then do things like:

if (currency.equals(Currency.GBP))

which is nice and clear, while at the same time being able to call currency.refId() and currency.desc() to get to the other values.

The drawback is that because static member variables are initialized after the enum constants, you can't fill a HashMap from the constructor for a faster lookup later (unless you fill it in a static block). Instead you have to loop through all known values() for the enum given a refId. Although it feels wrong to loop, the worst case is only the size of the enum so I don't think it's too bad.
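
For completeness, here is roughly what the static block variant could look like. It is a sketch rather than a recommendation: the BY_REF_ID map name is made up, and the enum is trimmed to the fields needed to show the lookup. The static block runs after all the constants have been constructed, so values() is safe to call there.

```java
import java.util.HashMap;
import java.util.Map;

enum Currency {

    GBP("001", "GBP", "Sterling"),
    EUR("002", "EUR", "Euros"),
    USD("003", "USD", "United States Dollar");

    // Filled in the static block below, once every constant exists.
    private static final Map<String, Currency> BY_REF_ID = new HashMap<String, Currency>();

    static {
        for (Currency type : values()) {
            BY_REF_ID.put(type.refId, type);
        }
    }

    private final String refId;
    private final String code;
    private final String desc;

    Currency(String refId, String code, String desc) {
        this.refId = refId;
        this.code = code;
        this.desc = desc;
    }

    public String refId() { return refId; }
    public String code() { return code; }
    public String desc() { return desc; }

    // Map lookup instead of looping over values() on every call
    public static Currency getTypeByRefId(String refId) {
        Currency type = BY_REF_ID.get(refId);
        if (type == null) {
            throw new IllegalArgumentException("Don't have enum for: " + refId);
        }
        return type;
    }
}
```

For three currencies the loop is just as good; the map only starts to matter when the enum is large or the lookup sits in a hot path.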

Tuesday, January 22, 2008

Portability of a stylesheet across schema-aware and non-schema-aware processors

I came across this today, which I thought was really cool and worth a post. It basically allows you to code a transform that is only schema-aware if a schema-aware processor is running it, otherwise it's just a standard transform.

In this case I want to do input and output validation, so first I sort out the schemas:

<xsl:import-schema schema-location="input.xsd"
    namespace="http://www.foo.com"
    use-when="system-property('xsl:is-schema-aware')='yes'"/>

<xsl:import-schema schema-location="output.xsd"
    use-when="system-property('xsl:is-schema-aware')='yes'"/>

Note the use-when...

Next define two root matching templates, one for schema-aware, one for basic:

<xsl:template match="/"
    use-when="system-property('xsl:is-schema-aware')='yes'"
    priority="2">
    
    <xsl:variable name="input" as="document-node()">
        <xsl:document validation="strict">
            <xsl:copy-of select="/"/>
        </xsl:document>
    </xsl:variable>
    
    <xsl:result-document validation="strict">
        <xsl:apply-templates select="$input/the-root-elem"/>
    </xsl:result-document>
    
</xsl:template>

<xsl:template match="/">
    <xsl:apply-templates select="the-root-elem"/>
</xsl:template>    
    
<xsl:template match="the-root-elem">
    ...
</xsl:template>

The root matching template for schema-aware processing uses xsl:document to validate the input, and xsl:result-document to validate the output. Validation can also be controlled from outside the transform, but this way forces it on.

I think this is great :)

The identity transform for XSLT 2.0

I was looking at the standard identity transform the other day and realised that for nodes other than elements, the call to apply-templates is redundant.

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

Also, although it might be intuitive to think that attributes have separate nodes for their name and value, they are in fact a single node that's copied in its entirety by xsl:copy.

I raised this on xsl-list and suggested separating out the attribute into a template of its own with just xsl:copy for its body:

<xsl:template match="node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="@*">
  <xsl:copy/>
</xsl:template>


Mike Kay suggested a more logical version would be:

<xsl:template match="element()">
  <xsl:copy>
    <xsl:apply-templates select="@*,node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="attribute()|text()|comment()|processing-instruction()">
  <xsl:copy/>
</xsl:template>

This turned out to be ideal for three reasons:

- the comma between @* and node() will mean the selected nodes will be processed in that order, removing the sorting and deduplication that takes place with union |
- apply-templates is only called when it will have an effect
- it's clearer that attributes are leaf nodes

So there it is... the identity transform for XSLT 2.0