Thoughts on XSLT

Thursday, July 17, 2008

The Nimbus Look and Feel

This is "Nimbus" - the new look and feel that comes with Java 6 Update 10. This is a cross platform l&f which means it should look the same on all platforms. Kernow currently uses the "platform default" look and feel so it should look like a native app on the platform it's run on, but it's hard to make sure it looks right - often what looks ok on Windows will have obscured buttons on Linux... something I should've fixed but never did.

Anyway, what do you think?

Friday, July 11, 2008

Validating co-constraints in XML Schema 1.1 using xs:alternative

Rather than mess around with loads of assertions to check your co-constraints, XML Schema 1.1 introduces the xs:alternative element which allows you to change the type used to validate an element based on some condition. Instead of defining one type and then adding assertions to check the variations, just define one type per variation, then assign that type based on the condition.

To do this you first have to define a default type, then define types for each variation by restricting that type. To choose between them, use xs:alternative as a child of xs:element. Here's an example of a co-constraint - this and that are allowed based on the value of the type attribute of node - and how to validate it:

<root>
  <node type="A">
    <this/> 
  </node>
  <node type="B">
    <that/> 
  </node>
</root>

Here's the schema:

<xs:schema 
  xmlns:xs="http://www.w3.org/2001/XMLSchema" 
  elementFormDefault="qualified">

  <xs:element name="root" type="root"/>
  <xs:element name="node" type="node">
    <xs:alternative type="node-type-A" test="@type = 'A'"/>  
    <xs:alternative type="node-type-B" test="@type = 'B'"/>  
  </xs:element>
  <xs:element name="this"/>
  <xs:element name="that"/>
  
  <xs:complexType name="root">
    <xs:sequence>
      <xs:element ref="node" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  
  <!-- Base type -->
  <xs:complexType name="node">
    <xs:sequence>
      <xs:any/>  
    </xs:sequence>
    <xs:attribute name="type" type="allowed-node-types"/>
  </xs:complexType>

  <xs:simpleType name="allowed-node-types">
    <xs:restriction base="xs:string">
      <xs:enumeration value="A"/>
      <xs:enumeration value="B"/>
    </xs:restriction>
  </xs:simpleType>      
  
  <!-- Type A -->
  <xs:complexType name="node-type-A">
    <xs:complexContent>
      <xs:restriction base="node">
        <xs:sequence>
          <xs:element ref="this"/>  
        </xs:sequence>  
      </xs:restriction>  
    </xs:complexContent>
  </xs:complexType>
  
  <!-- Type B -->
  <xs:complexType name="node-type-B">
    <xs:complexContent>
      <xs:restriction base="node">
        <xs:sequence>
          <xs:element ref="that"/>  
        </xs:sequence>  
      </xs:restriction>  
    </xs:complexContent>
  </xs:complexType>  
</xs:schema>

I really like this... schema 1.1 will be a joy to use.

Monday, June 16, 2008

XML Schema co-occurrence constraint workaround

Here's a potential workaround for the co-constraint problem in XML Schema.

Given some XML:


<elem type="typeA">
  <typeA/>
</elem>

<elem type="typeB">
  <typeB/>
</elem>

...the problem is you can't constrain the contents of <elem> based on the value of the type attribute.

You can do it though, if you add an xsi:type attribute to it to explicitly set its type:


<elem type="typeA" xsi:type="elem_typeA">
  <typeA/>
</elem>

<elem type="typeB" xsi:type="elem_typeB">
  <typeB/>
</elem>

with suitable type definitions in the schema:


<xs:complexType name="elem_typeA">
  <xs:sequence>
    <xs:element ref="typeA"/>
    ...

<xs:complexType name="elem_typeB">
  <xs:sequence>
    <xs:element ref="typeB"/>
    ...

...and when the XML is validated the relevant definition will be used.

This technique is far from ideal as it involves modifying the source, but only in a way which doesn't break it for anyone else. Given the various options for validating co-constraints, this could well be the most straightforward way (at least until 1.1 comes along).
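To see the effect of xsi:type end to end, here is a minimal Java sketch using the JDK's built-in XML Schema 1.0 validator. The schema is an assumption on my part: it declares elem with no type (so it defaults to xs:anyType), which means any named complex type can be selected with xsi:type. The class and method names are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class XsiTypeDemo {

    // Hypothetical schema: elem is xs:anyType, the two named types
    // constrain the content once xsi:type selects one of them.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='elem'/>"
        + "<xs:element name='typeA'/>"
        + "<xs:element name='typeB'/>"
        + "<xs:complexType name='elem_typeA'>"
        + "<xs:sequence><xs:element ref='typeA'/></xs:sequence>"
        + "<xs:attribute name='type' type='xs:string'/>"
        + "</xs:complexType>"
        + "<xs:complexType name='elem_typeB'>"
        + "<xs:sequence><xs:element ref='typeB'/></xs:sequence>"
        + "<xs:attribute name='type' type='xs:string'/>"
        + "</xs:complexType>"
        + "</xs:schema>";

    // Returns true if the XML validates against XSD, false on a validation error.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the default error handler throws on validation errors
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xsi = " xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'";
        // Content matches the type named by xsi:type
        System.out.println(isValid(
            "<elem type='typeA' xsi:type='elem_typeA'" + xsi + "><typeA/></elem>"));
        // Content does not match the type named by xsi:type
        System.out.println(isValid(
            "<elem type='typeA' xsi:type='elem_typeA'" + xsi + "><typeB/></elem>"));
    }
}
```

Note that nothing here ties the value of the type attribute to the xsi:type that was chosen, so the two can still disagree - that part remains the author's responsibility.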

Friday, April 25, 2008

The Scrabble Reference

The Scrabble Reference is an ebook I've created which allows Scrabble players to easily check if words are legal, to suggest longer words or sub-anagrams given a word and to show what words can be made given some letters (up to 15 letters).

There are two versions, TWL and SOWPODS (TWL is used in the USA, Canada, Thailand and Israel and SOWPODS in the rest of the world).

The ebook is in the Mobipocket format which deals with running the ebooks on all devices - PDAs, mobile phones, Blackberries etc so you just need to transform the input to suitable Mobipocket markup, compile the ebook and their client does the rest.

I must admit to not being interested in Scrabble, but of course I am interested in XSLT, and given the list of words allowed in Scrabble I thought I should turn them into a product - one that you can run on your phone seems perfect.

Generating the ebook was very straightforward - list of strings in, markup out... XSLT 2.0 is ideal for the task.

Relative paths and the document() function

A nice gotcha cropped up today on xsl-list...

Relative paths passed to the document() function are resolved against either the XML or the stylesheet depending on what is passed in: a node from the XML will mean the path is resolved against the XML, a string will mean it's resolved against the stylesheet.

The gotcha is this - if you modify this:

document(@path)

to this:

<xsl:variable name="path" select="@path" as="xs:string"/>
...
document($path)

and @path contains a relative path, then you could get a document not found error, or worse if your XML and XSLT are in the same directory, you won't notice...
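The difference comes down to which base URI the relative path is resolved against. As a rough illustration outside of XSLT, this Java sketch resolves the same relative path against two different bases using java.net.URI. The file locations are invented for the example.

```java
import java.net.URI;

class DocumentBaseUriDemo {

    // Resolve a relative reference against a base URI.
    static String resolve(String base, String relative) {
        return URI.create(base).resolve(relative).toString();
    }

    public static void main(String[] args) {
        // Hypothetical locations of the source document and the stylesheet
        String xmlBase  = "file:/data/input/source.xml";
        String xsltBase = "file:/stylesheets/transform.xslt";
        String path = "lookup.xml"; // the value of @path

        // document(@path): a node is passed, so the base is the XML
        System.out.println(resolve(xmlBase, path));
        // document($path) with $path as xs:string: the base is the stylesheet
        System.out.println(resolve(xsltBase, path));
    }
}
```

When the XML and the stylesheet share a directory both calls produce the same URI, which is why the problem can go unnoticed.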

Friday, February 29, 2008

XML Schema - element with text and attributes

For some reason I always forget how to define an element that contains only text but also has attributes. Perhaps it's because it's so verbose, or so non-intuitive for something so simple, who knows. Either way it's something that needs to be committed to memory...

So the element:


<foo bar="bar" baz="baz">some text</foo>

is described using:


<xs:complexType name="foo">
    <xs:simpleContent>
        <xs:extension base="xs:string">
            <xs:attribute name="bar" type="xs:string"/>
            <xs:attribute name="baz" type="xs:string"/>
        </xs:extension>
    </xs:simpleContent>                
</xs:complexType>

nice!

Wednesday, February 27, 2008

schema-aware.com

I've created a new website schema-aware.com which is intended to contain lots of examples of schema-aware XSLT and XQuery. I've started it off with half a dozen or so and hope to add more as time goes on.

I also intend to add a few articles about schema-aware transforms - how to run them from the command line, from Java, the various flags involved, how to write schemas to allow you to use the types in your XSLT etc... My intentions are good, we'll have to see how much I actually do.

Thursday, January 31, 2008

Kernow 1.6 beta

I've uploaded a new version of Kernow (1.6) which contains the rather nice "XSLT Sandbox" tab. This tab has the XML pane on the left, the XSLT pane on the right and a transform button... and that's it. It's intended for anyone who wants to quickly try something out without the hassle of files, the command line or starting up a proper IDE. It does error checking as you type and highlights any problems.

It's available as the usual download from Sourceforge, or through Java Web Start. If you already run the JWS version it should automatically update itself (any problems just re-install it). I've finally figured out the temperamental errors with the JWS version - it turns out the ant jars included with Kernow were already signed by a previous version and so weren't being signed again, but because they were marked as "lazy" in the jnlp the JWS version would start anyway. (You can tell if a jar has been signed by looking for *.SF and *.DSA in the META-INF directory.)
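The check for signature files can be automated. This is only a sketch of that idea and not part of Kernow: the class and method names are made up, and it also accepts *.RSA, which is the other common signature block extension. It only looks at entry names, so it tells you whether a jar looks signed, not whether the signature is valid.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.Locale;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

class JarSignatureCheck {

    // True for entries such as META-INF/MYKEY.SF or META-INF/MYKEY.DSA
    static boolean isSignatureEntry(String entryName) {
        String name = entryName.toUpperCase(Locale.ROOT);
        return name.startsWith("META-INF/")
            && (name.endsWith(".SF") || name.endsWith(".DSA") || name.endsWith(".RSA"));
    }

    // Scans the jar's entries for any signature file.
    static boolean looksSigned(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                if (isSignatureEntry(entries.nextElement().getName())) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JarSignatureCheck some.jar
        for (String path : args) {
            System.out.println(path + ": " + (looksSigned(path) ? "signed" : "not signed"));
        }
    }
}
```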

The other improvement I'm pleased to have sorted out is that kernow.config (where all of the settings and combobox history are saved) is now stored in a directory called .kernow in your user.home (which is one up from My Documents in XP). Previously it would've been stored on the desktop for the JWS version which is really annoying - sorry about that. As usual it was a 10 minute job, but just took a while to get around to.

I've also separated out the SOAP and eXist extension functions into a separate package, so there's no longer the need for the largish eXist.jar, xmldb.jar and log4j.jar jars to be part of the download.

I'll release a non-beta version in a few weeks if no bugs are reported and once I've got around to updating all of the documentation.

Parsing XML into Java 5 Enums

Often when parsing XML into pojos I just resort to writing my own SAX based parser. It can be long winded but I think gives you the greatest flexibility and control over how you get from the XML to objects you can process.

One example is with Java 5's Enums, which are great. Given the kind of XML fragment where currencies are represented using internal codes:

<currency refid="001"/> <!-- 001 is Sterling -->
<currency refid="002"/> <!-- 002 is Euros -->
<currency refid="003"/> <!-- 003 is United States Dollars -->

You can represent each currency element with an Enum, which contains extra fields for the additional information:

public enum Currency {

    GBP ("001", "GBP", "Sterling"),
    EUR ("002", "EUR", "Euros"),
    USD ("003", "USD", "United States Dollar");    

    private final String refId;
    private final String code;
    private final String desc;

    Currency(String refId, String code, String desc) {
        this.refId = refId;
        this.code = code;
        this.desc = desc;
    }

    public String refId() {
        return refId;
    }

    public String code() {
        return code;
    }

    public String desc() {
        return desc;
    }

    // Returns the enum based on its property rather than its name.
    // (This loop could possibly be replaced with a static map, but be aware
    //  that static member variables are initialized *after* the enum and therefore
    //  aren't available to the constructor, so you'd need a static block.)
    public static Currency getTypeByRefId(String refId) {
        for (Currency type : Currency.values()) {
            if (type.refId().equals(refId)) {
                return type;
            }
        }

        throw new IllegalArgumentException("Don't have enum for: " + refId);
    }    
}

Notice how each enum calls its own constructor with the 3 parameters - the refId, the code, and the description.

You parse the XML into the enum by calling Currency.getTypeByRefId(String refId) passing in the @refid from the XML. The benefit of using the Enum is that you can then do things like:

if (currency.equals(Currency.GBP))

which is nice and clear, while at the same time being able to call currency.refId() and currency.desc() to get to the other values.

The drawback is that because static member variables are initialized after the enum, you can't create a HashMap and fill it for a faster lookup later (unless you use a static block). Instead you have to loop through all known values() for the enum given a refId. Although it feels wrong to loop, the worst case is only the size of the enum so I don't think it's too bad.
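For completeness, here's roughly what the static block variant might look like. It's a trimmed-down sketch with only the refId field, and the name CurrencyCode is invented to keep it separate from the enum above. The map is filled in a static initializer, which runs after all of the enum constants have been constructed.

```java
import java.util.HashMap;
import java.util.Map;

enum CurrencyCode {

    GBP("001"),
    EUR("002"),
    USD("003");

    // Static members are initialized after the constants above,
    // so the map can safely be filled from values() here.
    private static final Map<String, CurrencyCode> BY_REF_ID =
        new HashMap<String, CurrencyCode>();

    static {
        for (CurrencyCode c : values()) {
            BY_REF_ID.put(c.refId, c);
        }
    }

    private final String refId;

    CurrencyCode(String refId) {
        this.refId = refId;
    }

    String refId() {
        return refId;
    }

    // Map lookup instead of looping over values()
    static CurrencyCode getTypeByRefId(String refId) {
        CurrencyCode c = BY_REF_ID.get(refId);
        if (c == null) {
            throw new IllegalArgumentException("Don't have enum for: " + refId);
        }
        return c;
    }
}
```

For an enum with a handful of values the loop is just as good; the map only starts to matter when there are many constants or the lookup is called very often.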

Tuesday, January 22, 2008

Portability of a stylesheet across schema-aware and non-schema-aware processors

I came across this today, which I thought was really cool and worth a post. It basically allows you to code a transform that is only schema-aware if a schema-aware processor is running it, otherwise it's just a standard transform.

In this case I want to do input and output validation, so first I sort out the schemas:

<xsl:import-schema schema-location="input.xsd"
    namespace="http://www.foo.com"
    use-when="system-property('xsl:is-schema-aware')='yes'"/>

<xsl:import-schema schema-location="output.xsd"
    use-when="system-property('xsl:is-schema-aware')='yes'"/>

Note the use-when...

Next define two root matching templates, one for schema-aware, one for basic:

<xsl:template match="/"
    use-when="system-property('xsl:is-schema-aware')='yes'"
    priority="2">

    <xsl:variable name="input" as="document-node()">
        <xsl:document validation="strict">
            <xsl:copy-of select="/"/>
        </xsl:document>
    </xsl:variable>

    <xsl:result-document validation="strict">
        <xsl:apply-templates select="$input/the-root-elem"/>
    </xsl:result-document>

</xsl:template>

<xsl:template match="/">
    <xsl:apply-templates select="the-root-elem"/>
</xsl:template>

<xsl:template match="the-root-elem">
    ...
</xsl:template>

The root matching template for schema-aware processing uses xsl:document to validate the input, and xsl:result-document to validate the output. Validation can also be controlled from outside the transform, but this way forces it on.

I think this is great :)

The identity transform for XSLT 2.0

I was looking at the standard identity transform the other day and realised that for nodes other than elements, the call to apply-templates is redundant.

<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

Also, although it might be intuitive to think that attributes have separate nodes for their name and value, they are in fact a single node that's copied in its entirety by xsl:copy.

I raised this on xsl-list and suggested separating out the attribute into a template of its own with just xsl:copy for its body:

<xsl:template match="node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="@*">
  <xsl:copy/>
</xsl:template>

Mike Kay suggested a more logical version would be:

<xsl:template match="element()">
  <xsl:copy>
    <xsl:apply-templates select="@*,node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="attribute()|text()|comment()|processing-instruction()">
  <xsl:copy/>
</xsl:template>

This turned out to be ideal for three reasons:

- the comma between @* and node() will mean the selected nodes will be processed in that order, removing the sorting and deduplication that takes place with union |
- apply-templates is only called when it will have an effect
- it's clearer that attributes are leaf nodes

So there it is... the identity transform for XSLT 2.0
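A quick way to convince yourself that moving attributes into their own xsl:copy template changes nothing is to run both stylesheets over the same input and compare the output. The Java sketch below does that with the JDK's built-in processor, which only supports XSLT 1.0, so it compares the classic identity transform with the split version; the element()/comma version needs a 2.0 processor such as Saxon and isn't run here. The class name and sample input are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class IdentityCompare {

    static final String HEAD =
          "<xsl:stylesheet version='1.0' "
        + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output omit-xml-declaration='yes'/>";

    // The standard identity transform
    static final String CLASSIC = HEAD
        + "<xsl:template match='@*|node()'><xsl:copy>"
        + "<xsl:apply-templates select='@*|node()'/>"
        + "</xsl:copy></xsl:template>"
        + "</xsl:stylesheet>";

    // Attributes handled by their own template with just xsl:copy
    static final String SPLIT = HEAD
        + "<xsl:template match='node()'><xsl:copy>"
        + "<xsl:apply-templates select='@*|node()'/>"
        + "</xsl:copy></xsl:template>"
        + "<xsl:template match='@*'><xsl:copy/></xsl:template>"
        + "</xsl:stylesheet>";

    // Runs the stylesheet over the XML and returns the serialized result.
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a x='1'><b y='2'>text</b><!--c--><?pi d?></a>";
        System.out.println(transform(CLASSIC, xml).equals(transform(SPLIT, xml)));
    }
}
```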

Friday, September 28, 2007

Kernow 1.5.2

I've just uploaded the non-beta version of Kernow 1.5.2.

This version contains:

- French and German translations
- XQuery syntax highlighting and checking as-you-type
- Improved cancelling of Single File and Standalone tasks
- icon and splash screen
- An exe to launch it (for windows users)
- context menus
- comboboxes remember their selected index
- individual combobox entries can be removed by deleting the entry
- other small fixes

Wednesday, September 12, 2007

Connecting to Oracle from XSLT

Today I generated a report by connecting directly to an Oracle database from XSLT, and thought I'd share the basic stylesheet. I used Saxon's SQL extension, which is available when saxon8-sql.jar is on the classpath. As I was connecting to Oracle, I also needed to put ojdbc14.jar on the classpath.

Here's the stylesheet in its most basic form, formatted for display in this blog.

The important things to note here are:

- The sql prefix is bound to "/net.sf.saxon.sql.SQLElementFactory"
- The driver is "oracle.jdbc.driver.OracleDriver"
- The connection string format is "jdbc:oracle:thin:@1.2.3.4:1234:sid" (note the colon between thin and @ - I missed that first time round) where the IP, port and sid are placeholders for the real values
- remember that saxon8-sql.jar and ojdbc14.jar needed to be on the classpath


<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:sql="/net.sf.saxon.sql.SQLElementFactory"
 exclude-result-prefixes="xs"
 extension-element-prefixes="sql">
 
<xsl:output indent="yes"/>

<xsl:param name="driver" 
  select="'oracle.jdbc.driver.OracleDriver'" 
  as="xs:string"/>

<xsl:param name="database" 
  select="'jdbc:oracle:thin:@123.123.123.123:1234:sid'" 
  as="xs:string"/>

<xsl:param name="user" select="'un'" as="xs:string"/>
<xsl:param name="password" select="'pw'" as="xs:string"/>

<xsl:variable name="connection"
  as="java:java.sql.Connection" 
  xmlns:java="http://saxon.sf.net/java-type">
 
  <sql:connect driver="{$driver}" database="{$database}" 
    user="{$user}" password="{$password}"/>
</xsl:variable>
 
<xsl:template match="/" name="main">
  <root>
    <sql:query connection="$connection" 
      table="some_table" 
      column="*"
      row-tag="row" 
      column-tag="col"/>
  </root>
</xsl:template>

</xsl:stylesheet>

The result of this transform outputs XML in the form:


<root>
  <row>
    <col>data1</col>
    <col>data2</col>
    <col>data3</col>
    <col>data4</col>
  </row>
  ....
</root>

where <root> is the wrapper element, and <row> and <col> are the element names specified in the <sql:query> element.

And that's it - connecting to an Oracle database from within XSLT.

Monday, September 03, 2007

Kernow 1.5.2 beta b2 available

I've just uploaded a new version of Kernow. This one was pretty much already available via Java Web Start, this makes it available via the normal download route.

New features/fixes:

- Added syntax highlighting and checking as-you-type to the XQuery Sandbox tab. Syntax highlighting's provided using Bounce's XMLEditorKit - I'm hoping to use Netbeans' nbEditorKit in a future version which will add line numbers, code completion etc. I've put together the checking-as-you-type and error highlighting using Saxon's error reporting. This is really cool, so I'm planning on doing an equivalent "XSLT Sandbox" soon... perhaps using Netbeans RCP. Not sure yet.

- Added an icon and splash screen. These came about because JWS and the exe benefit from them. Are they any good? I'm not really a graphics person...

- Kernow.jar is now a proper executable jar, so you can double click it to run Kernow (if you're on a mac for example)

- It's all compiled using Java 1.5, again for mac users where 1.6 isn't supported yet.

It's available here: Kernow

Thursday, August 30, 2007

Kernow now available via Java Web Start

I've been playing around with making Kernow available through Java Web Start. This should be the ideal way to run Kernow as it places a shortcut on your desktop (and in your start menu in Windows) and auto-updates whenever a new version is available.

Reading around it seems Java Web Start has had mixed reviews. Personally I really like it, perhaps because I'm using Java 1.6 and Netbeans 6 M10 which makes it all pretty straightforward (auto jar-signing is really helpful in M10).

Give it a go, let me know what you think: Kernow - Java Web Start

Friday, August 17, 2007

Using XQuery and the slash operator to quickly try out XPaths

This is the coolest thing I've seen in a while...

In XQuery you can construct a node just by writing it, eg:


<node>I'm a node</node>

and then you can use the slash operator to apply an XPath to that node:


<node>I'm a node</node>/data(.)

returns "I'm a node"

The XML doesn't have to be limited to a single node - you can do:


<root>
  <node>foo</node>
  <node>bar</node>
</root>/node/data(.)

...to get "foo bar".

Or:


<root>
  <node>foo</node>
  <node>bar</node>
</root>/node[1]

to get:


<node>foo</node>

Using this technique in combination with Kernow's XQuery Sandbox makes it straightforward to paste in some XML and start trying out some XPaths.

Thursday, August 16, 2007

When a = b and a != b both return true...

In XPath, = and != are general comparison operators: they act on sequences rather than single values. That is, they return true if any item on the left hand side returns true when compared with any item on the right hand side. Or in other words:

some x in $seqA, y in $seqB satisfies x op y

...where "op" is = or != (or > or < etc)

To demonstrate this take the two sets ('a', 'b') and ('b', 'c'):

$seqA = $seqB returns true because both sequences contain 'b'

$seqA != $seqB returns true because $seqA contains 'a' which is not equal to 'c' in $seqB

This catches me out a lot, even though I've been caught out before several times. I really have to think hard about what it is exactly that I'm comparing, and still end up getting it wrong.
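If it helps to see the "some x, some y" rule spelled out in an imperative language, here is a small Java sketch of it. It's only an illustration of the semantics described above, not how any XPath processor is implemented, and all of the names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

class GeneralComparison {

    // some x in a, y in b satisfies op(x, y)
    static <T> boolean some(List<T> a, List<T> b, BiPredicate<T, T> op) {
        for (T x : a) {
            for (T y : b) {
                if (op.test(x, y)) {
                    return true;
                }
            }
        }
        return false;
    }

    // The XPath "=" operator on two sequences of strings
    static boolean generalEq(List<String> a, List<String> b) {
        return some(a, b, (x, y) -> x.equals(y));
    }

    // The XPath "!=" operator on two sequences of strings
    static boolean generalNe(List<String> a, List<String> b) {
        return some(a, b, (x, y) -> !x.equals(y));
    }

    public static void main(String[] args) {
        List<String> seqA = Arrays.asList("a", "b");
        List<String> seqB = Arrays.asList("b", "c");
        System.out.println(generalEq(seqA, seqB));  // some pair is equal
        System.out.println(generalNe(seqA, seqB));  // some pair is different
        System.out.println(!generalEq(seqA, seqB)); // not(a = b): no pair is equal
    }
}
```

The last line is the one that usually matches the intent: not($seqA = $seqB) is only true when the two sequences have nothing in common.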

A simple rule to follow is "never use != where both sides are sequences of more than one item". 99.9% of the time you won't need to, as much as it feels like the right thing to do.

Below are some of the most common operations on sequences, put together for a reference.

The two sequences are ('a', 'b') and ('b', 'c'), which can be defined in XSLT as:

<xsl:variable name="seqA" select="('a', 'b')" as="xs:string+"/>
<xsl:variable name="seqB" select="('b', 'c')" as="xs:string+"/>

or in XQuery as:

let $seqA := ('a', 'b')
let $seqB := ('b', 'c')

Select all items in both sequences

($seqA, $seqB)

Result: a b b c

Select all items in both sequences, eliminating duplicates

distinct-values(($seqA, $seqB))

Result: a b c

Select all items that occur in $seqA but not $seqB

$seqA[not(. = $seqB)]

Result: a

Select all items that occur in both sequences

$seqA[. = $seqB]

Result: b

Select all items that do not occur in both sequences

($seqA[not(. = $seqB)],$seqB[not(. = $seqA)])

or

($seqA, $seqB)[not(. = $seqA[. = $seqB])]

Result: a c

Determine if both sequences are identical

deep-equal($seqA, $seqB)

Result: false

Test if all items in the sequence are different

count(distinct-values($seqA)) eq count($seqA)

Result: true

Wednesday, August 15, 2007

The World's Fastest Sudoku Solution in XSLT 2.0 for the World's Hardest Sudoku - Al Escargot

I get a lot of traffic to this blog because of my Sudoku solver. Google analytics tells me most of it lands on the original version that I wrote, and not the optimised version that's now the world's fastest XSLT 2.0 solution - an issue which this post should hopefully rectify.

The puzzle on the right is apparently the world's hardest Sudoku puzzle. If I run my solver "Sudoku.xslt" using Kernow's performance testing feature I get this result:

Ran 5 times
Run 1: 328 ms
Run 2: 344 ms
Run 3: 328 ms
Run 4: 391 ms
Run 5: 328 ms
Ignoring first 2 times
Total Time (last 3 runs): 1 second 47 ms
Average Time (last 3 runs): 349 ms

So on my machine the average execution time is 349ms, which is pretty good considering the original version would take minutes for several puzzles. As far as I know this version will solve all puzzles in under a second on my machine (Core 2 duo E6600, 2gb).

How does it do it? This is taken from the web page where it's hosted:

It accepts the puzzle as 81 comma separated integers in the range 0 to 9, with zero representing empty. It works by continuously reducing the number of possible values for each cell, and only when the possible values can't be reduced any further it starts backtracking.

The first phase attempts to populate as many cells of the board based on the start values. For each empty cell it works out the possible values using the "Naked Single", "Hidden Single" and "Naked Tuples" techniques in that order (read here for more on the techniques). Cells where only one possible value exists are populated and then the second phase begins.

The second phase follows this process:

* Find all empty cells and get all the possible values for each cell (using Naked Single and Hidden Single techniques)
* Sort the cells by least possible values first
* Populate the cells with only one possible value
* If there's more than one value, go through them one by one
* Repeat
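The loop above can be sketched in a few lines of Java. To be clear, this is not a port of Sudoku.xslt: it only eliminates values already used in the same row, column and group (so roughly the Naked Single check, with no Hidden Single or Naked Tuples), and it uses 0-based indexes where the log output uses 1-based. The class and method names are made up, and the puzzle in main is my transcription of Al Escargot, which agrees with the solution grid shown further down.

```java
import java.util.ArrayList;
import java.util.List;

class SudokuSketch {

    // Parse 81 comma separated integers, 0 meaning empty
    static int[] parse(String csv) {
        String[] parts = csv.split(",");
        if (parts.length != 81) throw new IllegalArgumentException("Need 81 values");
        int[] board = new int[81];
        for (int i = 0; i < 81; i++) board[i] = Integer.parseInt(parts[i].trim());
        return board;
    }

    // Possible values for a cell: 1-9 minus those used in its row, column and group
    static List<Integer> candidates(int[] board, int index) {
        boolean[] used = new boolean[10];
        int row = index / 9, col = index % 9;
        int groupStart = (row / 3) * 27 + (col / 3) * 3;
        for (int i = 0; i < 9; i++) {
            used[board[row * 9 + i]] = true;
            used[board[i * 9 + col]] = true;
            used[board[groupStart + (i / 3) * 9 + (i % 3)]] = true;
        }
        List<Integer> result = new ArrayList<Integer>();
        for (int v = 1; v <= 9; v++) if (!used[v]) result.add(v);
        return result;
    }

    // Returns a solved copy of the board, or null if this branch is a dead end
    static int[] solve(int[] board) {
        int[] b = board.clone();
        while (true) {
            int best = -1;
            List<Integer> bestCandidates = null;
            boolean filled = false;
            for (int i = 0; i < 81; i++) {
                if (b[i] != 0) continue;
                List<Integer> c = candidates(b, i);
                if (c.isEmpty()) return null;           // cannot go any further
                if (c.size() == 1) {                    // only one value: populate it
                    b[i] = c.get(0);
                    filled = true;
                } else if (bestCandidates == null || c.size() < bestCandidates.size()) {
                    best = i;                           // fewest possible values so far
                    bestCandidates = c;
                }
            }
            if (filled) continue;                       // board changed, so repeat
            if (best == -1) return b;                   // no empty cells left: done
            for (int v : bestCandidates) {              // try each value in turn
                b[best] = v;
                int[] result = solve(b);
                if (result != null) return result;
            }
            return null;
        }
    }

    public static void main(String[] args) {
        int[] solved = solve(parse(
              "1,0,0,0,0,7,0,9,0, 0,3,0,0,2,0,0,0,8, 0,0,9,6,0,0,5,0,0,"
            + "0,0,5,3,0,0,9,0,0, 0,1,0,0,8,0,0,0,2, 6,0,0,0,0,4,0,0,0,"
            + "3,0,0,0,0,0,0,1,0, 0,4,0,0,0,0,0,0,7, 0,0,7,0,0,0,3,0,0"));
        for (int i = 0; i < 81; i++) {
            System.out.print(solved[i] + (i % 9 == 8 ? "\n" : ", "));
        }
    }
}
```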

This is how it solves the Al Escargot: A slightly modified version of the solution gives this output with the $verbose parameter set to true. As you can see it's found that it can insert a 1 at position 66 using the static analysis of the puzzle (position 66 is middle-right of the bottom-left group). Next it's decided that there are two possible values 4 and 6 at index 12 (middle-right cell of top-left group), so it tries 4 and continues. With that 4 in place it's found that there's only one possible value at index 39, a 3, so it inserts that and continues. It will keep reducing the possible values based on the current state of the board, inserting the only possible values or trying each one when there are many, until either there are no possible values for an empty cell, or the puzzle is solved.

(the solution is shown below)

Populated single value cell at index 66 with 1
Trying 4 out of a possible 4 6 at index 12
Only one value 3 for index 39
Trying 5 out of a possible 5 7 at index 10
Trying 1 out of a possible 1 9 at index 13
Only one value 9 for index 15
Trying 6 out of a possible 6 7 at index 16
Only one value 7 for index 17
Trying 2 out of a possible 2 4 at index 7
Trying 6 out of a possible 6 8 at index 2
Only one value 8 for index 3
Only one value 2 for index 48
Only one value 6 for index 57
Only one value 5 for index 60
Only one value 9 for index 63
Only one value 5 for index 74
Only one value 1 for index 78
Only one value 3 for index 69
Only one value 5 for index 71
Only one value 6 for index 81
Only one value 4 for index 36
! Cannot go any further !
Trying 8 out of a possible 8 at index 2
Only one value 6 for index 3
Trying 4 out of a possible 4 5 at index 4
Only one value 3 for index 9
Only one value 3 for index 23
Only one value 4 for index 26
Only one value 3 for index 53
Only one value 5 for index 54
Only one value 4 for index 36
Only one value 6 for index 44
Only one value 8 for index 48
Only one value 2 for index 57
Only one value 8 for index 73
Only one value 9 for index 47
Only one value 7 for index 50
Only one value 6 for index 60
Only one value 9 for index 64
! Cannot go any further !
Trying 5 out of a possible 5 at index 4
Only one value 9 for index 47
Only one value 2 for index 67
Only one value 7 for index 49
Only one value 9 for index 64
Only one value 2 for index 80
Only one value 1 for index 32
Only one value 2 for index 48
Only one value 5 for index 50
Only one value 8 for index 58
Only one value 8 for index 73
! Cannot go any further !
Trying 4 out of a possible 4 at index 7
Only one value 3 for index 9
Only one value 4 for index 23
Only one value 3 for index 24
Only one value 2 for index 26
Only one value 3 for index 53
Only one value 5 for index 54
Only one value 8 for index 4
Trying 2 out of a possible 2 6 at index 2
Only one value 6 for index 3
Trying 7 out of a possible 7 8 at index 19
Only one value 8 for index 20
Only one value 7 for index 29
Only one value 9 for index 40
Only one value 7 for index 50
Only one value 6 for index 44
Only one value 2 for index 49
Only one value 2 for index 28
Only one value 8 for index 48
Only one value 2 for index 57
Only one value 5 for index 67
Only one value 7 for index 58
Only one value 8 for index 61
Only one value 3 for index 68
Only one value 2 for index 70
! Cannot go any further !
Trying 8 out of a possible 8 at index 19
Only one value 7 for index 20
Only one value 8 for index 29
Only one value 9 for index 47
Only one value 2 for index 48
Only one value 8 for index 57
Only one value 4 for index 37
Only one value 9 for index 40
Only one value 7 for index 49
! Cannot go any further !
Trying 6 out of a possible 6 at index 2
Only one value 2 for index 3
Only one value 6 for index 57
Trying 7 out of a possible 7 8 at index 19
Only one value 8 for index 20
Trying 2 out of a possible 2 4 at index 28
Only one value 7 for index 29
Only one value 9 for index 47
Only one value 2 for index 49
Only one value 9 for index 40
Only one value 6 for index 44
Only one value 7 for index 50
Only one value 9 for index 59
! Cannot go any further !
Trying 4 out of a possible 4 at index 28
Only one value 9 for index 37
Only one value 4 for index 44
Only one value 6 for index 42
Trying 2 out of a possible 2 7 at index 29
Only one value 7 for index 47
Only one value 2 for index 49
Only one value 7 for index 32
Only one value 9 for index 50
! Cannot go any further !
Trying 7 out of a possible 7 at index 29
Only one value 1 for index 32
Only one value 2 for index 33
Only one value 2 for index 47
Trying 7 out of a possible 7 9 at index 49
Only one value 9 for index 50
Only one value 6 for index 77
Only one value 1 for index 78
Only one value 5 for index 80
Only one value 5 for index 56
Only one value 8 for index 60
Only one value 2 for index 61
Only one value 8 for index 70
Only one value 6 for index 71
Only one value 9 for index 74
Only one value 2 for index 64
Only one value 4 for index 81
Only one value 4 for index 58
Only one value 9 for index 67
Only one value 8 for index 73
Only one value 2 for index 76
Done!

1, 6, 2,   8, 5, 7,   4, 9, 3,
5, 3, 4,   1, 2, 9,   6, 7, 8,
7, 8, 9,   6, 4, 3,   5, 2, 1,

4, 7, 5,   3, 1, 2,   9, 8, 6,
9, 1, 3,   5, 8, 6,   7, 4, 2,
6, 2, 8,   7, 9, 4,   1, 3, 5,

3, 5, 6,   4, 7, 8,   2, 1, 9,
2, 4, 1,   9, 3, 5,   8, 6, 7,
8, 9, 7,   2, 6, 1,   3, 5, 4,

If you have a solution that can statically detect more cells to fill using different techniques than I have, or has a better strategy than simply backtracking when there's more than one value, then I'd be interested to know how it works.

I'm pretty sure the XSLT is as good as it can be, but if you think it can be improved in any way then let me know.

Monday, July 23, 2007

Combining XSLT 2.0's Grouping with eXist

I work a lot with large XML datasets that are arranged as thousands of 1 - 10mb XML files. I spend most of my days writing transforms and transformation pipelines to process these files, which is where Kernow came from. I also like messing around with eXist (I'm yet to use it commercially, but I hope to one day) and enjoying the speed a native XML database gives you.

Requirements that regularly come up are to generate indexes and reports for the dataset. This is nice and simple using XSLT 2.0's grouping but requires the whole dataset to be in memory, unless you use saxon:discard-document(). It can also be quite slow, if only because you have to read GBs from disk and parse the whole of each and every XML input file to just get the snippet that you're interested in (such as the title, or say all of the elements).

Conversely, XQuery doesn't suffer from the dataset size but lacks XSLT 2.0's grouping features. It's perfectly possible (although a bit involved - you could say "a bit XSLT 1.0") to recreate the grouping in XQuery, but it's just so much nicer in XSLT 2.0. So to get the best of both, you can use eXist's fantastic REST style interface to select the parts of the XML you're interested in, and then use XSLT 2.0's for-each-group to arrange the results.

In the example stylesheet below I create an index by getting the <title> for each XML document, and then grouping the titles by their first letter, then sorting by title itself. I use eXist to get the <title> element, then XSLT 2.0 to do the sorting and grouping.

I have an instance of eXist running on my local machine and fully populated with the XML dataset. The function fn:eXist() takes the collection I'm interested in and the XQuery to execute against that collection, constructs the correct URI for the REST interface and calls doc() with that URI. The result is a proprietary XML format containing each tuple that I then group using xsl:for-each-group. It's worth noting the -1 value for the _howmany parameter on the query - without this it defaults to 10.

<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fn="fn"
xmlns:exist="http://exist.sourceforge.net/NS/exist"
version="2.0">

<xsl:output indent="yes" />

<xsl:param name="db-uri" select="'http://localhost:8080/exist/rest'" />

<xsl:function name="fn:eXist">
    <xsl:param name="collection" />
    <xsl:param name="query" />
    <xsl:sequence select="doc(concat($db-uri, $collection, '?_query=', $query, '&amp;_start=1&amp;_howmany=-1'))/exist:result/node()" />
</xsl:function>

<xsl:template match="/">
    <div>
        <xsl:for-each-group select="fn:eXist('/db/mycomp/myproject', '/doc/head/title')" group-by="substring(., 1, 1)">
            <xsl:sort select="." />
            <div>
                <div><xsl:value-of select="current-grouping-key()" /></div>
                <xsl:for-each select="current-group()">
                    <xsl:sort select="." />
                    <div><xsl:value-of select="." /></div>
                </xsl:for-each>
            </div>
        </xsl:for-each-group>
    </div>
</xsl:template>

</xsl:stylesheet>

It's as simple as that... what would normally take minutes takes seconds (once the database setup is done). If you haven't used eXist yet I highly recommend it.
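For anyone more at home in Java, the two halves of the stylesheet (building the REST URI, then grouping by first letter) might look something like the sketch below. It makes no network call; the titles are passed in as a plain list. The names are invented, the query is URL-encoded here (which the stylesheet doesn't do), and the groups are ordered by their key rather than by the first title in each group.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class ExistIndexSketch {

    // Same shape as the URI built by fn:eXist(), with _howmany=-1 to get all results
    static String restUri(String dbUri, String collection, String query) {
        try {
            return dbUri + collection
                + "?_query=" + URLEncoder.encode(query, "UTF-8")
                + "&_start=1&_howmany=-1";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Group titles by first letter, groups and titles both sorted
    static SortedMap<String, List<String>> groupByFirstLetter(List<String> titles) {
        SortedMap<String, List<String>> groups = new TreeMap<String, List<String>>();
        for (String title : titles) {
            String key = title.isEmpty() ? "" : title.substring(0, 1);
            groups.computeIfAbsent(key, k -> new ArrayList<String>()).add(title);
        }
        for (List<String> group : groups.values()) {
            Collections.sort(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(restUri("http://localhost:8080/exist/rest",
            "/db/mycomp/myproject", "/doc/head/title"));
        // Hypothetical titles standing in for the query result
        System.out.println(groupByFirstLetter(Arrays.asList("Beta", "Alpha", "Apple")));
    }
}
```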

This article is repeated here

Wednesday, July 11, 2007

CSV to XML transform updated

I've posted a new version of the CSV to XML transform. This version handles nested quotes correctly - the previous version would generate extra tokens either side of the quoted value.