Friday, February 23, 2007

A CSV to XML converter in XSLT 2.0

* Note: This transform has been updated http://andrewjwelch.com/code/xslt/csv/csv-to-xml_v2.html


I wrote a rudimentary csv to XML converter a while back which broke when the csv contained quoted values, eg foo, "foo, bar", bar Dealing with these quotes is surprisingly hard, especially when you take into account quotes are escaped by doubling them.

I raised it on xsl-list, and Abel Braaksma came up with a genious solution - the technique is to use both sides of analyze-string - read his post for the best explanation.

To keep the transform generic I've used an attribute instead of an element for the column names to cope with names that aren't valid QNames (for example ones that contain a space) - for my own use would add a function to convert names to valid QNames and then change <elem name="{...}"> to <xsl:element name="{fn:getQName(...)}"> as it generates nicer XML. I've also modified the non-matching-substring side of analyze-string to only return tokens that contain values.

So this sample input:

Col 1, Col 2, Col 3
foo, "foo,bar", "foo:""bar"""


...creates this output:

<root>
<row>
<elem name="Col 1">foo</elem>
<elem name="Col 2">foo,bar</elem>
<elem name="Col 3">foo:"bar"</elem>
</row>
</root>

Here's the finished transform:


<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:fn="fn"
exclude-result-prefixes="xs fn">

<xsl:output indent="yes" encoding="US-ASCII"/>

<xsl:param name="pathToCSV" select="'file:///c:/csv.csv'"/>

<xsl:function name="fn:getTokens" as="xs:string+">
<xsl:param name="str" as="xs:string"/>
<xsl:analyze-string regex='("[^"]*")+' select="$str">
<xsl:matching-substring>
<xsl:sequence select='replace(., "^""|""$|("")""", "$1")'/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:for-each select="tokenize(., '\s*,\s*')">
<xsl:sequence select="."/>
</xsl:for-each>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:function>

<xsl:template match="/" name="main">
<xsl:choose>
<xsl:when test="unparsed-text-available($pathToCSV)">
<xsl:variable name="csv" select="unparsed-text($pathToCSV)"/>
<xsl:variable name="lines" select="tokenize($csv, '&#xa;')" as="xs:string+"/>
<xsl:variable name="elemNames" select="fn:getTokens($lines[1])" as="xs:string+"/>
<root>
<xsl:for-each select="$lines[position() > 1]">
<row>
<xsl:variable name="lineItems" select="fn:getTokens(.)" as="xs:string+"/>

<xsl:for-each select="$elemNames">
<xsl:variable name="pos" select="position()"/>
<elem name="{.}">
<xsl:value-of select="$lineItems[$pos]"/>
</elem>
</xsl:for-each>
</row>
</xsl:for-each>
</root>
</xsl:when>
<xsl:otherwise>
<xsl:text>Cannot locate : </xsl:text>
<xsl:value-of select="$pathToCSV"/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

</xsl:stylesheet>

22 comments:

lou said...

I confess -- I'm a newbie when it comes to using XSLT. Can you describe the steps to run this XSLT? I have the .CSV file, but I don't know what the next step is.

Thanks

Neil said...

Andrew - this looks really good and exactly what I am looking for.

However I get an error when I try to run this as follows:

$ java -cp "c:\qdata\saxon\saxon8.jar" net.sf.saxon.Transform -it main csv2xml3
.xsl
Error on line 11 of file:/c:/qdata/Telco/csv2xml3.xsl:
XTTE0780: An empty sequence is not allowed as the result of function fn:getTo
ens()
Transformation failed: Run-time errors were reported


Any ideas what might cause that?

The line is the start of the function definition.

Andrew Welch said...

Possibly leading or trailing new lines causing empty lines to be processed, eg an extra CR on the end of the last line.

I should probably modify this to cope better...

Andrew Welch said...

By the way, make sure you use the latest version which is linked to at the top of the page, or just click here

Neil said...

Strange. The version on this page now works fine (not sure what I did wrong before) but the version on the page that you linked to gives this error:

$ java -cp saxon/saxon8.jar net.sf.saxon.Transform -it main csv2xml2.xsl
Error on line 9 of file:/c:/qdata/Telco/csv2xml2.xsl:
XPDY0002: The context item for axis step child::element(test.csv) is undefined

Transformation failed: Run-time errors were reported

The CSV file is:

Col 1, Col 2, Col 3
foo, "foo,bar", "foo:""bar"""
one, two, three

Thanks for your help. I will use the version I have working now.

Andrew Welch said...

That error suggests it's treating the path to the csv as an XPath and not a string, which is usually because the parameter value is not wrapped in single quotes.

The $pathToCSV parameter here should really be typed as xs:string which would give a much better error message. One more thing on the TODO list....

Andrew Welch said...

So just to be explicit - when you've changed the default parameter value you've deleted the single quotes too. You have to ensure that your new value (which ends in "test.csv") is wrapped in single quotes to make it a string. (or just supply the new value as a parameter to the transform, rather than edit the default)

If you wanted to type the parameter as I mentioned in my last comment then change the parameter definition to:

<xsl:param
name="pathToCSV"
select="'file:///c:/csv.csv'"
as="xs:string"/>

Adding the "as" attribute will ensure that $pathToCSV is always a single string and give a helpful error message if you forget the single quotes.

If you haven't already, take a look at Kernow which is a GUI for Saxon so you don't have to use the command line.

Neil said...

Thanks that was the problem and the change has fixed things and made the errors better.

I'll take a look at Kernow.

GD1226 said...

Is this only used to manually transform a csv to xml or can it be used or modified to run on a server? I have a website that executes .xslt pages and I want to include .csv files and read them from there - The error I'm getting now is:

'unparsed-text-available()' is an unknown XSLT function.

Any ideas if there is a solution? Thanks!

Andrew Welch said...

This is an XSLT 2.0 transform so it requires an XSLT 2.0 processor, such as Saxon 9.

Allen D'Souza said...

Hi Andrew,

We are looking for something that takes a detailed CSV and gives us a summarized XML based on some metadata.

For example, if we have a CSV that lists all the people in the US in the following format.

State, County, City, Name

We'd like to take this CSV and get an XML that shows for each "State" the "COUNT" of persons. "State" and "COUNT" would be parameterized.

What would be the most scalable approach to converting such CSV to a Grouped-XML?

Appreciate the help.

Andrew Welch said...

I would do it in 2 stages: convert the CSV to XML, then use XSLT 2.0 to do the grouping.

Allen D'Souza said...

Thanks Andrew. That's an option we considered but were wondering if the performance would take a hit, in case we used a Dynamic XSLT (with parameters). The grouping would be decided based on metadata and isn't static.

If performance would be an issue, is there a more performant way to do it?

Andrew Welch said...

I really wouldn't expect performance to be problem here.

I would go with the two stage approach, measure the performance, and then if it really is a problem explore other solutions.

For assistance with the grouping, maybe join the xsl-list (google it) and ask on there.

Alternatively, put together a set of requirements and I'll give you a quote :)

akusa said...

Hi Andrew:

If I want to use this program to batch convert some csv files, how do I do it?

Any help is appreciated

smougenot said...

Hi, nice work!
Unfortunately I can't manage to run it using :
- java 1.6.0_18 or 1.5.0_20
- kernow 1.6.1
- saxon 9.1.0.8

Whatever I use for the csv file path it always complain that the file can't be located.
ex : Cannot locate : file:///d:/csv.csv

I've the same problem using command line on saxon alone.

Could someone help me please.

Minh Duong said...

hi
thanks for you post. It help me very much. I just have one question.
My .csv file content UTF-8 character like: Jürgen
The xml will export successful when i remove this character.

Unknown said...
This comment has been removed by the author.
Unknown said...

Replace the encoding :
encoding="US-ASCII"
by
encoding="UTF-8"

It should help

Minh Duong said...
This comment has been removed by the author.
Minh Duong said...

I did it but it does not work.

My data in .csv file is
Datum;Tag;Ein;Aus;Total;Task;Arbeitsnotiz;Tag Total;Tagesnotiz
23.11.2011;Mi;18:00;18:00;00:00;Safety;Jürgen 1131 User Lib;;

It alway show "Cannot locate :" problem, but i remove "Jürgen" text, then it works.
Could someone help me please.

Minh Duong said...

I changed it like this:
xsl:output method="xml" encoding="utf-8"