Convert Docx to LaTeX!

  • March 24, 2011 18:34

Just stumbled across an interesting link with info on converting a Microsoft Docx file into a LaTeX file! Harri Kiiskinen over at http://pastcounts.wordpress.com/ wrote up an XSL stylesheet that matches elements in Microsoft’s OOXML format and prints out the corresponding LaTeX markup.

The full write-up is located here: http://pastcounts.wordpress.com/2011/03/22/using-xsl-to-convert-docx-to-latex/

First, you need to break open the .docx file. It basically is a simple zipped archive, so an ‘unzip testdoc.docx’ should do the trick; you’ll end up with several files and sub-directories, of which only the directory called ‘word’ is necessary for this test.
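
Since only the ‘word’ directory matters here, the extraction can be limited to it. The following is a minimal sketch, assuming Info-ZIP's unzip is installed; the helper name extract_word_dir and the optional target directory are illustrative and not part of the original write-up.

```shell
#!/bin/sh
# Sketch: extract only the word/ directory from a .docx archive.
# extract_word_dir is a hypothetical helper name used for illustration.
extract_word_dir() {
    docx=$1          # path to the .docx file, e.g. testdoc.docx
    dest=${2:-.}     # target directory (defaults to the current one)
    # -q: quiet, -o: overwrite without prompting, -d: extraction directory.
    # The quoted pattern limits extraction to archive members under word/.
    unzip -q -o "$docx" 'word/*' -d "$dest"
}

# Example call: extract_word_dir testdoc.docx testdoc_unzipped
```

Quoting the 'word/*' pattern keeps the shell from expanding it, so unzip itself matches it against the archive contents.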

Second, here’s the XSL transformation to save in a file:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
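xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

<!-- Emit plain text so no XML declaration or entity escaping ends up in the LaTeX -->
<xsl:output method="text"/>
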
<xsl:template match="/w:document">
\documentclass{article}
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="w:body">
\begin{document}
<xsl:apply-templates/>
\end{document}
</xsl:template>

<xsl:template match="w:p">
<xsl:apply-templates/><xsl:if test="position()!=last()"><xsl:text>

</xsl:text></xsl:if>
</xsl:template>

<xsl:template match="w:r">
<xsl:if test="w:footnoteReference"><xsl:text>\footnote{</xsl:text>
<xsl:call-template name="footnote">
<!-- Pass this run's own footnote id; //@w:id would pick the first w:id in the whole document -->
<xsl:with-param name="fid" select="w:footnoteReference/@w:id"/>
</xsl:call-template>
<xsl:text>}</xsl:text>
</xsl:if>
<xsl:if test="w:rPr/w:b"><xsl:text>\textbf{</xsl:text></xsl:if>
<xsl:call-template name="pastb"/>
<xsl:if test="w:rPr/w:b"><xsl:text>}</xsl:text></xsl:if>
</xsl:template>

<xsl:template name="pastb">
<xsl:if test="w:rPr/w:i"><xsl:text>\textit{</xsl:text></xsl:if>
<xsl:call-template name="pasti"/>
<xsl:if test="w:rPr/w:i"><xsl:text>}</xsl:text></xsl:if>
</xsl:template>

<xsl:template name="pasti">
<xsl:apply-templates select="w:t"/>
</xsl:template>

<xsl:template name="footnote">
<xsl:param name="fid"/>
<xsl:apply-templates select="document('footnotes.xml')/w:footnotes/w:footnote[@w:id=$fid]"/>
</xsl:template>

<xsl:template match="w:footnote">
<xsl:apply-templates select="w:p"/>
</xsl:template>

</xsl:stylesheet>

You can save that in a file called docxtolatex.xsl in the ‘word’ directory. Then, in that directory, run ‘xsltproc docxtolatex.xsl document.xml’, and you’ll have your screen full of the document, in LaTeX markup.
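
To keep the result instead of just printing it to the screen, the output can be redirected into a file. This is a minimal sketch, assuming docxtolatex.xsl, document.xml, and footnotes.xml all sit in the extracted ‘word’ directory as described above; the helper name transform_to_tex and the default output name document.tex are made up for illustration.

```shell
#!/bin/sh
# Sketch: run the stylesheet and save the LaTeX to a file.
# transform_to_tex is a hypothetical helper name used for illustration.
transform_to_tex() {
    word_dir=$1               # the extracted word/ directory
    out=${2:-document.tex}    # output file (illustrative default name)
    # Run xsltproc from inside word/, where the stylesheet and document.xml live.
    # The redirection is set up before the cd, so a relative output path
    # is resolved against the directory the function was called from.
    ( cd "$word_dir" && xsltproc docxtolatex.xsl document.xml ) > "$out"
}

# Example call: transform_to_tex testdoc_unzipped/word testdoc.tex
```

The stylesheet's document('footnotes.xml') call is resolved relative to the stylesheet itself, which is why it needs to be saved next to footnotes.xml in the ‘word’ directory.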

You’ll notice that this XSLT only converts bold, italics, and footnotes. But then again, that’s often all I need to convert…

So yeah, I’ll definitely use this to convert some Word docs that I’ve been wanting to push into LaTeX format. I also think I might do some additional research into tweaking this XSL so that *.docx files could potentially be converted to LaTeX in their entirety! :D

Also, in order to successfully post a copy of the XSL stylesheet above, I found myself needing a script to safely escape all the XML special characters as HTML entities. If you’re interested, here’s the script I just slapped together for doing this:

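#!/usr/bin/env php
<?php
// Print the given file with its markup characters converted to HTML entities.
if ($argc < 2) {
    fwrite(STDERR, "Usage: {$argv[0]} <file>\n");
    exit(1);
}
$handle = @fopen($argv[1], "r");
if (!$handle) {
    fwrite(STDERR, "Error: cannot open {$argv[1]}\n");
    exit(1);
}
// Read line by line so large files are not loaded into memory at once.
while (($buffer = fgets($handle, 4096)) !== false) {
    echo htmlentities($buffer);
}
// fgets() returned false before end of file: report the read error.
if (!feof($handle)) {
    fwrite(STDERR, "Error: unexpected fgets() fail\n");
}
fclose($handle);
?>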

Simply copy the above script into a PHP file, make it executable, and run it with an input file as an argument; it’ll spit out the entity-encoded version of whatever XML you give it. :)


1 Comment on Convert Docx to LaTeX!

  1. XSL to extract DOCX comments into plain text | Geoffrey Anderson - April 25, 2011 at 19:29

    [...] needed the comments in plain text….so I recalled the XSL for converting DOCX to LaTeX from my last post and wrote up a new stylesheet to extract comments. Hereeee it [...]
