Convert Docx to LaTeX!

Just stumbled across an interesting link with info on converting a Microsoft .docx file into a LaTeX file! Harri Kiiskinen over at http://pastcounts.wordpress.com/ wrote up an XSL stylesheet that matches elements in Microsoft's OOXML format and prints out the corresponding LaTeX markup.

The full write-up is here: http://pastcounts.wordpress.com/2011/03/22/using-xsl-to-convert-docx-to-latex/

First, you need to break open the .docx file. It basically is a simple zipped archive, so an ‘unzip testdoc.docx’ should do the trick; you’ll end up with several files and sub-directories, of which only the directory called ‘word’ is necessary for this test.
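If `unzip` isn't handy, the same step can be sketched with Python's standard `zipfile` module. This is only a minimal sketch: the function name `extract_word_dir` and the destination argument are my own assumptions for illustration, not part of Harri's write-up.

```python
import zipfile
from pathlib import Path


def extract_word_dir(docx_path, dest_dir):
    """Extract only the 'word/' entries of a .docx archive into dest_dir.

    Returns the sorted list of archive member names that were extracted.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(docx_path) as archive:
        # A .docx is a plain zip archive; keep just the 'word' directory.
        members = sorted(
            name for name in archive.namelist() if name.startswith("word/")
        )
        archive.extractall(dest, members=members)
    return members

# Example call (assumes testdoc.docx is in the current directory):
#   extract_word_dir("testdoc.docx", ".")
```

Extracting only the `word/` members keeps the working directory free of the other parts of the archive, which this test doesn't need.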

Second, here’s the XSL transformation to save in a file:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">

<xsl:template match="/w:document">
\documentclass{article}
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="w:body">
\begin{document}
<xsl:apply-templates/>
\end{document}
</xsl:template>

<xsl:template match="w:p">
<xsl:apply-templates/><xsl:if test="position()!=last()"><xsl:text>

</xsl:text></xsl:if>
</xsl:template>

<xsl:template match="w:r">
<!-- Take the id from this run's own footnoteReference. The original "//@w:id"
     always yielded the first w:id in the document, so every footnote repeated
     the same text. -->
<xsl:if test="w:footnoteReference"><xsl:text>\footnote{</xsl:text>
<xsl:call-template name="footnote">
<xsl:with-param name="fid"><xsl:value-of select="w:footnoteReference/@w:id"/></xsl:with-param>
</xsl:call-template>
<xsl:text>}</xsl:text>
</xsl:if>
<xsl:if test="w:rPr/w:b"><xsl:text>\textbf{</xsl:text></xsl:if>
<xsl:call-template name="pastb"/>
<xsl:if test="w:rPr/w:b"><xsl:text>}</xsl:text></xsl:if>
</xsl:template>

<xsl:template name="pastb">
<xsl:if test="w:rPr/w:i"><xsl:text>\textit{</xsl:text></xsl:if>
<xsl:call-template name="pasti"/>
<xsl:if test="w:rPr/w:i"><xsl:text>}</xsl:text></xsl:if>
</xsl:template>

<xsl:template name="pasti">
<xsl:apply-templates select="w:t"/>
</xsl:template>

<xsl:template name="footnote">
<xsl:param name="fid"/>
<xsl:apply-templates select="document('footnotes.xml')/w:footnotes/w:footnote[@w:id=$fid]"/>
</xsl:template>

<xsl:template match="w:footnote">
<xsl:apply-templates select="w:p"/>
</xsl:template>

</xsl:stylesheet>

You can save that in a file called docxtolatex.xsl in the ‘word’ directory. Then, in that directory, run ‘xsltproc docxtolatex.xsl document.xml’, and you’ll have your screen full of the document, in LaTeX markup.
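If you'd rather have the result in a file than scrolling past on the screen, here's a small Python wrapper around the same command. It's a sketch under assumptions: the function names and the `document.tex` output name are hypothetical, and it relies on xsltproc's `--output` option.

```python
import shutil
import subprocess


def xsltproc_command(stylesheet, source, output):
    """Build the xsltproc argument list that writes the result to `output`."""
    return ["xsltproc", "--output", output, stylesheet, source]


def convert(stylesheet="docxtolatex.xsl", source="document.xml",
            output="document.tex"):
    """Run xsltproc from the current directory (expected to be 'word')."""
    if shutil.which("xsltproc") is None:
        raise RuntimeError("xsltproc was not found on PATH")
    # check=True raises CalledProcessError if the transformation fails.
    subprocess.run(xsltproc_command(stylesheet, source, output), check=True)
    return output
```

Run it from inside the 'word' directory, since the stylesheet loads `footnotes.xml` by relative path.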

You’ll notice that this XSLT converts only bold, italics, and footnotes. But then again, that’s often all I need to convert…
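To make the logic a bit more concrete, here's a rough Python rendering of the same three conversions using only the standard library. It's a sketch, not a port: the function names are hypothetical, it looks only at runs that are direct children of a paragraph, and it does no escaping of LaTeX special characters.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def run_to_latex(run, footnotes):
    """Convert one w:r element: footnote reference first, then styled text."""
    out = ""
    ref = run.find("w:footnoteReference", NS)
    if ref is not None:
        # Look the note up by the id on this run's own reference.
        fid = ref.get(f"{{{W}}}id")
        out += "\\footnote{" + footnotes.get(fid, "") + "}"
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    # Italic is wrapped inside bold, matching the nesting in the stylesheet.
    if run.find("w:rPr/w:i", NS) is not None:
        text = "\\textit{" + text + "}"
    if run.find("w:rPr/w:b", NS) is not None:
        text = "\\textbf{" + text + "}"
    return out + text


def paragraphs_to_latex(parent, footnotes):
    """Convert the w:p children of `parent`, separated by blank lines."""
    paragraphs = []
    for p in parent.findall("w:p", NS):
        runs = p.findall("w:r", NS)
        paragraphs.append("".join(run_to_latex(r, footnotes) for r in runs))
    return "\n\n".join(paragraphs)


def document_to_latex(document_xml, footnotes_xml=None):
    """Convert document.xml (and optionally footnotes.xml) content to LaTeX."""
    footnotes = {}
    if footnotes_xml:
        for fn in ET.fromstring(footnotes_xml).findall("w:footnote", NS):
            footnotes[fn.get(f"{{{W}}}id")] = paragraphs_to_latex(fn, {})
    body = ET.fromstring(document_xml).find("w:body", NS)
    return ("\\documentclass{article}\n\\begin{document}\n"
            + paragraphs_to_latex(body, footnotes)
            + "\n\\end{document}\n")
```

Building a dictionary of footnotes keyed by `w:id` up front means each reference is a simple lookup, which is the same idea as the `[@w:id=$fid]` predicate in the stylesheet.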

So yeah, I’ll definitely use this to convert some Word docs that I’ve been wanting to push into LaTeX. I also think I might do some additional research into extending this XSL so that *.docx files could potentially be converted to LaTeX in their entirety! 😀

Also: in order to post a copy of the XSL stylesheet above, I needed a script to safely escape all the XML special characters as entities. If you’re interested, here’s the script I slapped together for this:

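#!/usr/bin/env php
<?php
// Print the given file with HTML-special characters converted to entities.
if ($argc < 2) {
    fwrite(STDERR, "Usage: {$argv[0]} <input-file>\n");
    exit(1);
}
$handle = @fopen($argv[1], "r");
if ($handle === false) {
    fwrite(STDERR, "Error: cannot open {$argv[1]}\n");
    exit(1);
}
// No length argument: fgets() then returns whole lines, so a multibyte
// character is never split across two reads.
while (($buffer = fgets($handle)) !== false) {
    echo htmlentities($buffer);
}
if (!feof($handle)) {
    fwrite(STDERR, "Error: unexpected fgets() fail\n");
}
fclose($handle);
?>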

Simply copy the above script into a PHP file, make it executable, and run it with an input file as its argument; it’ll print the entity-encoded version of whatever markup you give it. 🙂
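For comparison, roughly the same thing can be sketched in Python with the standard `html` module. The function names are hypothetical, and the behaviour is not identical: `html.escape` only touches the markup-significant characters (`&`, `<`, `>` and the two quote characters), whereas PHP's `htmlentities` also converts other characters that have named entities.

```python
import html


def escape_markup(text):
    """Escape &, <, > and quotes so markup can be pasted into an HTML post."""
    return html.escape(text, quote=True)


def escape_file(path):
    """Return the escaped contents of a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return "".join(escape_markup(line) for line in handle)
```

For example, `escape_markup('<xsl:text>')` returns `&lt;xsl:text&gt;`.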

Database Operations Engineer at Box, Inc., RIT Grad, and all around Linux and database guy.

One comment on “Convert Docx to LaTeX!”
  1. elena says:

    Hi, the code works pretty well; however, as also posted on Harri’s blog, the footnotes are not converted properly.
    The content of the first footnote (w:id="1") is repeated in all the footnotes.
    It seems the code does not properly grab the content of the subsequent notes, but unfortunately Harri did not answer this issue.
    Any idea how to fix this?

    Elena
