XSL to extract DOCX comments into plain text

This was an impromptu project I put together in about 20 minutes to extract comments from a DOCX file. I had stored answers to lab questions as comments in a DOCX, and one of the graders I work with needed those comments in plain text. So I went back to the XSL for converting DOCX to LaTeX from my last post and wrote a new stylesheet to extract the comments. Here it is!

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl" version="1.0"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <xd:doc scope="stylesheet">
        <xd:desc>
            <xd:p><xd:b>Created on:</xd:b> Apr 25, 2011</xd:p>
            <xd:p><xd:b>Author:</xd:b> Geoffrey Anderson</xd:p>
            <xd:p><xd:b>E-mail:</xd:b> geoff@geoffreyanderson.net</xd:p>
            <xd:p><xd:b>Website:</xd:b> http://geoffreyanderson.net</xd:p>
        </xd:desc>
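    </xd:doc>

    <!-- Emit plain text: no XML declaration and no entity escaping in the output -->
    <xsl:output method="text" encoding="UTF-8"/>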

    <xsl:variable name="newline">
        <xsl:text>
        </xsl:text>
    </xsl:variable>
    <xsl:template match="/">
        <xsl:for-each select="/w:comments/w:comment">
################
# Comment #<xsl:number value="position()" format="1"/> #
################
<xsl:for-each select="w:p">
<xsl:value-of select="$newline" />
<xsl:for-each select="w:r">
<xsl:value-of select="w:t"/>
</xsl:for-each>
</xsl:for-each>

----

</xsl:for-each>
    </xsl:template>
</xsl:stylesheet>


The bad indenting is intentional: literal text inside a template is copied to the output along with its leading whitespace, so indenting those lines would indent the output too. To use this (under Ubuntu, at least), simply unzip the DOCX file:

$ unzip someWordDoc.docx -d someWordDocDir/


And run the above XSL against the comments.xml file under the “word” directory:

$ xsltproc convertDocxCommentsToPlainText.xsl someWordDocDir/word/comments.xml


By doing this, you’ll get output similar to the following:

################
# Comment #1 #
################

        text of the first comment

----

################
# Comment #2 #
################

        text of the second comment

----
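
If you don't want the unzipped directory lying around, the two steps above can be rolled into one. Here is a rough sketch, not something from the original 20 minutes: the function name docx_comments is made up for illustration, and it assumes the stylesheet is saved as convertDocxCommentsToPlainText.xsl in the current directory unless you pass another path. It leans on unzip -p (print an archive member to stdout) and on xsltproc reading its input from stdin when given "-".

```shell
#!/bin/sh
# Hypothetical helper (not part of the original post): print the comments
# of a DOCX file as plain text without unpacking the archive to disk.
# Usage: docx_comments FILE.docx [STYLESHEET.xsl]
docx_comments() {
    docx="${1:-}"
    # Assumed default: the stylesheet file name used earlier in this post.
    xsl="${2:-convertDocxCommentsToPlainText.xsl}"

    if [ -z "$docx" ] || [ ! -f "$docx" ]; then
        echo "usage: docx_comments FILE.docx [STYLESHEET.xsl]" >&2
        return 2
    fi

    # unzip -p streams word/comments.xml to stdout; xsltproc reads the
    # document from stdin because of the trailing "-".
    # Note: a DOCX with no comments has no word/comments.xml, so both
    # tools will complain in that case.
    unzip -p "$docx" word/comments.xml | xsltproc "$xsl" -
}
```

Source the file (or paste the function into your shell) and call it as docx_comments someWordDoc.docx. The output should be the same as with the two-step version, since the stylesheet only ever sees comments.xml.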


Cheers!

Database Operations Engineer at Box, Inc., RIT Grad, and all around Linux and database guy.
