XSL to extract DOCX comments into plain text
This was an impromptu project I put together in about 20 minutes to extract the comments from a DOCX file. I had stored my answers to lab questions as comments in a DOCX, and one of the graders I work with needed those comments in plain text. So I went back to the XSL for converting DOCX to LaTeX from my last post and wrote a new stylesheet that extracts comments. Here it is!
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl" version="1.0"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<xd:doc scope="stylesheet">
<xd:desc>
<xd:p><xd:b>Created on:</xd:b> Apr 25, 2011</xd:p>
<xd:p><xd:b>Author:</xd:b> Geoffrey Anderson</xd:p>
<xd:p><xd:b>E-mail:</xd:b> geoff@geoffreyanderson.net</xd:p>
<xd:p><xd:b>Website:</xd:b> http://geoffreyanderson.net</xd:p>
</xd:desc>
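</xd:doc>
<!-- Emit plain text; without this, xsltproc defaults to XML output and prepends an XML declaration. -->
<xsl:output method="text" encoding="UTF-8"/>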
<xsl:variable name="newline">
<xsl:text>
</xsl:text>
</xsl:variable>
<xsl:template match="/">
<xsl:for-each select="/w:comments/w:comment">
################
# Comment #<xsl:number value="position()" format="1"/> #
################
<xsl:for-each select="w:p">
<xsl:value-of select="$newline" />
<xsl:for-each select="w:r">
<xsl:value-of select="w:t"/>
</xsl:for-each>
</xsl:for-each>
----
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
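The lack of indentation is intentional: literal text inside a template is copied to the output together with its surrounding whitespace, so indenting the stylesheet would indent the output too. To use this (under Ubuntu, at least), simply unzip the DOCX file: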
$ unzip someWordDoc.docx -d someWordDocDir/
Then run the stylesheet above (saved here as convertDocxCommentsToPlainText.xsl) against the comments.xml file in the “word” directory:
$ xsltproc convertDocxCommentsToPlainText.xsl someWordDocDir/word/comments.xml
By doing this, you’ll get output similar to the following:
################
# Comment #1 #
################
text of the first comment
----
################
# Comment #2 #
################
text of the second comment
----
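If you'd rather not unpack the whole archive, the unzip and xsltproc steps can be folded into a single pipeline. What follows is only a minimal sketch, not part of my original workflow: docx_comments is a hypothetical helper name, and it assumes unzip and xsltproc are installed and that the stylesheet has been saved to a file.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not from the original post):
# print the comments of a DOCX file as plain text without unpacking it to disk.
# Usage: docx_comments STYLESHEET.xsl FILE.docx
docx_comments() {
    if [ "$#" -ne 2 ]; then
        echo "usage: docx_comments STYLESHEET.xsl FILE.docx" >&2
        return 2
    fi
    xsl=$1
    docx=$2
    if [ ! -r "$xsl" ] || [ ! -r "$docx" ]; then
        echo "docx_comments: cannot read '$xsl' or '$docx'" >&2
        return 1
    fi
    # unzip -p writes the archive member to stdout instead of to disk;
    # the trailing "-" tells xsltproc to read the XML document from stdin.
    unzip -p "$docx" word/comments.xml | xsltproc "$xsl" -
}
```

After sourcing the function, calling docx_comments convertDocxCommentsToPlainText.xsl someWordDoc.docx should print the same kind of output as above. A document with no comments has no word/comments.xml in the archive, so the pipeline will fail in that case rather than print nothing.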
Cheers!