XSL to extract DOCX comments into plain text
So, this was an impromptu project I slapped together in about 20 minutes to extract comments from a DOCX file. I had stored answers to lab questions as comments in a DOCX, and one of the graders I work with needed those comments in plain text. I remembered the XSL for converting DOCX to LaTeX from my last post and wrote a new stylesheet to extract the comments. Here it is!
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl" version="1.0"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
    <xd:doc scope="stylesheet">
        <xd:desc>
            <xd:p><xd:b>Created on:</xd:b> Apr 25, 2011</xd:p>
            <xd:p><xd:b>Author:</xd:b> Geoffrey Anderson</xd:p>
            <xd:p><xd:b>E-mail:</xd:b> geoff@geoffreyanderson.net</xd:p>
            <xd:p><xd:b>Website:</xd:b> http://geoffreyanderson.net</xd:p>
        </xd:desc>
    </xd:doc>
<!-- Emit plain text: no XML declaration, no entity escaping -->
<xsl:output method="text" encoding="UTF-8"/>
<!-- A literal line break, emitted before each paragraph of a comment -->
<xsl:variable name="newline">
<xsl:text>
</xsl:text>
</xsl:variable>
<xsl:template match="/">
<xsl:for-each select="/w:comments/w:comment">
################
# Comment #<xsl:number value="position()" format="1"/> #
################
<xsl:for-each select="w:p">
<xsl:value-of select="$newline" />
<xsl:for-each select="w:r">
<xsl:value-of select="w:t"/>
</xsl:for-each>
</xsl:for-each>
----
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
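The bad indenting is intentional: literal text inside a template is copied to the output exactly as written, leading whitespace included, so indenting those lines would put weird tabbing/formatting into the output. To use this (under Ubuntu, at least), simply unzip the DOCX file: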
$ unzip someWordDoc.docx -d someWordDocDir/
And run the above XSL against the comments.xml file under the “word” directory:
$ xsltproc convertDocxCommentsToPlainText.xsl someWordDocDir/word/comments.xml
By doing this, you’ll get output similar to the following:
################
# Comment #1 #
################

text of the first comment
----

################
# Comment #2 #
################

text of the second comment
----
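If xsltproc isn't handy, the same walk over the comments (each w:comment, then its w:p paragraphs, then the w:t text of each w:r run) can be sketched in Python using only the standard library. This is a rough equivalent rather than a drop-in replacement: the function names `comments_xml_to_text` and `extract_docx_comments` are made up for this example, and the exact blank lines it prints assume the stylesheet is laid out with one instruction per line.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# Same namespace URI the stylesheet binds to the "w" prefix.
NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def comments_xml_to_text(xml_bytes):
    """Render the bytes of word/comments.xml as plain text (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)  # root element is w:comments
    chunks = []
    for number, comment in enumerate(root.findall("w:comment", NS), start=1):
        # Banner with the 1-based position of the comment.
        chunks.append(
            "\n################\n# Comment #%d #\n################\n" % number
        )
        for paragraph in comment.findall("w:p", NS):
            chunks.append("\n")  # one line break before each paragraph
            for run in paragraph.findall("w:r", NS):
                text = run.find("w:t", NS)  # first w:t of the run, if any
                if text is not None and text.text:
                    chunks.append(text.text)
        chunks.append("\n----\n")
    return "".join(chunks)


def extract_docx_comments(docx):
    """Read word/comments.xml straight out of the DOCX zip (hypothetical helper).

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        # A document without comments has no word/comments.xml part.
        if "word/comments.xml" not in archive.namelist():
            return ""
        return comments_xml_to_text(archive.read("word/comments.xml"))


if __name__ == "__main__" and len(sys.argv) == 2:
    print(extract_docx_comments(sys.argv[1]), end="")
```

Saved under a name of your choosing (say, extract_comments.py) and run as `python3 extract_comments.py someWordDoc.docx`, it skips the unzip step because it reads the archive directly.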
Cheers!