Suppose you would like to change the content of a Word document, for instance to automate a report. Of course, you can use VBA and Word macros to do this. However, Microsoft does not support running Word on the server side. When you want to create a Word document as a server-side action, for instance to deliver it as an attachment from a RESTful web service, you have to use a different technology. I am going to cover the approaches to this problem that I considered.
I had to automate the generation of Word documents. The report to automate was a relatively big Word document containing a lot of information about a water well. The input was a set of text files with information about the water well. Checkboxes and character fields were the two most extensively used elements in the form. The most important requirement for the client was to have the custom fields (like checkboxes) as Word-native as possible and as versatile as they are in a native Word document. I chose C++ and Xerces as the development tools.
Word file format is changing
Microsoft keeps working on and changing the Word file format. A very good resource with a description of the format is available here: https://www.loc.gov/preservation/digital/formats/fdd/fdd000397.shtml The latest changes, made in December 2017, are marked as “major”. This is the official change log:
If you want to use the latest Word features, you will have to use the official documentation, and most likely the only way to automate the newest features is going to be WordprocessingML (this means control over the complete XML structure).
Firstly, a Word document is a ZIP file. If you extract it to a directory, you will see its structure. The most important file is word/document.xml. This is where the actual text of your document goes.
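To make the ZIP structure concrete, here is a minimal Python sketch that lists the parts of a package and reads word/document.xml. The helper names (build_demo_package, list_parts, read_part) are hypothetical, and the demo package only mimics the layout with placeholder content; it is not a file Word would open.

```python
import io
import zipfile


def build_demo_package() -> bytes:
    """Hypothetical stand-in for a real .docx: it only mimics the part layout."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document>Hello, well report</document>")
        package.writestr("word/media/image1.png", b"")
    return buffer.getvalue()


def list_parts(docx_bytes: bytes) -> list:
    """Return the names of all files stored inside the package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


def read_part(docx_bytes: bytes, name: str) -> str:
    """Return one part of the package decoded as UTF-8 text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.read(name).decode("utf-8")


demo = build_demo_package()
parts = list_parts(demo)
main_document = read_part(demo, "word/document.xml")
```

With a real document you would pass the bytes of the .docx file instead of the demo package.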
Secondly, there is a file where you place all your relationships. There are two ways of inserting an image. One is to put a file inside the media directory and add a relationship to this file with the appropriate tag. If you want to use this image, you use the relationship ID inside your document.xml.
A very nice blog entry which explains the structure of a Word document can be found here: https://www.toptal.com/xml/an-informal-introduction-to-docx
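The relationship mechanism can be sketched in a few lines of Python. In the packages I have seen, the relationships of the main document live in word/_rels/document.xml.rels; the sample content, the file names well.png and pump.png, and the function add_image_relationship are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
IMAGE_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"
)
# Serialize the relationships namespace as the default namespace.
ET.register_namespace("", RELS_NS)

# Assumed sample of a relationships file with one existing image.
SAMPLE_RELS = (
    f'<Relationships xmlns="{RELS_NS}">'
    f'<Relationship Id="rId1" Type="{IMAGE_TYPE}" Target="media/well.png"/>'
    "</Relationships>"
)


def add_image_relationship(rels_xml: str, target: str):
    """Append an image relationship; return (new relationship ID, new XML)."""
    root = ET.fromstring(rels_xml)
    used = {rel.get("Id") for rel in root.findall(f"{{{RELS_NS}}}Relationship")}
    number = 1
    while f"rId{number}" in used:  # pick the first ID that is still free
        number += 1
    new_id = f"rId{number}"
    ET.SubElement(
        root,
        f"{{{RELS_NS}}}Relationship",
        {"Id": new_id, "Type": IMAGE_TYPE, "Target": target},
    )
    return new_id, ET.tostring(root, encoding="unicode")


new_id, updated_rels = add_image_relationship(SAMPLE_RELS, "media/pump.png")
```

The returned ID is the value that the picture markup inside document.xml then refers to.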
Approaches to automating the modification of your Word document
What is not advised at all
My client wanted me to exchange phrases inside the XML (Word document). He had about 100 different phrases to exchange in order to generate a report. He told me to walk the XML 100 times, from the root to the last leaf, and to exchange the text value whenever I found a matching one. With 100 different entries, the software would change one value at a time. This is not advised at all. Nobody can guarantee that the same entry will not appear twice (or more) within one document, especially if you are working with numbers. This is not a solution at all. XML is not really a flat file format; it has a nested, stack-based nature, and treating XML as a plain text file format (like txt) is the biggest misconception in my opinion.
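The ambiguity is easy to reproduce. The following Python sketch uses a made-up fragment of document.xml in which the depth and the diameter of a well happen to share the value 12; the fragment and the function count_matches are hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Made-up fragment: two different quantities share the same text value.
DOCUMENT_XML = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:r><w:t>Depth [m]: </w:t></w:r><w:r><w:t>12</w:t></w:r></w:p>
<w:p><w:r><w:t>Diameter [cm]: </w:t></w:r><w:r><w:t>12</w:t></w:r></w:p>
</w:body></w:document>"""


def count_matches(document_xml: str, value: str) -> int:
    """Count the text nodes whose content equals the searched value."""
    root = ET.fromstring(document_xml)
    return sum(1 for node in root.iter(f"{{{W}}}t") if node.text == value)


matches = count_matches(DOCUMENT_XML, "12")
```

Here matches is 2, so a search by value cannot tell which of the two fields should receive the new number.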
Through an XML file attached to the Word package
You may save any file you want within a Word file (as it is a ZIP archive). There is really no problem with generating an XML file with the data you would like to use inside the Word document and putting it inside the Word package. The connection between the XML and the Word document is very nice and simple: you just drag and drop fields from the XML tab onto the Word document. In order to automate the generation of the Word document, you just extract the Word document, replace the XML file, and the content of the Word document is exchanged. Done!
This approach will work in most cases, but it works only for the replacement of text. In most cases this is the correct way to go: create a valid XML file, import it into the Word document, create the first document by dragging XML elements onto the Word document and later, in whatever programming language, just exchange the XML file.
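Exchanging the XML file can be sketched in Python as follows. A ZIP archive cannot be edited in place, so the sketch copies every part into a new archive and swaps one of them. The part name customXml/item1.xml is an assumption (check where your own document stores the attached XML), and the demo package contains placeholder content only.

```python
import io
import zipfile


def replace_part(docx_bytes: bytes, part_name: str, new_content: bytes) -> bytes:
    """Copy the package into a new archive, replacing exactly one part."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as source, zipfile.ZipFile(
        output, "w", zipfile.ZIP_DEFLATED
    ) as target:
        for item in source.infolist():
            if item.filename == part_name:
                data = new_content
            else:
                data = source.read(item.filename)
            target.writestr(item, data)
    return output.getvalue()


# Placeholder package standing in for a real template document.
template = io.BytesIO()
with zipfile.ZipFile(template, "w") as package:
    package.writestr("word/document.xml", "<document/>")
    package.writestr("customXml/item1.xml", "<well><depth>10</depth></well>")

report = replace_part(
    template.getvalue(), "customXml/item1.xml", b"<well><depth>42</depth></well>"
)
with zipfile.ZipFile(io.BytesIO(report)) as package:
    new_data = package.read("customXml/item1.xml")
    untouched = package.read("word/document.xml")
```

All other parts, including document.xml with its field bindings, are copied unchanged.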
Through XSLT transformation
In this approach you use the tools which are available for XML processing, namely XSLT. If you want to provide multiple output possibilities, like export to PDF, export to HTML and export to Word, this is probably the best option.
A complete XSLT description is available here. Have a look at it!
A Word document is an XML document. You may modify your original Word document and insert XSLT instructions into it. As a result you will have an XSL transformation file and will be ready to transform your document. You take the Word document file (document.xml) and make a corresponding XSLT. You put all your new values inside the XSLT, run it, and the result is on your table.
The XSLT approach is more focused on exchanging the file format (from XML to Word, from XML to PDF, from XML to HTML etc.) rather than the text content. But it can also be used to exchange the content of a Word document: you simply put different content inside your XSLT transformation.
Please follow one or both tutorials if you want to try Word modification using XSLT:
XSLT – legacy of functional programming
Writing XSLT resembles functional programming. Variables in XSLT are declarative (like in a functional language): a variable gets its value when you declare it and cannot be changed afterwards. This can be a drawback when you want to do some complex processing of your Word document. Fortunately, you have functions which allow you to compute different values in your XSLT transformation. This can be particularly useful for computing layout dimensions or any numerical value which should be precomputed prior to saving to a new file format or a new document.
A very nice tutorial on advanced XSLT can be read here (in Polish).
WordprocessingML is the XML-based language in which Word documents are written. You may create anything you want when you know WordprocessingML. You may add corrections to your text, you may add any image you would like, etc. You will have control over the size of each element. Anything that is available in Word will be available to you. The following website is a very nice introduction to WordprocessingML: http://officeopenxml.com/anatomyofOOXML.php It gives you a clear start in writing correct Word XML syntax.
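As a small taste of WordprocessingML, here is a Python sketch that builds one paragraph with a single, optionally bold, run. It only illustrates the element nesting (paragraph, run, run properties, text); the function make_paragraph is hypothetical and this is not the C++/Xerces code I used.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
ET.register_namespace("w", W)  # serialize with the usual "w:" prefix


def w(tag: str) -> str:
    """Return the namespace-qualified name of a WordprocessingML element."""
    return f"{{{W}}}{tag}"


def make_paragraph(text: str, bold: bool = False) -> ET.Element:
    """Build <w:p><w:r>[<w:rPr><w:b/></w:rPr>]<w:t>text</w:t></w:r></w:p>."""
    paragraph = ET.Element(w("p"))
    run = ET.SubElement(paragraph, w("r"))
    if bold:
        properties = ET.SubElement(run, w("rPr"))
        ET.SubElement(properties, w("b"))
    text_node = ET.SubElement(run, w("t"))
    # Keep leading and trailing spaces of the text intact.
    text_node.set(XML_SPACE, "preserve")
    text_node.text = text
    return paragraph


paragraph_xml = ET.tostring(make_paragraph("Well depth", bold=True), encoding="unicode")
```

The resulting element would be appended to the w:body element of document.xml.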
Microsoft published a C# library to manipulate some WordprocessingML entities. You may have a look at Microsoft’s official Open XML SDK here. I used C++ and Xerces and implemented my own logic in C++ on top of the WordprocessingML entities. I personally had to work with complicated fields like checkboxes, which are probably not covered by Microsoft’s library. Moreover, I am a big fan of C++ and, if possible, do not want to touch C#.
The process of covering the document with pure XML/WordprocessingML is probably the most laborious. It pays off greatly when you have to control the content of the generated document in detail. Suppose you are from the US or Europe, where you write from left to right: if you want to deliver your report in Arabic (written from right to left), you have to maintain control over the details. The same goes for computing picture sizes: if you add a simple image to a Word document, Word will not scale it. In order to automate the insertion of pictures into the Word document, you will have to compute their size, which is possible only when using pure WordprocessingML (XML file format). A very helpful article on computing sizes inside a Word document can be found here: https://startbigthinksmall.wordpress.com/2010/01/04/points-inches-and-emus-measuring-units-in-office-open-xml/
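The size computation can be sketched as follows. Picture extents in Office Open XML are expressed in EMUs (914400 per inch, 360000 per centimetre, 12700 per point). The function names, the default of 96 DPI and the 16 cm column width are assumptions for illustration.

```python
EMU_PER_INCH = 914_400
EMU_PER_CM = 360_000
EMU_PER_POINT = 12_700


def pixels_to_emu(pixels: int, dpi: int = 96) -> int:
    """Convert a pixel count to EMUs for an image of the given resolution."""
    return round(pixels * EMU_PER_INCH / dpi)


def fit_image(width_px: int, height_px: int, max_width_emu: int, dpi: int = 96):
    """Return (cx, cy) in EMUs, scaled down to max_width_emu if necessary."""
    cx = pixels_to_emu(width_px, dpi)
    cy = pixels_to_emu(height_px, dpi)
    if cx > max_width_emu:
        cy = round(cy * max_width_emu / cx)  # keep the aspect ratio
        cx = max_width_emu
    return cx, cy


# A 1920 x 960 pixel picture placed into an assumed 16 cm wide column.
extent = fit_image(1920, 960, max_width_emu=16 * EMU_PER_CM)
```

The two numbers are what goes into the cx and cy attributes of the picture extent in document.xml.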
Evaluation of methods
Why was embedding XML inside the Word document not suitable?
Putting your XML document inside a Word document and connecting it with the Word content is great, fast and probably the easiest way. You do not even have to know what XSLT is and how to use it. As long as you only have to exchange the text, and you always keep the same kind of information inside the document, it is a great solution. As soon as you have to change the structure of the document on the fly, embedding XML into the document is no longer sufficient.
In the report I was creating, there was a list of the materials used in the water well. The report was limited in space and was intended to fit on one page if possible. There were a few checkboxes in the report, like concrete, metal etc. Once we got “plastic” from our input file, where the well description was saved, the only way to go was to throw away one option from the Word report and insert, in exactly that place, a new element responsible for the new material. Another example: there was equipment attached to the water well. There was a standard place for 4 pieces inside the report. The structure of the report had to be changed in case there were more than 4 pieces of equipment. Those changes were made with WordprocessingML. Unfortunately, exchanging the XML file does not provide a solution for modifying the Word structure. You may use XSLT transformations to add additional rows, fields etc.
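The equipment case can be illustrated with a Python sketch that fills a table prepared for 4 entries and clones the last row for every additional entry. The simplified one-cell rows, the equipment names and the function fill_equipment_table are made up for illustration; the real report table and my C++ code were more involved.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def w(tag: str) -> str:
    """Return the namespace-qualified name of a WordprocessingML element."""
    return f"{{{W}}}{tag}"


def make_row(text: str) -> ET.Element:
    """Build a simplified table row with one cell holding one text run."""
    row = ET.Element(w("tr"))
    cell = ET.SubElement(row, w("tc"))
    paragraph = ET.SubElement(cell, w("p"))
    run = ET.SubElement(paragraph, w("r"))
    ET.SubElement(run, w("t")).text = text
    return row


def fill_equipment_table(table: ET.Element, equipment: list) -> None:
    """Write one name per row, cloning the last row when rows run out."""
    rows = table.findall(w("tr"))
    for index, name in enumerate(equipment):
        if index >= len(rows):
            extra = copy.deepcopy(rows[-1])  # the clone keeps the row formatting
            table.append(extra)
            rows.append(extra)
        rows[index].find(f".//{w('t')}").text = name


table = ET.Element(w("tbl"))
for _ in range(4):  # the template offers room for 4 pieces of equipment
    table.append(make_row(""))
fill_equipment_table(
    table, ["Pump", "Filter", "Valve", "Pressure tank", "Cable", "Flow meter"]
)
row_texts = [row.find(f".//{w('t')}").text for row in table.findall(w("tr"))]
```

After the call the table has 6 rows, which is the kind of structural change that replacing an attached XML file cannot express.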
Why was XSLT a bad idea?
XSLT is great for creating new value, like creating an HTML document out of an XML document. It is fast, reliable and not that complicated.
XSLT is great when you transform the structure of a document from one format to another. In this case, both the content of the document (as new input data came in) and the structure of the document had to be changed. Basically, I would have had to create the XSLT file programmatically to change the document text, which seemed difficult to maintain in the future. Pure XSLT is responsible for changing the structure, and additionally I would have had to change the text content (this means dynamic creation of XSLT). For me, dynamic creation of XSLT was a dangerous step with an unpredictable future that would have led to a lot of maintenance effort.
If you have to modify a Word document, you have to cover many namespaces, and lots of elements will be difficult to transform with XSLT as well. This is not a good technology for manipulating a Word document; it is difficult to pass one through XSLT.
When should you never consider XSLT?
If you care about the detailed layout of your Word document, you should go straight to WordprocessingML and never consider anything else. An especially “difficult” problem is: when do we start a new page in the Word document? Once you are in any programming language and work with WordprocessingML, you have access to the data needed to compute paging. You do not have access to paging from within XSLT.
Why was programming WordprocessingML good? 🙂
It secured the future. I can add as many pieces of equipment as I want, I can exchange materials if required, I can control the size of pictures. I covered the native XML tags of WordprocessingML that were interesting for this particular purpose. As a result, the native Microsoft Word software reads my files and does not notice a difference. I simply provided a 100% Word-compatible file. Of course, I invested much more time in development. For me, it was the correct solution.