====== Word document generation ======

This page explores ways to generate Word documents on the server side (e.g. on a web server). My need is to generate complete documents automatically that users can edit afterwards, hence the choice of the Word format.

===== Possible solutions =====

  * Generate binary .doc files. Are you kidding me?
  * Generate .docx files (after all, that's XML, isn't it?). BANNED. I don't have time to read a 7500-page specification no one is capable of implementing, not even Microsoft!
  * Use the COM control to pilot Word on the server. BANNED. Microsoft strongly [[http://support.microsoft.com/default.aspx?scid=kb;EN-US;257757|advises]] **against** this solution because Office is "**unstable**": "//Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment.//" Doh!
  * Generate RTF documents. ABANDONED. This format is not documented well enough.
  * OpenOffice.org/LibreOffice piloting. ABANDONED. I don't want to wrestle with OOo automation (last time I checked, the Python API was awful), and it would not be more reliable than MS Office.
  * Expensive, commercial libraries. ABANDONED. I don't want my applications to be tied too closely to a commercial API, which would be a pain to migrate away from if the company goes bankrupt or causes commercial problems (phpdocx, anyone?).
  * Apache POI. ABANDONED. The main maintainer left the project.
  * HTML generation: **CHOSEN.**

===== Why HTML? =====

  * Generating HTML is easy.
  * Generating HTML documents is **fast** (orders of magnitude faster than using a COM control).
  * Word can open HTML documents as if they were native Word files. Heck, you can just rename the HTML file to .doc and Word will open it!
  * HTML documents are easy to style. Word accepts most HTML & CSS directives (font, size, color, tables, alignment...).
  * MS Office also recognizes some Microsoft-specific HTML elements and CSS attributes which can be used to insert page numbers, a table of contents and so on.
  * MS Office-specific HTML is more or less documented ([[http://msdn.microsoft.com/en-us/library/aa155477(v=office.10).aspx|here]]).

Trick: Embedding images requires external files, so we cannot use a single HTML file. The solution is simple: generate several files (the main document in HTML, images...) then put everything in a single MIME 1.0 file. This is exactly what MS Word does when you save a document as **mhtml** (.mht). Headers and footers must also be put in a separate file, which is included in the MIME file as well. Generating MIME 1.0 files is easy, even in PHP.

So our solution can be summed up as:

  * Generate the main document in HTML, using some Word-specific HTML elements.
  * Put any additional data in separate files (images, etc.).
  * Pack all the files in a single MIME 1.0 file.
  * Serve this file as a .doc to the client.

===== Building a Word HTML document =====

Here are the different items you can/must use to build a Microsoft Office Word HTML document. These are only snippets, not full HTML documents. (Full HTML example documents will be in the "Examples" section.)

Basically, you generate an HTML document then serve it to the client as a .doc file (both in filename extension and in MIME type). Word will open this file as if it were a plain .doc file.

==== HTML declaration ====

...

==== Document properties ====

Microsoft Office HTML Example

  * **size**: Page size.
  * **margin**: Document margins.
  * **mso-page-orientation**: portrait or landscape.

==== Page declaration ====

You are supposed to put pages (or groups of pages) in a "section" (a ''DIV''), like this:

Microsoft Office HTML Example
I'm page 1.
I'm page 2.
  * **@page** is used to set properties of the whole document.
  * Each **@page SectionX** can be used to change the properties of a group of pages.
  * Each page or group of pages must be put in a ''DIV'' of the matching class (such as ''%%<div class=Section1>%%'' in our example). Feel free to create as many "sections" as needed.

Caveat: Changing things like page orientation for a group of pages or a specific page does not seem to work. FIXME

==== Standard HTML/CSS elements ====

Word will accept most standard HTML and CSS features, such as headings (h1, h2, h3...), lists (ul, li), tables, colors... Go experiment yourself. Here are some examples:

Microsoft Office HTML Example

Title level 1

Title level 2

Title level 3

Text in level 3

2nd title level 2

Another level 3 title

List:
  • element 1
  • element 2
  • element 3
    • element 4
    • element 5
    • element 6
      • element 7
      • element 8
  • element 9
  • element 10
^ Column A ^ Column B ^ Column C ^
| A1 | B1 | C1 |
| A2 | B2 Test with looooong text: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla sed sapien ac tortor porttitor lobortis. Donec velit urna, vulputate eu egestas eu, lacinia non dolor. Cras lacus diam, tempus sed ullamcorper a, euismod id nunc. Fusce egestas velit sed est fermentum tempus. Duis sapien dui, consectetur eu accumsan id, tristique sit amet ante. | C2 |
| A3 | B3 | C3 |
Rendering in Word:

{{ :wordgen:wordgen_basic_html.png }}

Note that these elements can later be styled, either using inline CSS (the ''style'' attribute) or using CSS stylesheets (e.g. you can style all h1 elements).

==== Forcing display mode on opening ====

Microsoft Office HTML Example ...

This will force the "Page" display mode when the file is opened. This block **must** be put just after the ''title'', otherwise it will not work. You can use 80 or 90 for Zoom if you want two pages to fit on screen.

==== Page break ====
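A minimal sketch of a page break, assuming the same markup as the main document of the MIME example further down this page (decoded from its base64 part): a ''BR'' element carrying ''page-break-before:always''. The helper names ''word_page_break()'' and ''word_join_pages()'' are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical helper: returns the page-break markup found in the
// MIME example document on this page.
function word_page_break(): string
{
    return "<br clear=all style='mso-special-character:line-break;page-break-before:always'>";
}

// Hypothetical helper: joins HTML fragments so that each fragment
// after the first one starts on a new page.
function word_join_pages(array $pages): string
{
    return implode("\n" . word_page_break() . "\n", $pages);
}

// Two pages inside one section, as in the examples of this page.
echo "<div class=Section1>\n";
echo word_join_pages(["I'm page 1.", "I'm page 2."]), "\n";
echo "</div>\n";
```

Keeping the markup in one function means a template never has to repeat the Word-specific style string.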
==== Tables and page breaks ====

=== Prevent a table cell from spanning multiple pages ===

Put in your stylesheet:

  td { page-break-inside:avoid; }

or apply it only to specific cells:

...

=== Prevent tables from spanning multiple pages ===

Put in your stylesheet:

  tr { page-break-after:avoid; }

or apply it to all ''TR'' elements of a table:

...

==== A note about computed fields ====

Computed fields include the TOC (table of contents), page references and so on. When Word opens a document, computed fields are **not** updated automatically. The user has to do it manually by pressing CTRL+A (to select the whole document) and then F9. Thus, the computed fields you insert in your document will not show up until the user manually updates them. This is a limitation of Word itself, and there is no simple solution to it.

==== TOC (Table of contents) ====

Table of contents - Please right-click and choose "Update fields".
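A rough sketch of how such a TOC field could be emitted from PHP. It assumes the field-begin / field-separator / field-end span structure that appears in the footer of the MIME example on this page (there it wraps ''PAGE'' and ''NUMPAGES''). The helper name ''word_field()'' is hypothetical, and the ''\o "1-3"'' and ''\u'' switches are my assumption taken from Word's TOC field syntax, so adjust them to your needs.

```php
<?php
// Hypothetical helper: wraps a Word field code in the field markers.
// $placeholder is the text displayed until the user updates the fields.
function word_field(string $code, string $placeholder): string
{
    return "<!--[if supportFields]>"
         . "<span style='mso-element:field-begin'></span>"
         . $code
         . "<span style='mso-element:field-separator'></span>"
         . "<![endif]-->"
         . "<span style='mso-no-proof:yes'>" . htmlspecialchars($placeholder) . "</span>"
         . "<!--[if supportFields]>"
         . "<span style='mso-element:field-end'></span>"
         . "<![endif]-->";
}

// TOC built from heading levels 1 to 3 (assumed switches, see above).
// Single quotes keep the backslashes of the field code untouched.
$toc = "<p>"
     . word_field(' TOC \o "1-3" \u ', 'Table of contents - Please right-click and choose "Update fields".')
     . "</p>";

echo $toc, "\n";
```

The same helper should work for other computed fields, for instance ''word_field(' PAGE ', '1')'' for a page number.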

As you can't predict the page numbers, the TOC needs to be updated manually by the user. (Not a heavy burden, and I guess it's possible to automate this by including a script in the file. I'll investigate that later.)

TOC before update (upon file opening):

{{ :wordgen:wordgen_toc1.png }}

TOC after update (it reflects the different heading levels h1, h2, h3...):

{{ :wordgen:wordgen_toc2.png }}

If you want to customize the TOC, have a look at the [[http://office.microsoft.com/en-us/word-help/field-codes-toc-table-of-contents-field-HP005186201.aspx|Microsoft documentation]] about this dynamic field.

==== Bookmarks and references ====

You can reference another chapter or page rather easily. Here is an example: set a bookmark in a document, and display the number of the page where this bookmark is located.

Bookmarks: simply put an HTML anchor:

Appendix

Then the reference:

For more information, see appendix at page

==== Header and footer ====

Headers and footers must be put in a separate file, in a subdirectory. Example:

  * **''mydocument.htm''**: The main document
  * **''mydocument_files/headerfooter.htm''**: The header and footer

__Note__: It is important that the subdirectory name starts with the main document name (**mydocument**.htm -> **mydocument**_files), otherwise Word will display a warning.

Microsoft Office HTML Example
I'm page 1.
I'm page 2.
Note that the file ''filelist.xml'' does //not// need to be present, but its declaration in the main document is mandatory.

Header

Footer page 1/1
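As a hedged sketch, the two pieces involved (the ''@page'' rule in the main document and the separate header/footer file) could be generated like this. The markup follows what the MIME example on this page contains once decoded (''mso-header'' / ''mso-footer'' pointing to the ''h1'' and ''f1'' ids); the function names are hypothetical, and the footer here is plain text without the page-number fields.

```php
<?php
// Hypothetical helper: CSS attaching header "h1" and footer "f1"
// from the companion file to the pages of the document.
function word_header_footer_css(string $docName): string
{
    // The subdirectory must start with the main document name.
    $file = $docName . '_files/headerfooter.htm';
    return "@page {\n"
         . "    mso-header: url(\"$file\") h1;\n"
         . "    mso-footer: url(\"$file\") f1;\n"
         . "}\n";
}

// Hypothetical helper: content of headerfooter.htm with plain-text
// header and footer.
function word_header_footer_file(string $header, string $footer): string
{
    return "<html xmlns:o='urn:schemas-microsoft-com:office:office'"
         . " xmlns:w='urn:schemas-microsoft-com:office:word'"
         . " xmlns='http://www.w3.org/TR/REC-html40'>\n"
         . "<body>\n"
         . "<div style='mso-element:header' id=h1>\n"
         . "<p class=MsoHeader>" . htmlspecialchars($header) . "</p>\n"
         . "</div>\n"
         . "<div style='mso-element:footer' id=f1>\n"
         . "<p class=MsoFooter>" . htmlspecialchars($footer) . "</p>\n"
         . "</div>\n"
         . "</body>\n</html>\n";
}

echo word_header_footer_css('mydocument');
echo word_header_footer_file('Header', 'Footer');
```

Deriving the file path from the document name in one place keeps the ''mydocument'' / ''mydocument_files'' naming rule from being broken by accident.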

After opening the document, don't forget to switch to "Page" display mode to see the headers/footers. This is what you get:

{{ :wordgen:wordgen_headerfooter.png }}

==== Images ====

As with the header/footer, images must be put in a subdirectory. Then you just use the standard ''%%<img>%%'' HTML tag. Example:

  * ''mydocument.htm'': The main document
  * ''mydocument_files/logo_google.png'': The image to include

Microsoft Office HTML Example
Here is an image:
Result in Word:

{{ :wordgen:wordgen_image.png }}

==== Styling ====

You can include a CSS stylesheet in the main HTML file: Word will use it. You can style standard HTML elements (''h1, h2, h3...''), and you can also apply styles with the ''class'' attribute.

==== Exploring other Word document features ====

If you are trying to find the HTML code corresponding to a Word feature, here's my advice:

  * Create a blank document in Word.
  * Type a few words and use the feature you need.
  * Save as an HTML page (Save as => Other format => Web page (*.htm,*.html), //**NOT**// filtered).

Then open the saved file with your favorite text editor. You are most likely to find the relevant HTML/CSS code there.

===== Creating a MIME (mhtml) file =====

Once you have created your HTML document with its associated files (header/footer, images...), you need to pack them in a single mhtml (MIME) file.

Let's take an example: a document with a header/footer and an image. The files contained in this zip ({{:wordgen:mime_example.zip}}) are:

  * ''mydocument.htm'': The main document
  * ''mydocument_files/headerfooter.htm'': The header and footer
  * ''mydocument_files/smiley.gif'': An image
Building a MIME file only requires to encode these files in base64 and add a header for each one: MIME-Version: 1.0 Content-Type: multipart/related; boundary="----=_NextPart_ZROIIZO.ZCZYUACXV.ZARTUI" ------=_NextPart_ZROIIZO.ZCZYUACXV.ZARTUI Content-Location: file:///C:/mydocument.htm Content-Transfer-Encoding: base64 Content-Type: text/html; charset="utf-8" PGh0bWwgeG1sbnM6bz0ndXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6b2ZmaWNlJyB4 bWxuczp3PSd1cm46c2NoZW1hcy1taWNyb3NvZnQtY29tOm9mZmljZTp3b3JkJyB4bWxucz0naHR0 cDovL3d3dy53My5vcmcvVFIvUkVDLWh0bWw0MCc+DQo8aGVhZD48dGl0bGU+TWljcm9zb2Z0IE9m ZmljZSBIVE1MIEV4YW1wbGU8L3RpdGxlPg0KPCEtLVtpZiBndGUgbXNvIDldPg0KPHhtbD48dzpX b3JkRG9jdW1lbnQ+PHc6Vmlldz5QcmludDwvdzpWaWV3Pjx3Olpvb20+MTAwPC93Olpvb20+PHc6 RG9Ob3RPcHRpbWl6ZUZvckJyb3dzZXIvPjwvdzpXb3JkRG9jdW1lbnQ+PC94bWw+DQo8IVtlbmRp Zl0tLT4NCjxsaW5rIHJlbD1GaWxlLUxpc3QgaHJlZj0ibXlkb2N1bWVudF9maWxlcy9maWxlbGlz dC54bWwiPg0KPHN0eWxlPjwhLS0gDQpAcGFnZQ0Kew0KICAgIHNpemU6MjFjbSAyOS43Y210OyAg LyogQTQgKi8NCiAgICBtYXJnaW46MWNtIDFjbSAxY20gMWNtOyAvKiBNYXJnaW5zOiAyLjUgY20g b24gZWFjaCBzaWRlICovDQogICAgbXNvLXBhZ2Utb3JpZW50YXRpb246IHBvcnRyYWl0OyAgDQoJ bXNvLWhlYWRlcjogdXJsKCJteWRvY3VtZW50X2ZpbGVzL2hlYWRlcmZvb3Rlci5odG0iKSBoMTsN Cgltc28tZm9vdGVyOiB1cmwoIm15ZG9jdW1lbnRfZmlsZXMvaGVhZGVyZm9vdGVyLmh0bSIpIGYx OwkNCn0NCkBwYWdlIFNlY3Rpb24xIHsgfQ0KZGl2LlNlY3Rpb24xIHsgcGFnZTpTZWN0aW9uMTsg fQ0KcC5Nc29IZWFkZXIsIHAuTXNvRm9vdGVyIHsgYm9yZGVyOiAxcHggc29saWQgYmxhY2s7IH0N Ci0tPjwvc3R5bGU+DQo8L2hlYWQ+DQo8Ym9keT4NCjxkaXYgY2xhc3M9U2VjdGlvbjE+DQpJJ20g cGFnZSAxIDxpbWcgc3JjPSJteWRvY3VtZW50X2ZpbGVzL3NtaWxleS5naWYiPg0KPGJyIGNsZWFy PWFsbCBzdHlsZT0nbXNvLXNwZWNpYWwtY2hhcmFjdGVyOmxpbmUtYnJlYWs7cGFnZS1icmVhay1i ZWZvcmU6YWx3YXlzJz4NCkknbSBwYWdlIDIuDQo8L2Rpdj4NCjwvYm9keT4NCjwvaHRtbD4NCg0K DQo= ------=_NextPart_ZROIIZO.ZCZYUACXV.ZARTUI Content-Location: file:///C:/mydocument_files/headerfooter.htm Content-Transfer-Encoding: base64 Content-Type: text/html; charset="utf-8" 
PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTp2bWwiIHhtbG5zOm89InVy bjpzY2hlbWFzLW1pY3Jvc29mdC1jb206b2ZmaWNlOm9mZmljZSIgeG1sbnM6dz0idXJuOnNjaGVt YXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6d29yZCIgeG1sbnM6bT0iaHR0cDovL3NjaGVtYXMubWlj cm9zb2Z0LmNvbS9vZmZpY2UvMjAwNC8xMi9vbW1sIj0geG1sbnM9Imh0dHA6Ly93d3cudzMub3Jn L1RSL1JFQy1odG1sNDAiPg0KPGJvZHk+DQoNCjxkaXYgc3R5bGU9Im1zby1lbGVtZW50OmhlYWRl cjsiIGlkPSJoMSI+DQo8cCBjbGFzcz1Nc29IZWFkZXI+SGVhZGVyPC9wPg0KPC9kaXY+DQoNCjxk aXYgc3R5bGU9J21zby1lbGVtZW50OmZvb3RlcicgaWQ9ZjE+DQo8cCBjbGFzcz1Nc29Gb290ZXI+ PHNwYW4gY2xhc3M9U3BlbGxFPkZvb3Rlcjwvc3Bhbj4gcGFnZSA8IS0tW2lmIHN1cHBvcnRGaWVs ZHNdPjxzcGFuDQpjbGFzcz1Nc29QYWdlTnVtYmVyPjxzcGFuIHN0eWxlPSdtc28tZWxlbWVudDpm aWVsZC1iZWdpbic+PC9zcGFuPjxzcGFuDQpzdHlsZT0nbXNvLXNwYWNlcnVuOnllcyc+oDwvc3Bh bj5QQUdFIDxzcGFuIHN0eWxlPSdtc28tZWxlbWVudDpmaWVsZC1zZXBhcmF0b3InPjwvc3Bhbj48 L3NwYW4+PCFbZW5kaWZdLS0+PHNwYW4NCmNsYXNzPU1zb1BhZ2VOdW1iZXI+PHNwYW4gc3R5bGU9 J21zby1uby1wcm9vZjp5ZXMnPjE8L3NwYW4+PC9zcGFuPjwhLS1baWYgc3VwcG9ydEZpZWxkc10+ PHNwYW4NCmNsYXNzPU1zb1BhZ2VOdW1iZXI+PHNwYW4gc3R5bGU9J21zby1lbGVtZW50OmZpZWxk LWVuZCc+PC9zcGFuPjwvc3Bhbj48IVtlbmRpZl0tLT48c3Bhbg0KY2xhc3M9TXNvUGFnZU51bWJl cj4vPC9zcGFuPjwhLS1baWYgc3VwcG9ydEZpZWxkc10+PHNwYW4gY2xhc3M9TXNvUGFnZU51bWJl cj48c3Bhbg0Kc3R5bGU9J21zby1lbGVtZW50OmZpZWxkLWJlZ2luJz48L3NwYW4+IE5VTVBBR0VT IDxzcGFuIHN0eWxlPSdtc28tZWxlbWVudDpmaWVsZC1zZXBhcmF0b3InPjwvc3Bhbj48L3NwYW4+ PCFbZW5kaWZdLS0+PHNwYW4NCmNsYXNzPU1zb1BhZ2VOdW1iZXI+PHNwYW4gc3R5bGU9J21zby1u by1wcm9vZjp5ZXMnPjE8L3NwYW4+PC9zcGFuPjwhLS1baWYgc3VwcG9ydEZpZWxkc10+PHNwYW4N CmNsYXNzPU1zb1BhZ2VOdW1iZXI+PHNwYW4gc3R5bGU9J21zby1lbGVtZW50OmZpZWxkLWVuZCc+ PC9zcGFuPjwvc3Bhbj48IVtlbmRpZl0tLT4NCjwvcD4NCjwvZGl2Pg0KDQo8L2JvZHk+DQo8L2h0 bWw+ ------=_NextPart_ZROIIZO.ZCZYUACXV.ZARTUI Content-Location: file:///C:/mydocument_files/smiley.gif Content-Transfer-Encoding: base64 Content-Type: image/gif R0lGODlhEgASAOeQAEM0EP/lIf/mIv/jH//mI//kIP/lIvC2AN3e3f/kH9LQynBeMPvYD/zZEens 
7/fLAPvXDffMAM7MxJ2Sd//iHeyrAOecAGBLF//kIfrTCOfq7dfX0//iHv7gG3dmO3hnPPC3AGRP Hf3fGfjMAJCEY+miAMzKwfvXDl9JFPK9AKZxBKCXfP7hHaGXfffKAPbJAP3dF++yAO+xAPbIAJCD YtfX0mhSHW9TB52TeG1NB8KLAmlTHvTEAK1+A//nI/7iHpySdt6uAWdQHK+KBPrSB+jr7vPBANeu AtSmAvzbE/7iHXhmPG5OB8ukAuemAJ6UePvZEG5QB/bKAHFTB5hrBKZsA9+0AfjOAsqEAvzbFM6G AZViBKyBBPPCAPPGAOyqAG9NB/TCAN+XAP7fGfG8AJKGZv7jH/7fGrt3Ap+VesKMAvTMBnZlOvTD AO25ALGMA/3cFmBKFuaZAM+NAcvJwHVjN/LEAKRrBJpwBPjPBu2uAJ99BPK/AGxLB6+EA+qpALGN A6h2BP7gGqVqA/XHAP3gGuOVAOCTAN+VAHNiNc3Lw3NXB8yLAvO/AGhRHfC4AP////////////// //////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////// //////////////////////////////////////////////////////////////////////////// /////////////////////////////////////////////////////yH5BAEAAP8ALAAAAAASABIA AAj+AP8JHNjiw4ULH1oMXDhwwo0hRwjxQMIlygSGAhfsWQOhQYMTRFy4wbOAoY0mEOCcoUChAwwG I1Ko2TGQBCAIYzgMGFCgwABBDB6ACFRG4KI8MHQCKIAhAIABIjKE+cPkH443EDrwBBAggAGuLKC8 kKEiTR0rDShsNSBAAAADCZI8ODBnyQUvaZkGEECgr4AEWSIcEBMijp0TS73y9UFAgBkAUmIgCpEo SIanTfn2DVAIQJsKWtgA8eOCwVPFAgIoARDhUYkqT/5NITPCNIsECX6IAHClUQVDYATS6JHiwWUA yFnz0WNhEImBQnQcMDLjQYQXXUB8sYDGEcMFVJwiyDhwIEaFEoe2lMQ4IYcKRhbkYLnT5yLGgSs8 oEDhYQXGgAA7 ------=_NextPart_ZROIIZO.ZCZYUACXV.ZARTUI-- Please note: * The boundary can be any string you want, as long as you can't find it in data. (Using a dot is a safe bet because base64 data can't contain dots.) * Take extra care of dashes before and after the boundary marker. * Take extra care of empty lines. They are important. * You can use other encodings than base64 (such as quoted-printable), but base64 is a safe bet and works on all kind of files. 
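To illustrate the first two notes, here is a small sketch that picks a boundary and checks it against the encoded data. The function name ''mime_pick_boundary()'' is hypothetical, and the random suffix is my own choice (the example above uses a fixed string).

```php
<?php
// Hypothetical helper: picks a random boundary and checks that it does
// not occur in any of the already-encoded parts.
function mime_pick_boundary(array $encodedParts): string
{
    do {
        // The dot cannot occur in base64 output, so base64 parts can
        // never contain this string; the loop covers other encodings.
        $boundary = '----=_NextPart_' . bin2hex(random_bytes(8)) . '.' . bin2hex(random_bytes(8));
        $clash = false;
        foreach ($encodedParts as $part) {
            if (strpos($part, $boundary) !== false) {
                $clash = true;
                break;
            }
        }
    } while ($clash);
    return $boundary;
}

// One encoded part, wrapped at 76 characters per line.
$parts = [
    chunk_split(base64_encode('<html><body>Hello, world!</body></html>')),
];
$boundary = mime_pick_boundary($parts);

// Each part is introduced by "--" + boundary; the document is closed
// by "--" + boundary + "--".
echo '--' . $boundary, "\n";
echo '--' . $boundary . '--', "\n";
```

This is why the separator lines in the example start with six dashes: two delimiter dashes followed by the four dashes that belong to the boundary string itself.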
This file can be renamed to .doc and opened in Word:

{{ :wordgen:wordgen_mime.png }}

**Amusing fact:** mhtml (MIME) documents are usually smaller than their true .doc counterparts. For example, the previous ''mime_example.doc'' (which is a MIME/mhtml file) is **5216 bytes** long. Resaved in true .doc format, it is **24064 bytes**.

==== A basic MIME 1.0 class helper ====

Here is a basic MIME 1.0 class which will help you generate the mhtml files:

<code php>
class mime10class
{
    // Boundary between parts; it must never appear in the encoded data.
    const boundary = '----=_NextPart_ERTUP.EFETZ.FTYIIBVZR.EYUUREZ';

    private $data;

    function __construct()
    {
        // Top-level MIME header announcing a multipart/related document.
        $this->data = "MIME-Version: 1.0\nContent-Type: multipart/related; boundary=\"" . self::boundary . "\"\n\n";
    }

    // Adds one file: part header, empty line, then the base64-encoded content.
    public function addFile($filepath, $contenttype, $data)
    {
        $this->data .= '--' . self::boundary . "\n"
            . 'Content-Location: file:///C:/' . str_replace('\\', '/', $filepath) . "\n"
            . "Content-Transfer-Encoding: base64\n"
            . 'Content-Type: ' . $contenttype . "\n\n";
        $this->data .= base64_encode($data) . "\n\n";
    }

    // Returns the complete document, closed by the final boundary marker.
    public function getFile()
    {
        return $this->data . '--' . self::boundary . '--';
    }
}
</code>

It's rather simple: add files with ''addFile()'', then get the final mhtml/MIME 1.0 document with ''getFile()''. Example:

<code php>
header('Content-Type: application/msword');
header('Content-disposition: filename=mydocument.doc');
$doc = new mime10class();
$doc->addFile('mydocument.htm', 'text/html; charset="utf-8"', 'Hello, world !');
$doc->addFile('subdir\anotherfile.htm', 'text/html; charset="utf-8"', 'Hi there.');
echo $doc->getFile();
</code>

===== Sending the file to the client =====

The .mht file must be served to the client with the following HTTP headers:

  Content-Type: application/msword
  Content-disposition: attachment; filename=myfile.doc

Yes, we use ".doc" in order not to confuse the end user. Word will recognize that this is a .mht file and will open it accordingly.
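A small sketch of a helper that builds these two headers. The function name ''word_download_headers()'' and the filename sanitizing are my own assumptions. When no download is forced, the sketch uses the ''inline'' disposition type in place of the bare ''filename='' form, since the header grammar expects a disposition type; that substitution is an assumption I have not tested against every browser.

```php
<?php
// Hypothetical helper: returns the HTTP header lines for serving the
// generated file. It only builds strings, so it can be checked before
// anything is sent.
function word_download_headers(string $filename, bool $forceDownload = true): array
{
    // Keep only characters that are safe inside a quoted filename.
    $safe = preg_replace('/[^A-Za-z0-9._-]/', '_', $filename);
    $type = $forceDownload ? 'attachment' : 'inline';
    return [
        'Content-Type: application/msword',
        'Content-Disposition: ' . $type . '; filename="' . $safe . '"',
    ];
}

foreach (word_download_headers('myfile.doc') as $line) {
    // header($line);   // uncomment when running under a web server
    echo $line, "\n";
}
```

The headers must be sent before any part of the MIME document is echoed.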
Note that:

  Content-disposition: attachment; filename=myfile.doc

will force a download, but:

  Content-disposition: filename=myfile.doc

will let the user choose between "open" and "save".

===== Examples =====

Full, working HTML documents which load correctly in Microsoft Word, using many Word features: styling, tables, headers & footers, page format, table of contents, images...

FIXME

===== PHP code examples =====

FIXME

===== Performance =====

I have implemented this system in a professional environment, and we are able to generate a **70-page**, database-driven, dynamically filled document with lots of tables and references in less than **5 seconds**. (The document contains no images; the server runs Apache+PHP+Oracle. Document templates are written in Smarty.) The performance is //excellent//, much better than I expected.

===== Ideas worth pondering =====

  * Using a [[http://en.wikipedia.org/wiki/Lightweight_markup_language|markup language]] (Markdown? Textile?...) which generates HTML may ease the creation and maintenance of templates.
  * With PHP, a templating engine like [[http://www.smarty.net/about_smarty|Smarty]] should help to create and maintain document templates without touching the PHP code too much. Template inclusion can help to create, for example, standard headers for all documents.
  * Serving files with gzip compression may improve the user experience (our ''mime_example.doc'' above goes from 5216 bytes to 2569 bytes with default gzip compression).

~~DISCUSSION:closed~~