Word document generation

This page explores solutions for generating Word documents on the server side (e.g. on a web server). My need is to automatically generate complete documents that users can edit afterwards, hence the choice of the Word format.

Possible solutions

  • Generate binary .doc files. Are you kidding me?
  • Generate .docx files (after all, that's XML, isn't it?). BANNED. I don't have time to read a 7,500-page specification that no one is capable of implementing, not even Microsoft!
  • Use the COM control to pilot Word on the server. BANNED. Microsoft strongly advises against this solution because Office is "unstable": "Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, ASP.NET, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when Office is run in this environment." Doh!
  • Generate RTF documents. ABANDONED. This format is not documented well enough.
  • OpenOffice.org/LibreOffice piloting. ABANDONED. I don't want to mess with OOo automation (last time I checked, the Python API was awful), and it would not be more reliable than MS Office.
  • Expensive commercial libraries. ABANDONED. I don't want my applications tied too closely to a commercial API, which would be a pain in the ass to migrate away from if the company goes bankrupt or causes commercial problems (phpdocx, anyone?).
  • Apache POI. ABANDONED. The main maintainer left the project.
  • HTML generation: CHOSEN.

Why HTML ?

  • Generating HTML is easy.
  • Generating HTML documents is fast (orders of magnitude faster than using a COM control).
  • Word can open HTML documents as if they were native Word files. Heck, you can just rename the HTML file to .doc and Word will open it!
  • HTML documents are easy to style. Word accepts most HTML & CSS directives (font, size, color, tables, alignment…).
  • MS Office also recognizes some Microsoft-specific HTML elements and CSS attributes which can be used to insert page numbers, a table of contents and so on…
  • MS Office-specific HTML is more or less documented (here)

Trick: Embedding images requires external files, so we cannot use a single HTML file. The solution is simple: generate several files (the main document in HTML, images…), then put everything in a single MIME 1.0 file. This is exactly what MS Word does when you save a document as mhtml (.mht). Headers and footers must also be put in a separate file; they are included in the MIME file as well.

Generating MIME 1.0 files is easy, even in php.

So our solution can be summed up as:

  • Generate the main document in HTML, using some specific Word HTML elements.
  • Include any additional data in separate files (images, etc.)
  • Pack all the files in a MIME 1.0 file.
  • Serve this file as a .doc to the client.

Building a Word HTML document

Here are the different items you can/must use to build a Microsoft Office Word HTML document. These are only snippets, not full html documents. (Full HTML example documents will be in the "Examples" section.)

Basically, you generate an HTML document and then serve it to the client as a .doc file (both in filename extension and in MIME type). Word will open this file as if it were a plain .doc file.

HTML declaration

<html xmlns:o='urn:schemas-microsoft-com:office:office' xmlns:w='urn:schemas-microsoft-com:office:word' xmlns='http://www.w3.org/TR/REC-html40'>
...
</html>

Document properties

<html xmlns:o='urn:schemas-microsoft-com:office:office' xmlns:w='urn:schemas-microsoft-com:office:word' xmlns='http://www.w3.org/TR/REC-html40'>
<head><title>Microsoft Office HTML Example</title>
<style> <!-- 
@page
{
    size: 21cm 29.7cm;  /* A4 */
    margin: 2cm 2cm 2cm 2cm; /* Margins: 2 cm on each side */
    mso-page-orientation: portrait;  
}
--></style>
  • size: Page size.
  • margin : Document margins.
  • mso-page-orientation: portrait or landscape.
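
As a rough illustration, the @page rule above could be generated from PHP so that page size and margins become parameters. The helper name wordPageCss() and its parameters are hypothetical, not part of any library. The landscape branch follows Paul's remark in the Discussion at the end of this page: for landscape to take effect, the page dimensions must be swapped as well.

```php
<?php
// Hypothetical helper (not from the original article): builds the @page rule.
function wordPageCss(float $widthCm = 21.0, float $heightCm = 29.7, float $marginCm = 2.0, bool $landscape = false): string
{
    // For landscape, swap width and height (see Paul's comment in the Discussion).
    if ($landscape) {
        [$widthCm, $heightCm] = [$heightCm, $widthCm];
    }
    $orientation = $landscape ? 'landscape' : 'portrait';

    return "@page\n{\n"
        . "    size: {$widthCm}cm {$heightCm}cm;\n"
        . "    margin: {$marginCm}cm {$marginCm}cm {$marginCm}cm {$marginCm}cm;\n"
        . "    mso-page-orientation: {$orientation};\n"
        . "}\n";
}

// A4 portrait with 2 cm margins, ready to be placed inside <style><!-- ... --></style>.
echo wordPageCss();
```

Calling wordPageCss(21.0, 29.7, 2.0, true) would produce the landscape variant of the same page.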

Page declaration

You are supposed to put pages (or groups of pages) in a "section" (in a DIV), like this:

<html xmlns:o='urn:schemas-microsoft-com:office:office' xmlns:w='urn:schemas-microsoft-com:office:word' xmlns='http://www.w3.org/TR/REC-html40'>
<head><title>Microsoft Office HTML Example</title>
<style><!-- 
@page
{
    size:21cm 29.7cm;  /* A4 */
    margin:1cm 1cm 1cm 1cm; /* Margins: 1 cm on each side */
    mso-page-orientation: portrait;  
}
@page Section1 { }
div.Section1 { page:Section1; }
--></style>
</head>
<body>
<div class=Section1>
I'm page 1.
<br clear=all style='mso-special-character:line-break;page-break-before:always'>
I'm page 2.
</div>
</body>
</html>
  • @page is used to set properties of the whole document.
  • Each @page SectionX can be used to change the properties of a group of pages.
  • Each page or group of pages must be put in a <div> (such as <div class=Section1> in our example). Feel free to create as many 'sections' as needed.

Caveat: Changing things like page orientation for a group or a specific page does not seem to work. FIXME

Standard HTML/CSS elements

Word will accept most standard HTML and CSS features, such as headings (h1,h2,h3…), lists (ul,li), tables, colors… Go experiment yourself. Here are some examples:

basic_html.doc
<html xmlns:o='urn:schemas-microsoft-com:office:office' xmlns:w='urn:schemas-microsoft-com:office:word' xmlns='http://www.w3.org/TR/REC-html40'>
<head><title>Microsoft Office HTML Example</title></head>
<body>
<h1>Title level 1</h1>
<h2>Title level 2</h2>
<h3>Title level 3</h3>
<p>Text in level 3</p>
<h2>2nd title level 2</h2>
<h3>Another level 3 title</h3>
 
List:
<ul>
<li>element 1</li>
<li>element 2</li>
<li>element 3</li>
  <ul>
  <li>element 4</li>
  <li>element 5</li>
  <li>element 6</li>
      <ul>
      <li>element 7</li>
      <li>element 8</li>
      </ul>
  </ul>
<li>element 9</li>
<li>element 10</li>
</ul>
 
<table width="100%">
<thead style="background-color:#A0A0FF;"><td nowrap>Column A</td><td nowrap>Column B</td><td nowrap>Column C</td></thead>
<tr><td>A1</td><td>B1</td><td>C1</td></tr>
<tr><td>A2</td><td>B2 Test with looooong text: Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla sed sapien 
ac tortor porttitor lobortis. Donec velit urna, vulputate eu egestas eu, lacinia non dolor. Cras lacus diam, tempus 
sed ullamcorper a, euismod id nunc. Fusce egestas velit sed est fermentum tempus. Duis sapien dui, consectetur eu 
accumsan id, tristique sit amet ante.</td><td>C2</td></tr>
<tr><td>A3</td><td>B3</td><td>C3</td></tr>
</table>
 
</body>
</html>

Rendering in Word:

Note that these elements can later be styled, either using inline CSS (<p style="…">) or using CSS stylesheets (eg. you can style all h1 elements).

Forcing display mode on opening

<html xmlns:o='urn:schemas-microsoft-com:office:office' xmlns:w='urn:schemas-microsoft-com:office:word' xmlns='http://www.w3.org/TR/REC-html40'>
<head><title>Microsoft Office HTML Example</title>
<!--[if gte mso 9]>
<xml>
<w:WordDocument>
<w:View>Print</w:View>
<w:Zoom>100</w:Zoom>
<w:DoNotOptimizeForBrowser/>
</w:WordDocument>
</xml>
<![endif]-->
</head>
<body>
...

This will force the "Page" display mode when the file is opened. This section must be put just after the title, otherwise it will not work. You can use 80 or 90 for Zoom if you want two pages to fit on screen.

Page break

<br clear=all style='mso-special-character:line-break;page-break-before:always'>
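
When pages are generated separately, they can be joined with this element. The sketch below assumes each page is already an HTML fragment; the constant and function names are hypothetical.

```php
<?php
// The page break element shown above, kept in one place.
const WORD_PAGE_BREAK = "<br clear=all style='mso-special-character:line-break;page-break-before:always'>";

// Hypothetical helper: joins HTML fragments, inserting a page break between them.
function wordJoinPages(array $pages): string
{
    return implode("\n" . WORD_PAGE_BREAK . "\n", $pages);
}

echo "<div class=Section1>\n" . wordJoinPages(["I'm page 1.", "I'm page 2."]) . "\n</div>\n";
```

No break is emitted before the first page or after the last one, so the document does not start or end with an empty page.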

Tables and pagebreaks

Prevent a table cell from spanning over multiple pages

Put in your stylesheet:

td { page-break-inside:avoid; }

or apply to only specific cells:

<td style="page-break-inside:avoid;">...</td>

Prevent tables from spanning over multiple pages

Put in your stylesheet:

tr { page-break-after:avoid; }

or apply to all TR of a table:

<tr style="page-break-after:avoid;">...</tr>

A note about computed fields

Computed fields include the TOC (table of contents), page references and so on.

When you open a document in Word, computed fields are not updated by default. This has to be done manually by pressing CTRL+A (to select the whole document) and then F9.

Thus, the computed fields you insert in your document will not show up unless the user manually updates them. This is a problem related to Word itself. There is no simple solution to this.

TOC (Table of contents)

<p class=MsoToc1> 
<!--[if supportFields]> 
<span style='mso-element:field-begin'></span> 
TOC \o "1-3" \u 
<span style='mso-element:field-separator'></span> 
<![endif]--> 
<span style='mso-no-proof:yes'>Table of contents - Please right-click and choose "Update fields".</span> 
<!--[if supportFields]> 
<span style='mso-element:field-end'></span> 
<![endif]--> 
</p>

As you can't predict the page numbers, the TOC needs to be manually updated by the user (Not a heavy burden, and I guess it's possible to automate this by including a script in the file. I'll investigate that later.)

TOC before update (upon file opening):

TOC after update: It reflects the different heading levels (h1,h2,h3…).

If you want to customize the TOC, have a look at the Microsoft documentation about this dynamic field.
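
If several templates need a TOC, the snippet above can be wrapped in a small function. This is only a sketch: wordTocField() is a hypothetical name, and the only thing it makes configurable is the deepest heading level passed to the \o switch.

```php
<?php
// Hypothetical helper: returns the TOC field markup shown above.
function wordTocField(int $maxLevel = 3, string $placeholder = 'Table of contents - Please right-click and choose "Update fields".'): string
{
    // Word has heading levels 1 to 9, so clamp the requested depth to that range.
    $maxLevel = max(1, min(9, $maxLevel));

    return "<p class=MsoToc1>\n"
        . "<!--[if supportFields]>\n"
        . "<span style='mso-element:field-begin'></span>\n"
        . "TOC \\o \"1-{$maxLevel}\" \\u\n"
        . "<span style='mso-element:field-separator'></span>\n"
        . "<![endif]-->\n"
        // Text shown until the user updates the field.
        . "<span style='mso-no-proof:yes'>" . htmlspecialchars($placeholder) . "</span>\n"
        . "<!--[if supportFields]>\n"
        . "<span style='mso-element:field-end'></span>\n"
        . "<![endif]-->\n"
        . "</p>\n";
}

echo wordTocField(2);   // TOC limited to h1 and h2
```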

Bookmarks and references

You can reference another chapter or page rather easily. Here is an example: Set a bookmark in a document, and display the page where this bookmark is located.

Bookmarks: Simply put a html anchor:

<a name="MyBookmark"></a>Appendix

Then the reference:

For more information, see appendix at page 
<!--[if supportFields]>
<span style='mso-element:field-begin'></span>PAGEREF MyBookmark \h <span style='mso-element:field-end'></span>
<![endif]-->
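
The bookmark and its reference must use exactly the same name, so it may help to generate both from PHP. The helpers below are hypothetical. The name check (a letter first, then letters, digits or underscores) is a conservative assumption on my part, meant to keep unexpected characters out of the field code.

```php
<?php
// Hypothetical helpers, not part of the original article.

// Reject names that could break the HTML attribute or the field code.
// The allowed character set is a conservative assumption.
function wordCheckBookmarkName(string $name): string
{
    if (!preg_match('/^[A-Za-z][A-Za-z0-9_]*$/', $name)) {
        throw new InvalidArgumentException("Unsafe bookmark name: $name");
    }
    return $name;
}

// Anchor marking the bookmark location.
function wordBookmark(string $name): string
{
    return '<a name="' . wordCheckBookmarkName($name) . '"></a>';
}

// PAGEREF field: shows the page number of the bookmark once fields are updated.
function wordPageRef(string $name): string
{
    return "<!--[if supportFields]>"
        . "<span style='mso-element:field-begin'></span>"
        . "PAGEREF " . wordCheckBookmarkName($name) . " \\h "
        . "<span style='mso-element:field-end'></span>"
        . "<![endif]-->";
}

echo wordBookmark('MyBookmark') . "Appendix\n";
echo 'For more information, see appendix at page ' . wordPageRef('MyBookmark') . "\n";
```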

Headers and footers

Headers and footers must be put in a separate file, in a subdirectory. Example:

  • mydocument.htm : The main document
  • mydocument_files\headerfooter.htm : The header and footer.

Note: It is important that the subdirectory name starts with the main document name (mydocument.htm → mydocument_files), otherwise Word will display a warning.

mydocument.htm
<html xmlns:o='urn:schemas-microsoft-com:office:office' xmlns:w='urn:schemas-microsoft-com:office:word' xmlns='http://www.w3.org/TR/REC-html40'>
<head><title>Microsoft Office HTML Example</title>
<link rel=File-List href="mydocument_files/filelist.xml">
<style><!-- 
@page
{
    size:21cm 29.7cm;  /* A4 */
    margin:1cm 1cm 1cm 1cm; /* Margins: 1 cm on each side */
    mso-page-orientation: portrait;  
    mso-header: url("mydocument_files/headerfooter.htm") h1;
    mso-footer: url("mydocument_files/headerfooter.htm") f1;	
}
@page Section1 { }
div.Section1 { page:Section1; }
p.MsoHeader, p.MsoFooter { border: 1px solid black; }
--></style>
</head>
<body>
<div class=Section1>
I'm page 1.
<br clear=all style='mso-special-character:line-break;page-break-before:always'>
I'm page 2.
</div>
</body>
</html>

Note that the file filelist.xml does not need to be present, but its declaration in the main document is mandatory.

mydocument_files\headerfooter.htm
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<body>
 
<div style="mso-element:header;" id="h1">
<p class=MsoHeader>Header</p>
</div>
 
<div style='mso-element:footer' id=f1>
<p class=MsoFooter><span class=SpellE>Footer</span> page <!--[if supportFields]><span
class=MsoPageNumber><span style='mso-element:field-begin'></span><span
style='mso-spacerun:yes'> </span>PAGE <span style='mso-element:field-separator'></span></span><![endif]--><span
class=MsoPageNumber><span style='mso-no-proof:yes'>1</span></span><!--[if supportFields]><span
class=MsoPageNumber><span style='mso-element:field-end'></span></span><![endif]--><span
class=MsoPageNumber>/</span><!--[if supportFields]><span class=MsoPageNumber><span
style='mso-element:field-begin'></span> NUMPAGES <span style='mso-element:field-separator'></span></span><![endif]--><span
class=MsoPageNumber><span style='mso-no-proof:yes'>1</span></span><!--[if supportFields]><span
class=MsoPageNumber><span style='mso-element:field-end'></span></span><![endif]-->
</p>
</div>
 
</body>
</html>

After opening the document, don't forget to go in "Page" display mode to see headers/footers.

This is what you get:

Images

As with the header/footer, images must be put in a subdirectory. Then you just use the standard <img> HTML tag. Example:

  • mydocument.htm : The main document
  • mydocument_files/logo_google.png : The image to include.
mydocument.htm
<html xmlns:o='urn:schemas-microsoft-com:office:office' xmlns:w='urn:schemas-microsoft-com:office:word' xmlns='http://www.w3.org/TR/REC-html40'>
<head><title>Microsoft Office HTML Example</title>
<link rel=File-List href="mydocument_files/filelist.xml">
<style><!-- 
@page
{
    size:21cm 29.7cm;  /* A4 */
    margin:1cm 1cm 1cm 1cm; /* Margins: 1 cm on each side */
    mso-page-orientation: portrait;  
}
@page Section1 { }
div.Section1 { page:Section1; }
p.MsoHeader, p.MsoFooter { border: 1px solid black; }
--></style>
</head>
<body>
<div class=Section1>
Here is an image:<br>
<img src="mydocument_files/logo_google.png">
</div>
</body>
</html>

Result in Word:

Styling

You can include a CSS stylesheet in the main html file: Word will use it.

You can style standard HTML elements (h1,h2,h3…), but you can also apply styles with the class attribute.

Exploring other Word documents features

If you are trying to find the HTML code corresponding to a Word feature, here's my advice:

  • Create a blank document in Word.
  • Type a few words and use the feature you need.
  • Save as HTML page (Save as ⇒ Other format ⇒ Web page (*.htm,*.html) (NOT filtered))

Then open the saved file with your favorite text editor. You will most likely find the relevant HTML/CSS code.

Creating a MIME (mhtml) file

Once you have created your html document with its associated files (headers/footer, images…), you need to pack them in a single mhtml (MIME) file.

Let's take an example: A document with header/footer and an image. The files contained in this zip (mime_example.zip) are:

  • mydocument.htm : The main document
  • mydocument_files/headerfooter.htm : The header and footer
  • mydocument_files/smiley.gif : An image.

Building a MIME file only requires encoding these files in base64 and adding a header for each one:

mime_example.doc
MIME-Version: 1.0
Content-Type: multipart/related; boundary="----=_NextPart_ZROIIZO.ZCZYUACXV.ZARTUI"
 
------=_NextPart_ZROIIZO.ZCZYUACXV.ZARTUI
Content-Location: file:///C:/mydocument.htm
Content-Transfer-Encoding: base64
Content-Type: text/html; charset="utf-8"
 
PGh0bWwgeG1sbnM6bz0ndXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6b2ZmaWNlJyB4
bWxuczp3PSd1cm46c2NoZW1hcy1taWNyb3NvZnQtY29tOm9mZmljZTp3b3JkJyB4bWxucz0naHR0
cDovL3d3dy53My5vcmcvVFIvUkVDLWh0bWw0MCc+DQo8aGVhZD48dGl0bGU+TWljcm9zb2Z0IE9m
ZmljZSBIVE1MIEV4YW1wbGU8L3RpdGxlPg0KPCEtLVtpZiBndGUgbXNvIDldPg0KPHhtbD48dzpX
b3JkRG9jdW1lbnQ+PHc6Vmlldz5QcmludDwvdzpWaWV3Pjx3Olpvb20+MTAwPC93Olpvb20+PHc6
RG9Ob3RPcHRpbWl6ZUZvckJyb3dzZXIvPjwvdzpXb3JkRG9jdW1lbnQ+PC94bWw+DQo8IVtlbmRp
Zl0tLT4NCjxsaW5rIHJlbD1GaWxlLUxpc3QgaHJlZj0ibXlkb2N1bWVudF9maWxlcy9maWxlbGlz
dC54bWwiPg0KPHN0eWxlPjwhLS0gDQpAcGFnZQ0Kew0KICAgIHNpemU6MjFjbSAyOS43Y210OyAg
LyogQTQgKi8NCiAgICBtYXJnaW46MWNtIDFjbSAxY20gMWNtOyAvKiBNYXJnaW5zOiAyLjUgY20g
b24gZWFjaCBzaWRlICovDQogICAgbXNvLXBhZ2Utb3JpZW50YXRpb246IHBvcnRyYWl0OyAgDQoJ
bXNvLWhlYWRlcjogdXJsKCJteWRvY3VtZW50X2ZpbGVzL2hlYWRlcmZvb3Rlci5odG0iKSBoMTsN
Cgltc28tZm9vdGVyOiB1cmwoIm15ZG9jdW1lbnRfZmlsZXMvaGVhZGVyZm9vdGVyLmh0bSIpIGYx
OwkNCn0NCkBwYWdlIFNlY3Rpb24xIHsgfQ0KZGl2LlNlY3Rpb24xIHsgcGFnZTpTZWN0aW9uMTsg
fQ0KcC5Nc29IZWFkZXIsIHAuTXNvRm9vdGVyIHsgYm9yZGVyOiAxcHggc29saWQgYmxhY2s7IH0N
Ci0tPjwvc3R5bGU+DQo8L2hlYWQ+DQo8Ym9keT4NCjxkaXYgY2xhc3M9U2VjdGlvbjE+DQpJJ20g
cGFnZSAxIDxpbWcgc3JjPSJteWRvY3VtZW50X2ZpbGVzL3NtaWxleS5naWYiPg0KPGJyIGNsZWFy
PWFsbCBzdHlsZT0nbXNvLXNwZWNpYWwtY2hhcmFjdGVyOmxpbmUtYnJlYWs7cGFnZS1icmVhay1i
ZWZvcmU6YWx3YXlzJz4NCkknbSBwYWdlIDIuDQo8L2Rpdj4NCjwvYm9keT4NCjwvaHRtbD4NCg0K
DQo=
 
------=_NextPart_ZROIIZO.ZCZYUACXV.ZARTUI
Content-Location: file:///C:/mydocument_files/headerfooter.htm
Content-Transfer-Encoding: base64
Content-Type: text/html; charset="utf-8"
 
PGh0bWwgeG1sbnM6dj0idXJuOnNjaGVtYXMtbWljcm9zb2Z0LWNvbTp2bWwiIHhtbG5zOm89InVy
bjpzY2hlbWFzLW1pY3Jvc29mdC1jb206b2ZmaWNlOm9mZmljZSIgeG1sbnM6dz0idXJuOnNjaGVt
YXMtbWljcm9zb2Z0LWNvbTpvZmZpY2U6d29yZCIgeG1sbnM6bT0iaHR0cDovL3NjaGVtYXMubWlj
cm9zb2Z0LmNvbS9vZmZpY2UvMjAwNC8xMi9vbW1sIj0geG1sbnM9Imh0dHA6Ly93d3cudzMub3Jn
L1RSL1JFQy1odG1sNDAiPg0KPGJvZHk+DQoNCjxkaXYgc3R5bGU9Im1zby1lbGVtZW50OmhlYWRl
cjsiIGlkPSJoMSI+DQo8cCBjbGFzcz1Nc29IZWFkZXI+SGVhZGVyPC9wPg0KPC9kaXY+DQoNCjxk
aXYgc3R5bGU9J21zby1lbGVtZW50OmZvb3RlcicgaWQ9ZjE+DQo8cCBjbGFzcz1Nc29Gb290ZXI+
PHNwYW4gY2xhc3M9U3BlbGxFPkZvb3Rlcjwvc3Bhbj4gcGFnZSA8IS0tW2lmIHN1cHBvcnRGaWVs
ZHNdPjxzcGFuDQpjbGFzcz1Nc29QYWdlTnVtYmVyPjxzcGFuIHN0eWxlPSdtc28tZWxlbWVudDpm
aWVsZC1iZWdpbic+PC9zcGFuPjxzcGFuDQpzdHlsZT0nbXNvLXNwYWNlcnVuOnllcyc+oDwvc3Bh
bj5QQUdFIDxzcGFuIHN0eWxlPSdtc28tZWxlbWVudDpmaWVsZC1zZXBhcmF0b3InPjwvc3Bhbj48
L3NwYW4+PCFbZW5kaWZdLS0+PHNwYW4NCmNsYXNzPU1zb1BhZ2VOdW1iZXI+PHNwYW4gc3R5bGU9
J21zby1uby1wcm9vZjp5ZXMnPjE8L3NwYW4+PC9zcGFuPjwhLS1baWYgc3VwcG9ydEZpZWxkc10+
PHNwYW4NCmNsYXNzPU1zb1BhZ2VOdW1iZXI+PHNwYW4gc3R5bGU9J21zby1lbGVtZW50OmZpZWxk
LWVuZCc+PC9zcGFuPjwvc3Bhbj48IVtlbmRpZl0tLT48c3Bhbg0KY2xhc3M9TXNvUGFnZU51bWJl
cj4vPC9zcGFuPjwhLS1baWYgc3VwcG9ydEZpZWxkc10+PHNwYW4gY2xhc3M9TXNvUGFnZU51bWJl
cj48c3Bhbg0Kc3R5bGU9J21zby1lbGVtZW50OmZpZWxkLWJlZ2luJz48L3NwYW4+IE5VTVBBR0VT
IDxzcGFuIHN0eWxlPSdtc28tZWxlbWVudDpmaWVsZC1zZXBhcmF0b3InPjwvc3Bhbj48L3NwYW4+
PCFbZW5kaWZdLS0+PHNwYW4NCmNsYXNzPU1zb1BhZ2VOdW1iZXI+PHNwYW4gc3R5bGU9J21zby1u
by1wcm9vZjp5ZXMnPjE8L3NwYW4+PC9zcGFuPjwhLS1baWYgc3VwcG9ydEZpZWxkc10+PHNwYW4N
CmNsYXNzPU1zb1BhZ2VOdW1iZXI+PHNwYW4gc3R5bGU9J21zby1lbGVtZW50OmZpZWxkLWVuZCc+
PC9zcGFuPjwvc3Bhbj48IVtlbmRpZl0tLT4NCjwvcD4NCjwvZGl2Pg0KDQo8L2JvZHk+DQo8L2h0
bWw+
 
------=_NextPart_ZROIIZO.ZCZYUACXV.ZARTUI
Content-Location: file:///C:/mydocument_files/smiley.gif
Content-Transfer-Encoding: base64
Content-Type: image/gif
 
R0lGODlhEgASAOeQAEM0EP/lIf/mIv/jH//mI//kIP/lIvC2AN3e3f/kH9LQynBeMPvYD/zZEens
7/fLAPvXDffMAM7MxJ2Sd//iHeyrAOecAGBLF//kIfrTCOfq7dfX0//iHv7gG3dmO3hnPPC3AGRP
Hf3fGfjMAJCEY+miAMzKwfvXDl9JFPK9AKZxBKCXfP7hHaGXfffKAPbJAP3dF++yAO+xAPbIAJCD
YtfX0mhSHW9TB52TeG1NB8KLAmlTHvTEAK1+A//nI/7iHpySdt6uAWdQHK+KBPrSB+jr7vPBANeu
AtSmAvzbE/7iHXhmPG5OB8ukAuemAJ6UePvZEG5QB/bKAHFTB5hrBKZsA9+0AfjOAsqEAvzbFM6G
AZViBKyBBPPCAPPGAOyqAG9NB/TCAN+XAP7fGfG8AJKGZv7jH/7fGrt3Ap+VesKMAvTMBnZlOvTD
AO25ALGMA/3cFmBKFuaZAM+NAcvJwHVjN/LEAKRrBJpwBPjPBu2uAJ99BPK/AGxLB6+EA+qpALGN
A6h2BP7gGqVqA/XHAP3gGuOVAOCTAN+VAHNiNc3Lw3NXB8yLAvO/AGhRHfC4AP//////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
/////////////////////////////////////////////////////yH5BAEAAP8ALAAAAAASABIA
AAj+AP8JHNjiw4ULH1oMXDhwwo0hRwjxQMIlygSGAhfsWQOhQYMTRFy4wbOAoY0mEOCcoUChAwwG
I1Ko2TGQBCAIYzgMGFCgwABBDB6ACFRG4KI8MHQCKIAhAIABIjKE+cPkH443EDrwBBAggAGuLKC8
kKEiTR0rDShsNSBAAAADCZI8ODBnyQUvaZkGEECgr4AEWSIcEBMijp0TS73y9UFAgBkAUmIgCpEo
SIanTfn2DVAIQJsKWtgA8eOCwVPFAgIoARDhUYkqT/5NITPCNIsECX6IAHClUQVDYATS6JHiwWUA
yFnz0WNhEImBQnQcMDLjQYQXXUB8sYDGEcMFVJwiyDhwIEaFEoe2lMQ4IYcKRhbkYLnT5yLGgSs8
oEDhYQXGgAA7
 
------=_NextPart_ZROIIZO.ZCZYUACXV.ZARTUI--

Please note:

  • The boundary can be any string you want, as long as it cannot appear in the data. (Using a dot is a safe bet because base64 data can't contain dots.)
  • Take extra care with the dashes before and after the boundary marker: each part starts with two dashes followed by the boundary string, and the final marker also ends with two dashes.
  • Take extra care with empty lines. They are important: an empty line separates the headers of each part from its data.
  • You can use encodings other than base64 (such as quoted-printable), but base64 is a safe bet and works on all kinds of files.
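
The rules above can be sketched in a few lines of PHP. This is a minimal illustration with a single part; mimeBoundary() is a hypothetical helper, and the random boundary format is only one possible choice.

```php
<?php
// Hypothetical helper: builds a random boundary string.
function mimeBoundary(): string
{
    // The dot cannot occur in base64 output, so the boundary cannot collide with encoded data.
    return '----=_NextPart_' . strtoupper(bin2hex(random_bytes(6))) . '.' . strtoupper(bin2hex(random_bytes(6)));
}

$boundary = mimeBoundary();

// One part: headers, an empty line, then the base64 data wrapped at 76 characters.
$part = "Content-Location: file:///C:/mydocument.htm\n"
      . "Content-Transfer-Encoding: base64\n"
      . "Content-Type: text/html; charset=\"utf-8\"\n"
      . "\n"
      . chunk_split(base64_encode('<html><body>Hello</body></html>'), 76, "\n");

$mime = "MIME-Version: 1.0\n"
      . "Content-Type: multipart/related; boundary=\"$boundary\"\n"
      . "\n"
      . "--$boundary\n"       // each part starts with "--" followed by the boundary
      . $part
      . "\n"
      . "--$boundary--\n";    // the final marker also ends with "--"

echo $mime;
```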

This file can be renamed to .doc and opened in Word:

Amusing fact: mhtml (MIME) documents are usually smaller than their true .doc counterparts. For example, the previous mime_example.doc (which is a MIME/mhtml file) is 5216 bytes long. Resaved in true .doc format, it's 24064 bytes.

A basic MIME 1.0 class helper

Here is a basic MIME 1.0 class which will help you generate the mhtml files:

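class mime10class
{
    private $data;
    // Boundary between parts; the dots guarantee it can never appear in base64 data.
    const boundary='----=_NextPart_ERTUP.EFETZ.FTYIIBVZR.EYUUREZ';
    function __construct() { $this->data="MIME-Version: 1.0\nContent-Type: multipart/related; boundary=\"".self::boundary."\"\n\n"; }
    // Add one file (main document, header/footer, image...) as a base64-encoded part.
    public function addFile($filepath,$contenttype,$data)
    {
        // Convert Windows-style backslashes to slashes for the Content-Location URL.
        $this->data .= '--'.self::boundary."\nContent-Location: file:///C:/".str_replace('\\', '/', $filepath)."\nContent-Transfer-Encoding: base64\nContent-Type: ".$contenttype."\n\n";
        // Wrap the base64 data at 76 characters per line, as in the example above.
        $this->data .= chunk_split(base64_encode($data), 76, "\n")."\n";
    }
    // Return the complete document, closed by the final boundary marker.
    public function getFile() { return $this->data.'--'.self::boundary.'--'; }
}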

It's rather simple: Add files with addFile(), then get the final mhtml/MIME1.0 document with getFile().

Example:

header('Content-Type: application/msword');
header('Content-disposition: filename=mydocument.doc');
$doc = new mime10class();
$doc->addFile('mydocument.htm','text/html; charset="utf-8"','Hello, world !');
$doc->addFile('subdir\anotherfile.htm','text/html; charset="utf-8"','Hi there.');
echo $doc->getFile();
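
When many files are added, the $contenttype argument of addFile() can be derived from the file extension. The lookup below is a sketch: mhtmlContentType() is a hypothetical helper, and the table only covers the file types used on this page plus JPEG.

```php
<?php
// Hypothetical helper: picks a Content-Type for addFile() from the file extension.
function mhtmlContentType(string $filepath): string
{
    $types = [
        'htm'  => 'text/html; charset="utf-8"',
        'html' => 'text/html; charset="utf-8"',
        'png'  => 'image/png',
        'gif'  => 'image/gif',
        'jpg'  => 'image/jpeg',
        'jpeg' => 'image/jpeg',
    ];
    $extension = strtolower(pathinfo($filepath, PATHINFO_EXTENSION));

    // Unknown extensions fall back to a generic binary type.
    return $types[$extension] ?? 'application/octet-stream';
}

echo mhtmlContentType('mydocument_files/smiley.gif'), "\n";
```

With the class above, a call would then look like $doc->addFile($path, mhtmlContentType($path), file_get_contents($path));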

Sending the file to the client

The .mht file must be served to the client with the following HTTP headers:

Content-Type: application/msword
Content-disposition: attachment; filename=myfile.doc

Yes, we use ".doc" in order not to confuse the end user. Word will recognize that this is a .mht file and open it accordingly.

Note that:

Content-disposition: attachment; filename=myfile.doc

will force download, but:

Content-disposition: filename=myfile.doc

will allow the user to choose between "open" or "save".
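
Both variants can be produced by one small function. This is a sketch under assumptions: wordDownloadHeaders() is a hypothetical helper, the filename filtering is a precaution I added, and the Content-Length header is an optional extra that the article does not require.

```php
<?php
// Hypothetical helper: returns the HTTP header lines for serving the document.
function wordDownloadHeaders(string $filename, int $length, bool $forceDownload = true): array
{
    // Restrict the filename to a safe character set, since it ends up in an HTTP header.
    $safe = preg_replace('/[^A-Za-z0-9._-]/', '_', $filename);
    // "attachment" forces a download; without it the user may be offered "open" or "save".
    $disposition = $forceDownload ? 'attachment; ' : '';

    return [
        'Content-Type: application/msword',
        'Content-disposition: ' . $disposition . 'filename=' . $safe,
        'Content-Length: ' . $length,   // optional addition, not in the article
    ];
}

$document = "MIME-Version: 1.0\n";   // stand-in for the output of getFile()
foreach (wordDownloadHeaders('myfile.doc', strlen($document)) as $line) {
    if (!headers_sent()) {
        header($line);
    }
}
echo $document;
```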

Examples

Full, working HTML documents which load correctly in Microsoft Word, using many Word features: Styling, tables, headers & footers, page format, table-of-content, images…

FIXME

php code examples

FIXME

Performance

I have implemented this system in a professional environment, and we are able to generate a 70-page, database-driven, dynamically filled document with lots of tables and references in less than 5 seconds. (The document contains no images. The server runs Apache+php+Oracle. Document templates are written in Smarty.)

Performance is excellent, much better than I expected.

Ideas worth pondering

  • Using a markup language (Markdown ? Textile ?…) which generates HTML may ease the creation and maintenance of templates.
  • With php, the use of a templating engine like Smarty should help to easily create and maintain document templates without touching php code too much. Template inclusion can help to create - for example - standard headers for all documents.
  • Serving files using gzip compression may improve user experience (our mime_example.doc above goes from 5216 bytes to 2569 bytes with default gzip compression).
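
The gzip idea could look like the sketch below. It assumes PHP's zlib extension is available, and gzipIfAccepted() is a hypothetical helper. The check on the Accept-Encoding value is deliberately simplistic: it only looks for the word "gzip".

```php
<?php
// Hypothetical helper: compresses the body only when the client accepts gzip.
// Returns the (possibly compressed) body and the extra HTTP headers to send.
function gzipIfAccepted(string $body, string $acceptEncoding): array
{
    if (stripos($acceptEncoding, 'gzip') === false) {
        return [$body, []];
    }
    return [gzencode($body), ['Content-Encoding: gzip', 'Vary: Accept-Encoding']];
}

// Stand-in for a generated MIME document (base64 text compresses well).
$document = str_repeat("PGh0bWwgeG1sbnM6bz0ndXJuOnNjaGVtYXM=\n", 100);

[$body, $headers] = gzipIfAccepted($document, 'gzip, deflate');
printf("%d bytes before, %d bytes after gzip\n", strlen($document), strlen($body));
```

On a real server, the second argument would come from $_SERVER['HTTP_ACCEPT_ENCODING'], and the returned header lines would be sent with header() before the body. Many web servers can also apply this compression themselves, in which case nothing is needed in PHP.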

Discussion

Benoit, 2011/03/14 17:00

Isn't it possible to embed images this way:

<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAA (etc) ==" width="16" height="16" />

Sébastien SAUVAGE, 2011/03/14 20:08

mmm… I don't know if/how Word will handle these, but that's something to try!

TiTi, 2011/03/14 20:20

Hey there, You should check out TinyButStrong and its OpenTBS plugin

Basically this is : PHP + .docx Template → .docx

Sébastien SAUVAGE, 2011/03/14 21:50

Thank you for the links.

Vegetable, 2011/06/17 07:29

Great article man! Helps a lot ;-)

jpsm, 2011/07/01 02:48

Thank you =)

million, 2011/08/31 20:31

best advise and example for a problem i have been trying to solve for 6 months. thanks a million.

Sébastien SAUVAGE, 2011/09/05 13:23

:-) I'm happy to help.

Muahmmad Usman Shahid, 2011/12/21 09:31

Nice Post!can ou please tell me how to apply outline list style through this ot tell me the reference of that from where you get this!

Thanks.

Sébastien SAUVAGE, 2012/01/31 15:34

I haven't tested how Word handles list customization, but you can try this: http://www.netmechanic.com/news/vol3/css_no2.htm

Paul, 2012/04/20 12:41

In order to make landscape work you also have to reverse the page dimensions, ie:

size:21cm 29.7cm; /* A4 */

Becomes: size:29.7cm 21cm; /* A4 */

The mso-page-orientation is only to make sure a printer also recognises it is landscape/portrait.

Ionut, 2012/07/13 16:27

This is exactly what I need. thank you!!!!

Toto, 2012/09/13 06:44

Hi,

First of all thank you for this good tutorial. I have some trouble about downloading the doc file. I don't understand the part “Creating a MIME”. The mimefile is it the final document? Because when i open your file 'mime_example.doc' i see the encrypted file. I try to use your class but i still have the same problem, i see the encrypted file. My problem is i can't get the final document with headers/footer and image.

Thanx

Blessing, 2012/11/12 07:13

Hi, Thanks for the wonderful tutorial. Is there a way i can create mhtml files in Java.

Roman, 2013/06/14 13:04

I have a problem with headers. The image prints blurry. If I try to create a high resolution image 1950px wide (300 dpi) then Word for some reason displays it at 313 percent all zoomed in. I can right click on it and set it back to 100%, but since it's on all the pages, that's a lot of work to do for dynamicly generated Word documents. Do you know how to display high resolution images in the header?

Joseph, 2013/07/03 04:50

Thank you! Love the spirit!

Joseph, 2013/07/04 20:14

I'm happy to contribute something…

**change the page-color and add a watermark**

1) add reference to <a:clrMap xmlns:a='http://schemas.openxmlformats.org/drawingml/2006/main' />. Add this reference best immediately after: <link rel=File-List href='filelist.xml'>

2) in the <w:WordDocument></w:WordDocument> section add: <w:DisplayBackgroundShape/>. Thus: <w:WordDocument><w:DisplayBackgroundShape/></w:WordDocument>

3) customize your body style; ie: body { background-color:'yellow'; background-image:url('image.png') }

Jamie, 2013/08/08 07:04

Thanks for the tutorial. Very helpful! I'm trying to generate word documents in both portrait and landscape with different headers and footers. I created the separate header/footer files but when I try to open the word document, it keeps giving me the message “Some of the files in this Web page aren't in the expected location. Do you want to download them anyway? If you're sure the Web page is from a trusted source, click Yes”. When I click Yes, the headers and footers appear. I have a script that converts the generated Word doc into a pdf file and all the headers and footers disappear after the pdf file hase been created.

Ngo Chien, 2013/08/27 10:59

I open mime_example.doc in Microsoft Office (Mac OS X), it not show header/footer and image. How to create mime file for Microsoft Office (Mac OS X) ?

juice, 2013/11/04 21:26

BIG THANKS TO YOU !

mojtaba, 2014/07/13 11:22

hi I want to set the summary information of ms word(doc , docx) file using php… please help me… thanks a lot

David Lopez, 2014/09/02 04:28

Excelent article, and the best part was the section of “Possible Solutions” at the beginning. It coincides with mine, which I could add a couple of relevant points; there's much to say also, but focusing in the relevant:

  • “Use the COM control to pilot Word on the server.”: it is something I've done myself, and it is not that “unstable” as it might say there, I've done it myself several times even since 2006/2007, controlling an Office application totally from outside. It works, BUT the problem is that you need a Windows machine to do it, if you try to automate a Linux server with that solution, you'll see yourself messing around with lack of libraries, even if you try with pywin32.
  • HTML generation: it has the big advantage of suggesting ONE SOLUTION TO MANY PROBLEMS. Indeed, the html generation implies that you already have an html page for showing the problems. If the layout changes a little bit for word, you could change the css for the case when you want to print to a browser and another stylesheet when you want to print to word. AND, as you can run wkhtmltopdf as a static binary in a linux shared hosting, you can also upload that binary, and convert the very same html to pdf, and hence the same html could work to generate a pdf file. So, with out losing your time with re-writing the same page for each format, you put all your effort in making good converters and solving each problem ONCE. For that purpose, this article is simply great.

Many thanks, brother, perhaps we can exchange ideas later.

David López

Bogotá, Colombia, South America

Investigación y Programación Sas

heavis, 2014/10/24 10:23

hi! i want to specify first page footer and header. can you help?

RAM, 2014/12/15 13:24

I am amazed at all the research you did… helped me out a lot. 1 stop place for me to understand PHP to Word!

Thanks! AWESOME!

PL Lamballais, 2015/03/04 01:43

Great! Just one point: I use Mac and at the office we have 3 machines. On mine, per defaut Word reads my HTML file as A4, and on other Mac, on US Letter. We don't know why… So I tried to add a @page at top of the file, and notice this change… nothing. In fact, in order to get the whole document as A4, I just need to add a “div” enclosing all the data, and it's work. Hope this can help.

abe, 2015/04/14 13:22

hi.

i have folowed instructions for creation header and footer in the same page but they are not showing in the page

Some of the files in this Web page aren't in the expected

location. Do you want to download them anyway?

anay help?

Geoffrey, 2015/06/18 09:31

It's seems word does support the Content-ID: <image.png> and cid:image.png reference.

So you can do this to include the image in word:

Content-ID: <image.png>
Content-Type: image/png
Content-Transfer-Encoding: base64

(base64 content here)

and in your html: <img src="cid:image.png" />

David, 2015/08/11 22:29

Great article,

Any idea of how to do the same without PHP? I mean creating a HTML with word format, modifying its code and text with Javascript and then save it as .doc?

The problem is when you change anything with javascript, It does not appear on the html or doc web saved.

e.g. document.getElementById('mydiv').innerHTML='my sentence'

Thanks,

Paul, 2015/09/09 14:21

Just when I was hunting for a good way to create Word from PHP I came across your article. Extremely helpful, having looked at other possible solutions. Only thing is that the links to examples and PHP code examples are disabled. Is there any way to get these downloads, particularly PHP? I am not sure I can get this to work without these. Thanks indeed!

Vincent, 2016/02/02 08:10

Hi and thanks a lot for this excellent tutorial.

In your example the header and footer are above and below the body text. In some word documents the body margin can be set so the text overlap header and footer. Any idea how to implement this in your example ?

Vincent, 2016/02/21 14:35

I'm happy to contribute too with the following :

**** Different first page footer/header and text overlapping header ****

In order to have a different header/footer on the first page, you have to put the html in the headerfooter.html file as for the normal header and footer, but name them fh1 and ff1. Then in your main html, in the styling part add the following :

mso-title-page: yes; mso-first-header: url("mydocument_files/headerfooter.htm") fh1; mso-first-footer: url("mydocument_files/headerfooter.htm") ff1;

That's it^^

Then if you want that your main content text overlap the header (very usefull if you want to write a letter recipient name and address for example), proceed as follow:

Replace the @page Section1 with:

@page Section1 {margin: -1cm 1cm 1cm 1cm;}

And add : p.MsoHeader, li.MsoHeader, div.MsoHeader {margin:0in; margin-bottom:-50pt;}

Federico, 2016/12/22 09:03

Hi and thanks for this helpfull tutorial!

I've a problem with page break, it doesn't work using word 2016:

text 1

<br clear=all style='mso-special-character:line-break;page-break-before:always'>

text 2

I'd try also to wrap text in a div or a p tag but nothing change

Federico, 2016/12/22 09:17

I forgot to specify that I'm trying to copy past from Chrome

Nick Nicholas, 2018/01/19 01:56

I've found (on Mac Word) that to embed through MIME images inserted that might otherwise overflow the page size, I have to insert appropriate width and height attributes; Word crashes otherwise. If the images are just linked and not embedded in MIME, Word renders them (though then of course it has to ask you permission to use the images on opening).

Nicholas, 2018/02/15 03:14

https://github.com/riboseinc/html2doc is a Ruby gem based on what you've done here.

Harris Gurung, 2018/04/09 09:14

Footer is appended as text inside document while adding footer. I mean there is double footer text one is at the footer which is ok but another is appended in body in doc. Can anyone help out ??

Nick Nicholas, 2018/05/20 08:13

https://github.com/riboseinc/html2doc/wiki/Why-not-docx%3F

As an update to our gem at https://github.com/riboseinc/html2doc, arguing for why we are persevering with the approach sebsauvage has presented here:

  • It generates DOC rather than DOCX, which means eventually Word will stop reading it (maybe in a decade?)
  • But we don't want to learn OOXML either
  • The approach taken in https://github.com/evidenceprime/html-docx-js lets you import an HTML document into DOCX as an MHT blob. If your documents are relatively simple, this approach will work, is more future-proof (because it generates DOCX instead of DOC), and it does support *some* of Word HTML.
  • But while this approach supports HTML 5.0 and CSS 3.0, it does *not* support all of Word HTML, and it does not support enough of Word HTML for us to use. (In particular, it does not seem to support mathematical formatting, paragraph spacing, and it only seems to partly support footnotes.)

Ghazi , 2018/11/20 06:50

Hi, Thanks for the excellent article !!!

Is there any way to fill MS word template through Jquery/javascript. I wanna populate them (i mean word templates with merge fields) .

Please help me

Regards

Ghazi

Sébastien SAUVAGE, 2019/05/02 20:30

Hello.

Discussion is now closed on this page.

word_document_generation.txt · Last modified: 2019/05/02 20:30 by sebsauvage