Cookbook
Tips to convert HTML to Word
The embedHTML method and its counterpart for templates, replaceVariableByHTML, convert HTML with CSS to Word while preserving its contents and styles as closely as possible. Following a few good practices helps achieve the closest match with the original HTML and avoid errors.
Supported tags and styles
phpdocx supports nearly all the HTML tags and CSS styles that have an equivalent in MS Word.
On our website you can find the HTML to DOCX documentation and the complete list of compatible tags and styles.
When working with HTML 5 tags (such as 'section' or 'main') and an old version of Tidy, some styles may not be applied to these tags. Upgrading to the latest release of Tidy may be needed for the styles to be set correctly.
Besides these HTML tags and CSS styles, when importing HTML you can also assign existing Word styles to classes, ids or specific tags with the wordStyles option.
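As a rough illustration, the wordStyles option might be passed as sketched below. The mapping format (keys such as `#id`, `.class` and `<tag>`), the Word style names, the output file name and the un-namespaced `CreateDocx` class name are all assumptions to check against the phpdocx documentation for your version. The call is guarded so the snippet does nothing when the library is not loaded.

```php
<?php
// HTML whose id, class and tag should pick up styles that already exist in Word.
$html = '<h1 id="title">Report</h1>'
      . '<p class="note">A paragraph styled through a Word style.</p>'
      . '<table><tr><td>Cell</td></tr></table>';

// Assumed mapping format: HTML selector => name of an existing Word style.
// The style names are placeholders, not styles guaranteed to exist in your document.
$options = array(
    'wordStyles' => array(
        '#title'  => 'Heading1',
        '.note'   => 'IntenseQuote',
        '<table>' => 'TableGrid',
    ),
);

// Guarded call: runs only when phpdocx has been loaded (class name assumed).
if (class_exists('CreateDocx')) {
    $docx = new CreateDocx();
    $docx->embedHTML($html, $options);
    $docx->createDocx('word_styles_example');
}
```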
The HTML Extended and CSS Extended features make it possible to use custom HTML tags and CSS styles to invoke the library methods, and thus add contents and styles not available in standard HTML. Thanks to this functionality, HTML can be used to insert headers, footers, comments, TOCs, page numbers, WordFragments and many other contents and styles.
Tidy, incorrect tagging, accents and other non-ASCII characters
For HTML to be imported properly, its tags and styles must be correctly opened and closed; in other words, the code must be well formed. phpdocx uses the PHP Tidy extension (http://php.net/manual/en/book.tidy.php) to repair the HTML and generate valid tagging. This extension can be installed on any operating system that runs PHP.
To import HTML with accents, we also recommend installing the PHP mbstring extension so that the encoding can be detected automatically.
Warning
If the Tidy extension is not installed, errors may occur: CSS styles may appear as text in the document, the HTML may be imported incorrectly, or accents and other non-ASCII characters may not be displayed.
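Before importing HTML, it can help to confirm that both extensions are actually available. The sketch below uses only standard PHP functions (`extension_loaded`, `tidy_get_release`); the helper name `checkHtmlImportExtensions` is hypothetical and not part of phpdocx.

```php
<?php
// Report whether the PHP extensions recommended for HTML import are available.
// Hypothetical helper, not part of phpdocx.
function checkHtmlImportExtensions(): array
{
    $report = array(
        'tidy'        => extension_loaded('tidy'),
        'mbstring'    => extension_loaded('mbstring'),
        'tidyRelease' => null,
    );

    // The release string helps spot an old Tidy build that may mishandle HTML 5 tags.
    if ($report['tidy'] && function_exists('tidy_get_release')) {
        $report['tidyRelease'] = tidy_get_release();
    }

    return $report;
}

// Print one line per entry, e.g. the extension name followed by true or false.
foreach (checkHtmlImportExtensions() as $name => $value) {
    echo $name . ': ' . var_export($value, true) . PHP_EOL;
}
```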
Transforming HTML without Tidy
Although using PHP Tidy is highly recommended when using embedHTML and replaceVariableByHTML, these methods include the forceNotTidy option to ignore the Exception thrown by phpdocx if PHP Tidy is not available. This option is only recommended if PHP Tidy can't be installed.
When this option is enabled, adding special characters such as accented letters or symbols requires working with HTML entities and UTF-8. This can be done in any of three ways: by setting the UTF-8 encoding in the HTML string, by converting the text automatically with htmlentities, or by writing the entities manually in the HTML string to be transformed.
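A minimal sketch of these three approaches follows. The sample text, the variable names and the output file name are illustrative, and the un-namespaced `CreateDocx` class name is an assumption; the phpdocx call is guarded so the snippet runs even when the library is not loaded. Note that htmlentities is applied to the text only, not to the tags, since it would escape the angle brackets as well.

```php
<?php
// Sample text with accented characters (this file is assumed to be saved as UTF-8).
$text = 'Café olé, señor';

// 1. Declare the UTF-8 encoding inside the HTML string itself.
$htmlWithCharset = '<html><head><meta charset="UTF-8"></head>'
                 . '<body><p>' . $text . '</p></body></html>';

// 2. Convert the text automatically with htmlentities (text only, not the tags).
$htmlWithEntities = '<p>' . htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8') . '</p>';

// 3. Write the entities by hand in the HTML string.
$htmlManual = '<p>Caf&eacute; ol&eacute;, se&ntilde;or</p>';

// Guarded call: runs only when phpdocx has been loaded (class name assumed).
if (class_exists('CreateDocx')) {
    $docx = new CreateDocx();
    $docx->embedHTML($htmlWithEntities, array('forceNotTidy' => true));
    $docx->createDocx('accents_without_tidy');
}
```

For this sample text, the second and third approaches produce the same HTML string.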
Defining widths in tables
To assign widths to table columns correctly, define the width of the table as well as the widths of its cells. You can use either percentage values or fixed widths, the latter being the recommended choice. The two cannot be combined, e.g. a 10% width for one column and 400px for another.
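One way to keep table and cell widths consistent is to generate them together. The helper below is a hypothetical sketch, not part of phpdocx: it uses fixed pixel widths only and sets the table width to the sum of the column widths, which is a design choice of this example rather than a documented requirement.

```php
<?php
// Hypothetical helper: builds a one-row HTML table with fixed pixel widths.
// The table width is the sum of the column widths, so both declarations agree.
function buildFixedWidthRow(array $widthsPx, array $texts): string
{
    $cells = '';
    foreach ($widthsPx as $i => $width) {
        // Escape the cell text so it cannot break the markup.
        $text = htmlspecialchars($texts[$i] ?? '', ENT_QUOTES, 'UTF-8');
        $cells .= '<td style="width: ' . (int) $width . 'px;">' . $text . '</td>';
    }

    return '<table style="width: ' . (int) array_sum($widthsPx) . 'px;">'
         . '<tr>' . $cells . '</tr></table>';
}

// Two columns of 200px and 400px inside a 600px table.
$html = buildFixedWidthRow(array(200, 400), array('Name', 'Description'));
echo $html . PHP_EOL;
```

The resulting string can then be passed to embedHTML like any other HTML fragment.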
Divide and Optimize
Although the import of HTML and CSS is optimized to the maximum, transforming thousands of lines with different tagging and styles may affect performance.
The way to get the best possible performance is to divide the code you are importing. For example, instead of adding a 10000-line HTML file with a single embedHTML call, you could divide it into five HTML files and call embedHTML once for each.
This simple step can significantly reduce CPU and memory consumption.
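The splitting step could be sketched as follows, assuming the HTML is already available as a list of complete top-level blocks so that no tag is ever cut in half. The helper name `chunkHtmlBlocks`, the chunk size and the output file name are illustrative, and the un-namespaced `CreateDocx` class name is an assumption; the phpdocx calls are guarded so the snippet runs without the library.

```php
<?php
// Hypothetical helper: groups complete HTML blocks into larger chunks.
// Each chunk is a self-contained HTML string that can be imported on its own.
function chunkHtmlBlocks(array $blocks, int $blocksPerChunk): array
{
    $chunks = array();
    foreach (array_chunk($blocks, max(1, $blocksPerChunk)) as $group) {
        $chunks[] = implode('', $group);
    }
    return $chunks;
}

// Illustrative input: ten paragraphs kept as separate strings.
$blocks = array();
for ($i = 1; $i <= 10; $i++) {
    $blocks[] = '<p>Paragraph ' . $i . '</p>';
}

// Ten blocks in groups of four give three chunks (4 + 4 + 2 blocks).
$chunks = chunkHtmlBlocks($blocks, 4);

// Guarded call: runs only when phpdocx has been loaded (class name assumed).
if (class_exists('CreateDocx')) {
    $docx = new CreateDocx();
    foreach ($chunks as $chunk) {
        $docx->embedHTML($chunk); // one embedHTML call per chunk
    }
    $docx->createDocx('chunked_import');
}
```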
Extra blank spaces added to the beginning of paragraphs
HTML to DOCX methods use PHP Tidy to repair HTML contents automatically. Some versions of PHP Tidy don't work correctly when the wrap value is 0 (no wrapping) and add extra line breaks to the HTML, so a blank space may appear at the beginning of paragraphs.
phpdocx 10 and newer releases use a very high wrap value (9999999999) to work around this bug in specific PHP Tidy versions. The disableWrapValue option can be used to skip the wrap value set by phpdocx and use the value from the PHP Tidy config file instead.
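Passing the option might look like the sketch below. That disableWrapValue takes a boolean is an assumption, as are the un-namespaced `CreateDocx` class name and the output file name; the call is guarded so the snippet does nothing when phpdocx is not loaded.

```php
<?php
$html = '<p>First paragraph.</p><p>Second paragraph.</p>';

// Assumed to be a boolean flag: when true, phpdocx does not apply its own
// wrap value and PHP Tidy falls back to the value in its config file.
$options = array('disableWrapValue' => true);

// Guarded call: runs only when phpdocx has been loaded (class name assumed).
if (class_exists('CreateDocx')) {
    $docx = new CreateDocx();
    $docx->embedHTML($html, $options);
    $docx->createDocx('disable_wrap_value');
}
```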