Flying Saucer PDF Generator and Unicode

Written by Geoff Mottram (geoff at minaret dot biz).

Placed in the public domain on January 30, 2012 by the author.

Last updated: February 16, 2012.

This document and all associated software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with this document and associated software or the use or other dealings in same.

Contents
Introduction
Repositories
Arial Version of Flying Saucer
Features of this Release
Software Bundles
Quick Start Guide to Generating a PDF File with Flying Saucer
Tips on using Flying Saucer

Introduction
Flying Saucer is a remarkable open-source Java project for converting XHTML files that contain CSS style-sheet information into PDF files. Flying Saucer relies on an equally impressive Java project called iText, which does the actual PDF generation via a programming interface. The beauty of Flying Saucer is how easy it is to generate PDF files from a format that most people are familiar with: HTML.

This document has two purposes: to provide some tips and helpful hints that will get you up and running faster with Flying Saucer, particularly if you are interested in Unicode and non-Latin characters; and to make available both source and binary bundles of a variation of Flying Saucer that contains some new features and fixes.

Repositories
Other than the material provided on this site, the current versions of Flying Saucer and iText can be found here:

Flying Saucer Home Page
iText Home Page
A note about iText: Since version 5, the license has changed and you must pay to use this software in commercial projects unless you also make the source code of your own work available. The license even applies to the PDF files that are generated by iText, so if you use iText to generate PDF files in a commercial setting, you must pay or release your source code.

The last version of iText that does not have this restriction is 2.1.7 which uses Mozilla Public License Version 1.1. You can still find source and binary releases of 2.1.7 out on the Internet. An iText 2.1.7 binary jar file is included in the binary Flying Saucer files on this site. An iText source code bundle and a binary bundle are provided here as well.

A note on the code available from this site:

Arial Version of Flying Saucer
This variant of Flying Saucer is named after the Arial Unicode MS font which was the reason for many of the enhancements to Flying Saucer. This font is incredibly useful if you are doing any work with multiple character sets as it reportedly implements the entire Unicode 2 character repertoire. What it does not have is premade bold and italic variations of the font. The Arial version of Flying Saucer will generate the appropriate PDF commands to emulate bold and italics when a style calls for it but no matching font can be found.

This release (R8-Arial) is based on the Flying Saucer master branch from January 6, 2012. The changes documented here have been uploaded to the Flying Saucer project on GitHub.

WARNING: Be advised that Arial Unicode MS is the copyrighted intellectual property of Microsoft Corporation. It comes bundled with Microsoft Office and portions of it must be embedded in any PDF files you generate with this font for it to display correctly if the computer viewing the PDF file does not have Arial Unicode MS. If you do embed the font in your PDF files, please seek the appropriate legal advice on your use of and, more importantly, your redistribution of any PDF documents generated with any font that has restrictions on its use. If you don't embed the font, you can avoid this problem altogether but the document will not display properly if the font is not installed on the target computer.

Features of this Release
This is the Arial variation of Flying Saucer master branch, release 8. The following enhancements are included:

  1. Added font emulation for bold and italics variations when there is no direct support in the font files themselves. Fonts like Microsoft's Arial Unicode MS only come in one version: plain text. In order to have bold, italics and bold+italics the font must be modified on-the-fly by the PDF display software.

    When an XHTML style calls for a bold or italics variant of a font and the currently selected font does not have built-in support for these variations, Flying Saucer will now output the necessary PDF instructions to emulate these effects.

  2. Added support in ITextRenderer for setting the PDF document properties author, keywords, subject and title by using metadata elements in the XHTML document's <head> section. They are used like this:
         <meta name="author" content="Your name here" />
         <meta name="keywords" content="Keyword1, Keyword2, This is one phrase, This is another" />
         <meta name="subject" content="The subject of this document" />
         <meta name="title" content="The document title" />
    
    In the case of the PDF document title property, it will be automatically set using the contents of the <title> element in the <head> section of the XHTML source document. However, a metadata title name/content pair will override any <title> element found in the head.

    Examples:
         <title>A title in the head section of the document</title>
         <meta name="title" content="This title overrides the title element" />
    
  3. Metadata from an XHTML document is stored in the ITextOutputDevice object. All meta name/content pairs found in the <head> of the XHTML document (not just author, keywords, subject and title) are extracted after the document is loaded and can be accessed and modified through the ITextOutputDevice methods below (the metadata name is not case sensitive, so title and TITLE are equivalent):
              public void addMetadata(String name, String value)
              public String getMetadataByName(String name)
              public ArrayList getMetadataListByName(String name)
    
  4. Added a new event to the PDFCreationListener interface called preWrite() that is fired by ITextRenderer after the PDF content has been generated but immediately before the PDF file is written out. The preWrite() method is passed the number of pages in the PDF file. This is the last opportunity to modify the metadata associated with the ITextOutputDevice that will be set in the PDF file's basic properties.
         void preWrite(ITextRenderer iTextRenderer, int pageCount);
    
  5. Fix. There was a rendering problem when specifying a line-height property that is smaller than the maximum character height of a font. This is not unusual when using a font with broad Unicode support like Arial Unicode MS. It has a maximum character height that is much larger than most Latin-based character sets. The default vertical spacing for a font like this will leave more white space between the lines than you may want. The solution is to use the line-height property.

    When you used a line height that was smaller than a font's maximum glyph height, Flying Saucer was generating phantom text that bled from the first line of a page onto the bottom of the page immediately preceding it. While the phantom text was not visible in the PDF document, your cursor could land on it and you could find it when running a search in Adobe Reader (it displayed as a blue rectangle where the text was hiding).

    The solution used here is to disable the code in InlineBoxing.java that centers lines of text vertically within the currently defined line height (search for any references to halfLeading in InlineBoxing.java). There may be a more elegant solution but this works for the time being. If a line of text has extra vertical space, that space will always fall below the text.

  6. Fix. Per the CSS specification for paged media, the Flying Saucer code has been changed so that the first page is a right page, which means that all odd page numbers are right pages and all even page numbers are left pages.
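
The metadata rules described in items 2 and 3 can be modeled with the standard Java XML parser alone. The following is only an illustrative sketch, not the code used inside Flying Saucer: the class name HeadMetadataSketch and its collect() method are hypothetical, it assumes Java 7 or later, and it keeps a single value per name (the real ITextOutputDevice can return a list through getMetadataListByName()). It shows names being treated case-insensitively and a meta title overriding the <title> element.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: models the metadata rules, not Flying Saucer's own code.
final class HeadMetadataSketch {

    /** Collects meta name/content pairs, keyed by lower-cased name. */
    static Map<String, String> collect(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }

        Map<String, String> metadata = new LinkedHashMap<>();

        // The <title> element supplies the default title.
        NodeList titles = doc.getElementsByTagName("title");
        if (titles.getLength() > 0) {
            metadata.put("title", titles.item(0).getTextContent().trim());
        }

        // Meta elements are applied afterwards, so a meta "title" replaces the default.
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            if (meta.hasAttribute("name") && meta.hasAttribute("content")) {
                String key = meta.getAttribute("name").toLowerCase(Locale.ROOT);
                metadata.put(key, meta.getAttribute("content"));
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        String xhtml = "<html><head>"
                + "<title>A title in the head section of the document</title>"
                + "<meta name=\"TITLE\" content=\"This title overrides the title element\" />"
                + "</head><body></body></html>";
        System.out.println(collect(xhtml));
    }
}
```

Lower-casing the name on insertion is one simple way to get case-insensitive lookups; callers of such a helper would need to lower-case their lookup keys as well.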

Software Bundles
The Flying Saucer code available here is the Arial branch of Flying Saucer Release 8, compiled with Java 6. If you just want to run the binary version, all you need to download is the first bundle. It contains iText version 2.1.7 for Java 6.

The iText source and binary code available here is version 2.1.7 with no modifications and has been compiled with Java 6 (a.k.a. 1.6). The binary supports PDF file encryption and includes the following jar files from Bouncy Castle:

bcmail-jdk16-146.jar
bcprov-jdk16-146.jar
bctsp-jdk16-146.jar

Note: If you live in a country that does not permit you to possess cryptography tools, don't download the binary version of iText available here. The Flying Saucer binary does not include these encryption libraries.

To download a bundle, right click on one of the following links and choose Save As or Save Link As. Also download its related sha1sum (in the *.asc.txt file) or md5sum (in the *.md5.txt file). To verify all of your bundles, run either sha1sum -c *.asc.txt or md5sum -c *.md5.txt after you have finished with your downloads.

Quick Start Guide to Generating a PDF File with Flying Saucer

  1. Create a directory for Flying Saucer (e.g. /usr/local/flyingsaucer).

  2. Download a Flying Saucer binary bundle (e.g. flyingsaucer-R8_Arial.zip) into that directory.

  3. Optionally download a checksum and verify your bundle.

  4. Unzip your files.

  5. Create a shell script called fs that contains the following:
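    #!/bin/sh
    # Jar files needed at run time; adjust the paths to match your installation.
    CLASSPATH="/path/to/jar/files/core-renderer.jar:/path/to/jar/files/iText-2.1.7.jar"
    
    # Pass the class path explicitly and forward all arguments unchanged.
    java -cp "$CLASSPATH" org.xhtmlrenderer.simple.PDFRenderer "$@"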
    
  6. Run the script as follows:
    fs url pdf [version]
    
    where: url is the file name or URL of an XHTML source document.
           pdf is the file name of the PDF to create.
           version is an optional PDF version number between 1.2 and 1.7 (default is 1.2)
    

Tips on using Flying Saucer

  1. When producing your XHTML file, use UTF-8 encoding.

  2. To make the encoding explicit to the XML parser, the first line of your XHTML file should be:
    <?xml version="1.0" encoding="UTF-8"?>
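
    If the XHTML is produced by a Java program, tips 1 and 2 amount to naming the charset explicitly instead of relying on the platform default. The following is a minimal sketch under that assumption: the class Utf8Xhtml, its methods and the output file name sample.xhtml are hypothetical, and StandardCharsets requires Java 7 or later.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper that builds a small XHTML document and encodes it as UTF-8.
final class Utf8Xhtml {

    // First line of the file, so the XML parser knows the encoding.
    static final String XML_DECLARATION = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";

    /** Wraps the given body markup in a minimal XHTML document. */
    static String document(String bodyMarkup) {
        return XML_DECLARATION + "\n"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body>" + bodyMarkup + "</body></html>\n";
    }

    /** Encodes the document with an explicit charset, never the platform default. */
    static byte[] encode(String bodyMarkup) {
        return document(bodyMarkup).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Greek and Chinese sample text, written with escapes to keep this source ASCII-only.
        byte[] bytes = encode("<p>\u03A9 \u4E2D\u6587</p>");
        Files.write(Paths.get("sample.xhtml"), bytes);
    }
}
```

    Keeping the declared encoding and the actual byte encoding in one place makes it harder for the two to drift apart.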
    
  3. In the style section of your XHTML document (or in a separate style sheet), register any special fonts you will be using and indicate where the font files are located. For example, to register Arial Unicode MS:
    @font-face {  
      src: url(file:///Absolute/path/to/font/directory/ARIALUNI.TTF);  
      -fs-pdf-font-embed: embed;  
      -fs-pdf-font-encoding: Identity-H;
    }  
    
    Note the two Flying Saucer extensions to the @font-face style directive. The -fs-pdf-font-embed: embed; line directs the PDF generator to embed any portions of the font that your document needs to display properly. Without this line, your document will display properly only if the recipient has this font. By not embedding the font, you will avoid any potential copyright and licensing issues with the owner of this font but the document might not look as good at the other end. By embedding this font, you are ensuring that the document is self-contained.

    The -fs-pdf-font-encoding: Identity-H; line instructs the PDF display software that this is a Unicode font and not a font that is restricted to certain code pages. This line is very important.

  4. When using Arial Unicode MS, you may notice that there is too much whitespace between lines of text. This is because the line spacing used by Flying Saucer (and Microsoft Word) is based on the tallest characters in the entire font. Most Latin-based languages have characters that are much shorter than that. Use the line-height property to reduce the amount of whitespace between each of your lines of text. For example:
    body {
        font-family: Arial Unicode MS, Lucida Sans Unicode, Arial, sans-serif;
        font-size: 6.8pt;
        line-height: 1.17;
    }
    
    The line-height property is a value that is multiplied by the font-size to set the vertical line height. In this example, the line height will be 7.956 points (6.8 x 1.17). It may seem counter-intuitive that a line height that is 17% larger than the font size should be smaller than the default line height but that is because, at least for this font, a 6.8 point size has a maximum character size of about 9.2 points (35% larger than the nominal font size). FYI, Arial Unicode MS has the following font metrics:
         unitsPerEm = 2048
         xMax = 4629
         xMin = -2071
         yMax = 2200
         yMin = -572
    

    The maximum height of a character is 2200 (ascender) plus 572 (descender) for a total of 2772 font units or about 1.35 em units (2772 / 2048). Multiply this number by the font size in points to get your maximum glyph size in points.

    Notice that your font size does not have to be a whole number (e.g. 6.8 points).
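
    The calculation from the font metrics listed above can be written out as a short program. This is a sketch only: the class GlyphHeight and its method names are hypothetical, and it takes the maximum glyph height to be (yMax - yMin) / unitsPerEm, as described in this tip.

```java
// Hypothetical helper for the font metric arithmetic described in this tip.
final class GlyphHeight {

    /** Maximum glyph height in em units: (yMax - yMin) / unitsPerEm. */
    static double maxGlyphHeightEm(int unitsPerEm, int yMax, int yMin) {
        return (yMax - yMin) / (double) unitsPerEm;
    }

    /** Maximum glyph height in points for a given font size. */
    static double maxGlyphHeightPt(double fontSizePt, int unitsPerEm, int yMax, int yMin) {
        return fontSizePt * maxGlyphHeightEm(unitsPerEm, yMax, yMin);
    }

    /** Line height in points produced by a unitless CSS line-height multiplier. */
    static double lineHeightPt(double fontSizePt, double lineHeightMultiplier) {
        return fontSizePt * lineHeightMultiplier;
    }

    public static void main(String[] args) {
        // Metrics for Arial Unicode MS as listed in this document.
        int unitsPerEm = 2048;
        int yMax = 2200;
        int yMin = -572;
        double fontSizePt = 6.8;

        System.out.printf("max glyph height: %.3f em, %.3f pt%n",
                maxGlyphHeightEm(unitsPerEm, yMax, yMin),
                maxGlyphHeightPt(fontSizePt, unitsPerEm, yMax, yMin));
        System.out.printf("line height at 1.17: %.3f pt%n",
                lineHeightPt(fontSizePt, 1.17));
    }
}
```

    Comparing the two printed point values shows how much tighter a chosen line-height is than the font's full glyph height.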

  5. If you use tables to control text indentation and justification and you don't want the table to introduce any additional whitespace into your document, you can define the following styles (some of these may be unnecessary, but the combination works):
    table {
        -fs-border-spacing-horizontal: 0;
        -fs-border-spacing-vertical: 0;
        border-spacing: 0;
        border-style: none;
        border-width: 0;
        border: 0;
        padding: 0;
        margin: 0;
    }
    td {
        border-style: none;
        border-width: 0;
        border: 0;
        margin: 0;
        padding: 0;
    }
    
    The Flying Saucer style extensions (-fs-border-spacing-horizontal: 0; and -fs-border-spacing-vertical: 0;) are really critical to making this work.

  6. Flying Saucer supports the @page CSS rule to control paper size, margins, headers and footers. In addition to named pages (e.g. @page rotated) there are also three pseudo-class selectors (@page:first, @page:left and @page:right) that control certain types of pages. Per the CSS specification for paged media, the Arial version of Flying Saucer defines the first page as a right page, which means that all odd page numbers are right pages and all even page numbers are left pages. In the master version of Flying Saucer release 8, it is the other way around.

    Here is an example of how you can define two different page layouts for printing a book in which the side of the page closest to the binding (the inside of the page) has a larger margin (72 points) than the outside (60 points). There is also a footer centered at the bottom of every page with the letter A, a dash and the current page number.
    /* Odd page numbers */
    @page:right {
        size: 9.25in 11.25in;
        margin: 32pt 0 40pt 72pt;
        padding: 0;
        @bottom-center {
            content: "A-" counter(page);
            font-family: Arial Unicode MS, Lucida Sans Unicode, Arial, sans-serif;
            font-size: 6.8pt;
        }
    }
    /* Even page numbers */
    @page:left {
        size: 9.25in 11.25in;
        margin: 32pt 0 40pt 60pt;
        padding: 0;
        @bottom-center {
            content: "A-" counter(page);
            font-family: Arial Unicode MS, Lucida Sans Unicode, Arial, sans-serif;
            font-size: 6.8pt;
        }
    }
    
  7. Here is an example of how the PDFRenderer class (the same one used in the Quick Start Guide, above) can be customized to utilize the new preWrite() method of the PDFCreationListener interface:
    Min2pdf.java

    This particular example was used in conjunction with an XHTML generating application that performed its own line wrapping and page breaks. However, since Flying Saucer will wrap lines that are too long and insert page breaks when a page is too long, it was important to know whether the number of pages generated by the source application matched the final PDF page count.

    The two applications (XHTML generator and Min2pdf converter) communicated by adding a page count in an XML comment at the end of the XHTML file. For a 10-page document, the last line would look like this:

         <!-- 10 source pages -->
    
    The preWrite() method in Min2PDF.java accesses the parsed XHTML document tree to locate the last child of the document. If this node is a comment, the text within the comment is parsed for the page count. Based on whether this page count matches the number of pages in the final PDF document (which is passed as an argument to preWrite()), a text line is generated and saved as a subject metadata item in this XHTML document, which will be used as the subject property of the generated PDF file. The Min2PDF class also sets the exit value for this process to indicate to the caller whether the conversion was a success (0), a rendering error occurred (1) or if there was a page count error (2).

    Note the commented-out line at the start of preWrite():

         //String s = od.getMetadataByName("subject");   // existing subject line
    
    This is another way that information can be passed from the XHTML generator to the PDF generator: using meta tags in the head section of the XHTML document. While only author, keywords, subject and title are ultimately stored in the PDF file (as document attributes), other meta name/content pairs can be used for inter-application communication.

    The metadata technique only works for information that is already known at the very beginning of the XHTML document. The final page count is only known after all pages have been output, so an XML comment at the end of the document is easier to implement.
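
    The page count check described in this tip can be sketched with the standard Java XML parser alone. This is not the Min2PDF source: the class PageCountCheck and its methods are hypothetical, it assumes Java 7 or later, and treating a missing comment as a page count error is an assumption made here. In a real listener, the PDF page count would be the pageCount argument passed to preWrite().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper: reads a trailing "<!-- N source pages -->" comment.
final class PageCountCheck {

    private static final Pattern SOURCE_PAGES =
            Pattern.compile("^\\s*(\\d{1,9})\\s+source pages\\s*$");

    /** Parses XHTML text into a DOM tree (no DOCTYPE expected in this sketch). */
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    /** Returns the page count from the document's last child, or -1 if it is absent. */
    static int sourcePageCount(Document doc) {
        Node last = doc.getLastChild();
        if (last == null || last.getNodeType() != Node.COMMENT_NODE) {
            return -1;
        }
        Matcher m = SOURCE_PAGES.matcher(last.getNodeValue());
        return m.matches() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Exit value: 0 when the counts match, 2 for a page count error (assumed for -1 too). */
    static int exitCode(int sourcePages, int pdfPages) {
        return sourcePages == pdfPages ? 0 : 2;
    }

    public static void main(String[] args) {
        String xhtml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<html><head><title>Sample</title></head><body></body></html>\n"
                + "<!-- 10 source pages -->\n";
        int sourcePages = sourcePageCount(parse(xhtml));
        int pdfPages = 10; // stand-in for the pageCount argument of preWrite()
        System.out.println("source pages: " + sourcePages
                + ", exit code: " + exitCode(sourcePages, pdfPages));
    }
}
```

    A comment that follows the root element is a child of the document node itself, which is why the sketch looks at the document's last child rather than searching inside the body.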

Technical Tips