TL;DR

How do you create a PDF from a JSON object that contains a string written in HTML?

Example JSON:

{
  dimensions: {
    height: 297,
    width: 210
  },
  boxes: [
    {
      dimensions: {
        height: 10,
        width: 190
      },
      position: {
        x: 10,
        y: 10
      },
      content: "<h1>Hello StackOverflow</h1>, I think you are <strong>great</strong>! I hope someone can answer this!"
    }
  ]
}

Tech used in front-end: AngularJS 1.4.9, ui.tinymce, ment.io

Back-end: whatever works.

I want to be able to create templates for PDFs. The user writes some text in a textarea and uses variables that will later be replaced with actual data. When the user presses a button, a PDF with the finished product should be returned. This should be very generic, so that it can be used in pretty much any project.

So, minimal example: The user writes a little text in TinyMCE like

<h1>Hello #[COMMUNITY]</h1>, I think you are <strong>great</strong>! I hope someone can answer this!

This text contains two variables that the user gets with the help of the ment.io plugin. The actual variables are supplied from the controller. The text is written in an AngularJS version of TinyMCE, which also has Ment.io attached to give a nice view of the available variables.

When the user presses the Save button, a JSON object like the following is created, which is the template.

{
  dimensions: {
    height: 297,
    width: 210
  },
  boxes: [
    {
      dimensions: {
        height: 10,
        width: 190
      },
      position: {
        x: 10,
        y: 10
      },
      content: "user input"
    }
  ]
}

I have a directive in Angular that can generate any number of boxes really, in any size (generic-ho!). This part works great. Simply send in how big you want the 'page' (in mm, so the example says A4-paper size) in the first dimensions object as you see in the object. Then in the boxes you define how big they should be, and where on the 'paper' it should go. And then finally the content, which the user writes in a TinyMCE textarea.

Next step: The back-end replaces the variables with actual data. Then pass it on to the generator.
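
As a rough illustration of that replacement step, here is a minimal sketch in plain Java. The class name PlaceholderFiller, the method fill, and the assumption that every variable looks like #[NAME] (letters, digits and underscores) are mine, taken only from the example text above.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderFiller {
   // Assumed placeholder syntax: #[NAME], as in the #[COMMUNITY] example.
   private static final Pattern PLACEHOLDER = Pattern.compile("#\\[([A-Za-z0-9_]+)\\]");

   static String fill(final String template, final Map<String, String> values) {
      final Matcher matcher = PLACEHOLDER.matcher(template);
      final StringBuffer result = new StringBuffer();
      while (matcher.find()) {
         final String name = matcher.group(1);
         // Unknown variables are left untouched so they stay visible in the output.
         final String replacement = values.getOrDefault(name, matcher.group());
         // quoteReplacement keeps '$' and '\' in the data from being read as group references.
         matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
      }
      matcher.appendTail(result);
      return result.toString();
   }

   public static void main(final String[] args) {
      System.out.println(fill("<h1>Hello #[COMMUNITY]</h1>", Map.of("COMMUNITY", "StackOverflow")));
   }
}
```

The values are inserted as-is; if the data can contain characters such as < or &, it would need HTML escaping before it is merged into the content.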

Then we come to the tricky part: the actual generator. This should preferably accept JSON, so that any project can use it. The front-end and the PDF generator go hand in hand; they don't care what's in the middle. This means that the generator can be written in pretty much anything. I'm a Java developer though, so Java is preferable (hence the Java tag).

Solutions I've found are:

PDFBox, but the problem with using that is the content that TinyMCE produces. TinyMCE outputs HTML or XML, and PDFBox does not handle this at all. That means I would have to write my own HTML or XML parser to figure out where the user wants bold text, italics, headings, another font, and so on. I really don't want that; I've been burned on that before. On the other hand, PDFBox is great for placing text in the correct places, even if it is only the raw text.

I've read that iText does HTML. But the AGPL license pretty much kills it.

I've also looked at Flying Saucer that takes XHTML and creates a PDF. But it seems to rely on iText.

The solution I'm looking at now is a convoluted way to use Apache FOP. FOP takes an XSL-FO object to work on, so the trouble here is to dynamically create that XSL-FO object. I've also read that the XSL-FO standard has been dropped, so I'm unsure how future-proof this approach will be. I've never worked with either FOP or XSLT, so the task seems daunting. What I'm currently looking at is taking the output from TinyMCE and running it through something like JTidy to get XHTML. From the XHTML, create an XSLT file (in some magical way). Create an XSL-FO object from the XHTML and XSLT. And then generate the PDF from the XSL-FO file. Please tell me there is an easier way.
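
To make the XHTML + XSLT -> XSL-FO step a bit more concrete, here is a small sketch using only the XSLT support that ships with the JDK (javax.xml.transform). The stylesheet is hypothetical and only handles h1 and strong, the input is assumed to be well-formed, namespace-free XHTML wrapped in a body element, and the output is just an FO fragment, not a complete fo:root document that FOP could render.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XhtmlToFo {
   // Hypothetical stylesheet: one template per supported tag, everything else falls through as text.
   private static final String XSLT =
         "<xsl:stylesheet version=\"1.0\""
         + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
         + " xmlns:fo=\"http://www.w3.org/1999/XSL/Format\">"
         + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
         + "<xsl:template match=\"/body\">"
         + "<fo:block><xsl:apply-templates/></fo:block>"
         + "</xsl:template>"
         + "<xsl:template match=\"h1\">"
         + "<fo:block font-size=\"24pt\" font-weight=\"bold\"><xsl:apply-templates/></fo:block>"
         + "</xsl:template>"
         + "<xsl:template match=\"strong\">"
         + "<fo:inline font-weight=\"bold\"><xsl:apply-templates/></fo:inline>"
         + "</xsl:template>"
         + "</xsl:stylesheet>";

   static String toFo(final String xhtml) {
      try {
         final Transformer transformer = TransformerFactory.newInstance()
               .newTransformer(new StreamSource(new StringReader(XSLT)));
         final StringWriter out = new StringWriter();
         transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
         return out.toString();
      } catch (TransformerException e) {
         throw new IllegalArgumentException("Could not transform the XHTML input", e);
      }
   }

   public static void main(final String[] args) {
      System.out.println(toFo("<body><h1>Hello</h1>, I think you are <strong>great</strong>!</body>"));
   }
}
```

The stylesheet would be written once per supported tag set rather than generated per document, which removes the "magical" part; the page layout and box placement would still have to be added around the fragment.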

I can't have been the first to want to do something like this. Yet searching for answers seems to yield very few actual results.

So my question is basically this: How do you create a PDF from a JSON-object like the above, which contains HTML, and get the resulting text to look like it does when you write it in TinyMCE? Have in mind that the object can contain an unlimited number of boxes.

qwelyt
  • Why down-vote? What can I make more clear? – qwelyt Feb 18 '16 at 13:39
  • There is a little confusion around the Flying Saucer licensing. I assume Flying Saucer uses an older version of iText which is not AGPL. I think I will ask the Flying Saucer project to confirm this. – Biswanath Oct 04 '16 at 14:33
  • flying saucer uses 2.1.7 of iText which is MPL. https://mvnrepository.com/artifact/com.lowagie/itext/2.1.7. Any version after this is AGPL. – Biswanath Oct 04 '16 at 14:42
  • Also please see this thread regarding iText licensing. – Biswanath Oct 04 '16 at 14:53

1 Answer

So. After some research and work I decided to actually go with PDFbox for the generation. I've also been very strict about what I accept as content input. Right now, I really just accept bold, italics and headings. So I look for <strong>, <em>, and <h[1-6]> tags.
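
To illustrate what "strict" can look like, here is a rough sketch of a pre-processing step that keeps only the accepted tags. It is not my actual code: the class ContentFilter, the method keepSupportedTags and the two regexes are made up for this example, and regex-based filtering like this is only reasonable because the input is editor output with a very small accepted tag set.

```java
import java.util.regex.Pattern;

final class ContentFilter {
   // A closing </p> becomes a line break, so paragraphs end up as separate lines.
   private static final Pattern PARAGRAPH_END = Pattern.compile("(?i)</p\\s*>");
   // Any tag other than a bare <strong>, <em> or <h1>..<h6> (opening or closing) is dropped.
   // Accepted tags that carry attributes are dropped too, which keeps the later parsing simple.
   private static final Pattern UNSUPPORTED_TAG =
         Pattern.compile("(?i)<(?!/?(?:strong|em|h[1-6])\\s*>)[^>]*>");

   static String keepSupportedTags(final String html) {
      final String withBreaks = PARAGRAPH_END.matcher(html).replaceAll("\n");
      return UNSUPPORTED_TAG.matcher(withBreaks).replaceAll("");
   }

   public static void main(final String[] args) {
      System.out.println(keepSupportedTags("<p>Hello <strong>StackOverflow</strong>!</p>"));
   }
}
```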

To begin with, I updated my input JSON a bit, more wrapping really.

{
   documents: [
      {
         pages: [
            {
               dimensions: {width: 210, height: 297},
               boxes: [
                  {
                     dimensions: {width: 190, height: 40},
                     placement: {x: 10, y: 10},
                     content: "Hello <strong>StackOverflow</strong>!"
                  }
               ]
            }
         ]
      }
   ]
}

The reason is that I want to be able to put lots and lots of documents in the same PDF. Think of a mass send-out of letters: each document is slightly different, but you still want it all in the same PDF. You could of course do all of this with just the pages level, but if one document is several pages long, it's nicer to have them separated, I think.
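
For reference, a sketch of plain Java classes that this JSON could be mapped onto. The field names follow the JSON and the way the generator code accesses them (bundle.documents, page.dimensions.width, box.placement.x); the wrapper class TemplateModel, the names Size and Position, and the countPages helper are only there to keep the example self-contained.

```java
import java.util.List;

final class TemplateModel {
   static final class Size {
      final float width;   // millimetres
      final float height;  // millimetres
      Size(final float width, final float height) { this.width = width; this.height = height; }
   }

   static final class Position {
      final float x;  // millimetres
      final float y;  // millimetres
      Position(final float x, final float y) { this.x = x; this.y = y; }
   }

   static final class Box {
      final Size dimensions;
      final Position placement;
      final String content;  // the HTML written in TinyMCE
      Box(final Size dimensions, final Position placement, final String content) {
         this.dimensions = dimensions;
         this.placement = placement;
         this.content = content;
      }
   }

   static final class Page {
      final Size dimensions;
      final List<Box> boxes;
      Page(final Size dimensions, final List<Box> boxes) { this.dimensions = dimensions; this.boxes = boxes; }
   }

   static final class Document {
      final List<Page> pages;
      Document(final List<Page> pages) { this.pages = pages; }
   }

   static final class Bundle {
      final List<Document> documents;
      Bundle(final List<Document> documents) { this.documents = documents; }
   }

   // Total number of PDF pages the bundle will produce, across all documents.
   static int countPages(final Bundle bundle) {
      int total = 0;
      for (final Document document : bundle.documents) {
         total += document.pages.size();
      }
      return total;
   }
}
```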

My actual code is about 500 lines long, so I won't paste it all here, just the basic parts to be of help, and that's still around 150 lines. Here goes:

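import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.commons.lang3.StringEscapeUtils;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream; // PDFBox 2.x; in 1.8.x this class is in org.apache.pdfbox.pdmodel.edit
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class Generator {
   // On PDFBox 1.8.x, save() also throws COSVisitorException.
   public static ByteArrayOutputStream generatePDF(final Bundle bundle) throws IOException {
      final ByteArrayOutputStream output = new ByteArrayOutputStream();

      final PDDocument pdf = new PDDocument();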
      for (final Document document : bundle.documents) {
         for (final Page page : document.pages) {
            pdf.addPage(generatePage(pdf, page));
         }
      }
      pdf.save(output);
      pdf.close();

      return output;
   }

   private static PDPage generatePage(final PDDocument document, final Page page) throws IOException {
      final PDRectangle rect = new PDRectangle(mmToPoints(page.dimensions.width), mmToPoints(page.dimensions.height));
      final PDPage pdPage = new PDPage(rect);
      final PDPageContentStream cs = new PDPageContentStream(document, pdPage);

      for (final Box box : page.boxes) {
         resetFont(cs); // Reset the font when starting new box so missing ending tags don't mess up the next box.

         final String pc = processContent(box.content); // Make the content prettier. Eg. strip all <p>, replace </p> with \n, strip all <div> tags, etc.

         lines(Arrays.asList(pc.split("\n")), box, cs);
      }
      cs.close();
      return pdPage;
   }

   private static float mmToPoints(final float mm) {
      // 1 inch == 72 points, 1 inch == 25.4 mm. So mm to points means (mm / mmPerInch) * pointsPerInch
      return (float) ((mm / 25.4) * 72);
   }

   private static void lines(final List<String> lines, final Box box, final PDPageContentStream cs) throws IOException {
      if (lines.size() == 0) { return; }
      cs.beginText();
      cs.moveTextPositionByAmount(mmToPoints(box.placement.x), mmToPoints(box.placement.y));
      // Now we begin the tricky part
      for (int i = 0, length = lines.size(); i < length; ++i) {
         final String line = lines.get(i);
         final List<Word> wordList = new ArrayList<>();
         final String[] splitArray = line.split(" ");
         final float fontHeight = fontHeight(currentFont(), currentFontSize()); // Documented elsewhere
         cs.appendRawCommands(fontHeight + " TL\n");
         if (i == 0) { addNewLine(cs); } // PDFbox starts at the bottom, we start at the top. Add new line so we are inside the box
         for (final String index : splitArray) {
            final String word = index + " "; // We removed spaces when we split on them, add it to words now.
            final StringBuilder wordBuilder = new StringBuilder();
            boolean addWord = true;
            for (int j = 0, wordLength = word.length(); j < wordLength; ++j) {
               final char c = word.charAt(j);
               if (c == '<') { // check for <strong> and those
                  final StringBuilder command = new StringBuilder();
                  if (addWord && wordBuilder.length() > 0) {
                     wordList.add(new Word(wordBuilder.toString(), currentFont(), currentFontSize()));
                     wordBuilder.setLength(0);
                     addWord = false;
                  }
                  for (; j < wordLength; ++j) {
                     final char c1 = word.charAt(j);
                     command.append(c1);
                     if (c1 == '>') {
                        if (j + 1 < wordLength) { addWord = true; }
                        break;
                     }
                  }
                  final boolean b = parseForFontChange(command.toString());
                  if (!b) { // If it wasn't a command, we want to append it to our text
                     wordBuilder.append(command.toString());
                  }
               } else if (c == '&') { // check for html escaped entities
                  final int longestHTMLEntityName = 24 + 2; // &ClockwiseContourIntegral;
                  final StringBuilder escapedChar = new StringBuilder();
                  escapedChar.append(c);
                  int k = 1;
                  for (; k < longestHTMLEntityName && j + k < wordLength; ++k) {
                     final char c1 = word.charAt(j + k);
                     if (c1 == '<' || c1 == '>') { break; } // Can't be an escaped char.
                     escapedChar.append(c1);
                     if (c1 == ';') { break; } // End of char
                  }
                  if (escapedChar.indexOf(";") < 0) { k--; }
                  wordBuilder.append(StringEscapeUtils.unescapeHtml4(escapedChar.toString()));
                  j += k;
               } else {
                  wordBuilder.append(c);
               }
            }
            if (addWord) {
               wordList.add(new Word(wordBuilder.toString(), currentFont(), currentFontSize()));
            }
         }
         writeWords(wordList, box, cs);
         if (i < length - 1) { addNewLine(cs); }
      }
      cs.endText();
   }

   public static void writeWords(final List<Word> words, final Box box, final PDPageContentStream cs) throws IOException {
      final float boxWidth = mmToPoints(box.dimensions.width);
      float lineWidth = 0;
      for (final Word word : words) {
         lineWidth += word.width;
         if (lineWidth > boxWidth) {
            addNewLine(cs);
            lineWidth = word.width;
         }
         if (lineWidth > boxWidth) { // Word longer than box width
            lineWidth = 0;
            final String string = word.string;
            for (int i = 0, length = string.length(); i < length; ++i) {
               final char c = string.charAt(i);
               final float charWidth = calculateStringWidth(String.valueOf(c), word.font, word.fontSize);
               lineWidth += charWidth;
               if (lineWidth > boxWidth) {
                  addNewLine(cs);
                  lineWidth = charWidth;
               }
               drawChar(c, word.font, word.fontSize, cs);
            }
         } else {
            drawWord(word, cs);
         }
      }
   }
}

public class Word {
   public final String string;
   public final PDFont font;
   public final float fontSize;
   public final float width;
   public final float height;

   public Word(final String string, final PDFont font, final float fontSize) {
      this.string = string;
      this.font = font;
      this.fontSize = fontSize;
      this.width = calculateStringWidth(string, font, fontSize);
      this.height = calculateStringHeight(string, font, fontSize);
   }
}

I hope this helps someone else facing the same problem. The reason to have a Word class is that it lets you split on words rather than chars. Lots of other posts describe how to write some of the helper methods, like calculateStringWidth, so they are not included here.
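
To show the word-level wrapping idea in isolation, here is a stripped-down sketch without any PDFBox calls. The class WordWrap and the widthOf function are stand-ins for this example; in the real code the width comes from the font metrics (calculateStringWidth), and words longer than the box are additionally split char by char, which this sketch does not do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

final class WordWrap {
   // Greedy wrapping: keep adding words to the current line until the next one would overflow.
   // Each word is expected to carry its own trailing space, as in the generator above.
   static List<String> wrap(final List<String> words, final double maxWidth,
                            final ToDoubleFunction<String> widthOf) {
      final List<String> lines = new ArrayList<>();
      final StringBuilder current = new StringBuilder();
      double currentWidth = 0;
      for (final String word : words) {
         final double wordWidth = widthOf.applyAsDouble(word);
         if (current.length() > 0 && currentWidth + wordWidth > maxWidth) {
            lines.add(current.toString());
            current.setLength(0);
            currentWidth = 0;
         }
         current.append(word);
         currentWidth += wordWidth;
      }
      if (current.length() > 0) {
         lines.add(current.toString());
      }
      return lines;
   }

   public static void main(final String[] args) {
      // Pretend every character is one unit wide; real code would ask the font.
      System.out.println(wrap(List.of("Hello ", "Stack ", "Overflow "), 12, String::length));
   }
}
```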

See "How to Insert a Linefeed with PDFBox drawString" for newlines and fontHeight.

See "How to generate multiple lines in PDF using Apache pdfbox" for string width.

In my case the parseForFontChange method changes the current font and font size. What's active is of course returned by the methods currentFont() and currentFontSize(). I use regexes like (?ui:(<strong>)) to check if a bold tag was in there. Use what suits you.
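
A simplified sketch of how such a method can be structured. It is not the code from my project: it keeps the state as plain flags in a made-up FontState class instead of switching PDFont objects, and it folds all accepted tags into one regex. Mapping the flags to an actual font and size is left out.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FontState {
   // Group 1: "/" for a closing tag, group 2: tag name, group 3: heading level (if any).
   private static final Pattern TAG = Pattern.compile("(?ui:<(/?)(strong|em|h([1-6]))>)");

   boolean bold;
   boolean italic;
   int headingLevel; // 0 means normal body text

   // Returns true if the command was a known tag and the state was updated,
   // false if it should be treated as ordinary text.
   boolean parseForFontChange(final String command) {
      final Matcher matcher = TAG.matcher(command);
      if (!matcher.matches()) {
         return false;
      }
      final boolean opening = matcher.group(1).isEmpty();
      final String name = matcher.group(2).toLowerCase(Locale.ROOT);
      if (name.equals("strong")) {
         bold = opening;
      } else if (name.equals("em")) {
         italic = opening;
      } else {
         headingLevel = opening ? Integer.parseInt(matcher.group(3)) : 0;
      }
      return true;
   }

   public static void main(final String[] args) {
      final FontState state = new FontState();
      System.out.println(state.parseForFontChange("<strong>") + " bold=" + state.bold);
   }
}
```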

qwelyt