I used the code from this tutorial, http://ourcodeworld.com/articles/read/405/how-to-convert-pdf-to-text-extract-text-from-pdf-with-javascript, to set up the PDF-to-text conversion.

I looked all over https://mozilla.github.io/pdf.js/ for hints on how to format the converted text, but couldn't find anything. Does anyone know how to output line breaks as \n when parsing text with pdf.js?

Thanks in advance.

Thomas Valadez
  • Have you tried replacing every `\r` with `\\r`, and likewise `\n` with `\\n`, using something like `string.replace(/\r/g,'\\r').replace(/\n/g,'\\n');`? Note for those who don't know: `\r` (carriage return) is commonly paired with a newline character in some environments (e.g. Windows). – Patrick Barr Jun 05 '17 at 19:50
  • Yeah, I tried, except the `\n` never exists in the output. I am worried that pdf.js just overlooks newline characters. – Thomas Valadez Jun 05 '17 at 19:57

1 Answer

In PDF there is no such thing as controlling layout with control characters such as '\n'; glyphs in a PDF are positioned using exact coordinates. Use the text's y-coordinate (which can be extracted from the transform matrix) to detect a line change.

var url = "https://cdn.mozilla.net/pdfjs/tracemonkey.pdf";
var pageNumber = 2;
// Load document
PDFJS.getDocument(url).then(function (doc) {
  // Get a page
  return doc.getPage(pageNumber);
}).then(function (pdfPage) {
  // Get page text content
  return pdfPage.getTextContent();
}).then(function (textContent) {
  var p = null;
  var lastY = -1;
  textContent.items.forEach(function (i) {
    // Tracking Y-coord and if changed create new p-tag
    if (lastY != i.transform[5]) {
      p = document.createElement("p");
      document.body.appendChild(p);
      lastY = i.transform[5];
    }
    p.textContent += i.str;
  });
});
<script src="https://npmcdn.com/pdfjs-dist/build/pdf.js"></script>
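
If you want a plain string with `\n` rather than `<p>` elements, the same y-coordinate check can be applied while concatenating. The following is only a minimal sketch: `itemsToText`, its `tolerance` parameter, and `sampleItems` are hypothetical names for illustration and are not part of pdf.js. It assumes each item has the `str` and `transform` fields used in the snippet above, with the y-coordinate at `transform[5]`.

```javascript
// Hypothetical helper (not part of pdf.js): joins text items into one string,
// inserting "\n" whenever the y-coordinate (transform[5]) moves to a new line.
function itemsToText(items, tolerance) {
  // Assumed parameter: y differences up to this value count as the same line.
  var tol = tolerance === undefined ? 0 : tolerance;
  var text = "";
  var lastY = null; // baseline of the line currently being built
  items.forEach(function (item) {
    var y = item.transform[5];
    if (lastY === null) {
      lastY = y; // first item starts the first line
    } else if (Math.abs(y - lastY) > tol) {
      text += "\n"; // y changed: start a new line
      lastY = y;
    }
    text += item.str;
  });
  return text;
}

// Made-up items shaped like the entries of textContent.items.
var sampleItems = [
  { str: "Hello ", transform: [1, 0, 0, 1, 72, 700] },
  { str: "world", transform: [1, 0, 0, 1, 110, 700] },
  { str: "Second line", transform: [1, 0, 0, 1, 72, 686] }
];

console.log(JSON.stringify(itemsToText(sampleItems)));
// prints "Hello world\nSecond line"
```

In the snippet above this could be called inside the `getTextContent()` callback as `itemsToText(textContent.items)`. A small nonzero tolerance may help when items on one visual line have slightly different baselines, but the right value depends on the document.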
async5