How to parse a PDF in nodejs

Question

I am trying to parse a PDF and categorize its contents based on text formatting/decoration. How would you suggest I do that? For example, I have a PDF in which the following structure is repeated: S.No., BOLD+UNDERLINED TITLE, paragraph.

How do I categorize this data into an array of objects based on text decoration:

[ 
  { sno: "", title: "", desc: "" }, 
  ... 
]

score 2 · Accepted Answer · answered May 18 '20 at 06:04

I went through the pdf2json documentation and found that, after parsing the PDF, the values I need are on the pdfData.formImage.Pages[pageNumber].Texts[wordNumber].R[0] object.

The TS property of that object is an array; TS[2] indicates whether the text is bold (value = 1) or not (value = 0). I could not find anything in the output describing underline text decoration.

I also needed to initialize the parser as follows: let pdfParser = new PDFParser(null, 1).
Check this for more details.
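
For anyone wanting a starting point, here is a rough sketch of how the grouping could look once the PDF is parsed. It only relies on the bold flag at TS[2] described above. The helper name groupRuns, the serial-number pattern (digits with an optional trailing dot), and the rule that a serial number followed by bold text starts a new record are all my own assumptions, not part of pdf2json. It also assumes the text in R[0].T is URI-encoded, which is why it is passed through decodeURIComponent. Treat it as a sketch to adapt to your document, not a definitive implementation.

```javascript
// Hypothetical helper (not part of pdf2json): groups text runs into
// { sno, title, desc } records using the bold flag at TS[2].
// `pages` is the Pages array from the parsed output, e.g.
// pdfData.formImage.Pages (the exact location may differ between versions).
function groupRuns(pages) {
  const records = [];
  let current = null;    // record currently being filled
  let pendingSno = null; // number-like run waiting to see if a title follows

  const append = (field, value) => {
    current[field] = current[field] ? current[field] + " " + value : value;
  };

  for (const page of pages) {
    for (const text of page.Texts) {
      const run = text.R[0];
      // Assumption: T holds URI-encoded text.
      const value = decodeURIComponent(run.T).trim();
      if (value === "") continue;
      const isBold = run.TS[2] === 1;

      if (isBold) {
        if (pendingSno !== null || current === null) {
          // Bold text right after a serial number (or at the very start)
          // opens a new record.
          current = { sno: pendingSno || "", title: "", desc: "" };
          records.push(current);
          pendingSno = null;
          append("title", value);
        } else if (current.desc === "") {
          // Title split across several bold runs.
          append("title", value);
        } else {
          // Bold text inside a paragraph stays in the description.
          append("desc", value);
        }
      } else {
        if (pendingSno !== null) {
          // The number was not followed by a title, so it was body text.
          if (current) append("desc", pendingSno);
          pendingSno = null;
        }
        if (/^\d+\.?$/.test(value)) {
          pendingSno = value; // assumed serial-number format: "1" or "1."
        } else if (current) {
          append("desc", value);
        }
      }
    }
  }
  if (pendingSno !== null && current) append("desc", pendingSno);
  return records;
}

// Hand-written sample in the shape described above (not real parser output).
const samplePages = [{
  Texts: [
    { R: [{ T: "1.", TS: [0, 12, 0, 0] }] },
    { R: [{ T: "First%20title", TS: [0, 12, 1, 0] }] },
    { R: [{ T: "Some%20body%20text", TS: [0, 12, 0, 0] }] },
  ],
}];
console.log(groupRuns(samplePages));
```

To use it with a real file, you would call it from the parser's pdfParser_dataReady handler, for example pdfParser.on("pdfParser_dataReady", pdfData => groupRuns(pdfData.formImage.Pages)), and then start parsing with pdfParser.loadPDF(path). Since underline information does not appear to be exposed, the sketch uses bold alone to detect titles.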
