
I have a collection of PDF files that store information in the following two-column format:

Line no 1     Line no. 11    
Line no 2     Line no. 12
.             .
.             .
.             .
Line no 10    Line no N

I am using the pdfplumber library to extract the PDFs' text content, but instead of reading lines 1 to 10 first and then moving on to line 11 (and so on), pdfplumber reads line 1 and line 11 together as a single line. Consider the output below:

Line no 1 Line no. 11    
Line no 2 Line no. 12
.             
.             
.             

What I expect:

Line no. 1
Line no. 2
.
.
.
Line no. 11
.
.
.

Here is the link to the pdf which I am trying to read.

Glimpse of PDF: Sample pdf

I tried the extract_table() utility from the pdfplumber library with table settings, but it didn't work (I followed the answer at https://stackoverflow.com/a/63133876/10011503).

Do I need to pass specific table settings as an argument to pdfplumber.open('path_to_pdf').pages[0].extract_table(), or is there another utility or workaround?

s.k

1 Answer


I don't see a table in your PDF section above. I suggest you use the

Page.extract_text(...)

call instead.

The pdfplumber README links to an example of extracting fixed-width text at https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/san-jose-pd-firearm-report.ipynb, which is more similar to your drug PDF.
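For the two-column layout in the question, one possible approach is to crop each page into a left and a right region and call extract_text() on each region in turn, so the left column is read completely before the right one. The following is only a rough sketch: the helper names (column_bboxes, read_two_columns, read_pdf) are made up for illustration, and it assumes that the columns split at the horizontal midpoint and that the page's coordinates start at (0, 0). I have not checked either assumption against your sample PDF, so split_ratio may need adjusting.

```python
def column_bboxes(width, height, split_ratio=0.5):
    """Return (left, right) bounding boxes as (x0, top, x1, bottom).

    split_ratio is an assumed position of the gutter between the columns,
    expressed as a fraction of the page width.
    """
    split_x = width * split_ratio
    return (0, 0, split_x, height), (split_x, 0, width, height)


def read_two_columns(page, split_ratio=0.5):
    """Extract the left column's text first, then the right column's."""
    parts = []
    for bbox in column_bboxes(page.width, page.height, split_ratio):
        text = page.crop(bbox).extract_text()
        if text:  # skip regions that yield no text
            parts.append(text)
    return "\n".join(parts)


def read_pdf(path, split_ratio=0.5):
    """Read every page of the PDF at `path`, column by column."""
    # Imported here so the helpers above can be tried without pdfplumber.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return "\n".join(read_two_columns(p, split_ratio) for p in pdf.pages)
```

Calling print(read_pdf('path_to_pdf')) should then list the left column's lines before the right column's. If the gutter is not in the middle of the page, the x0 values returned by page.extract_words() can help you pick a better split_ratio.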

rajah9