python pdfplumber: extract pdf with data split into 2 columns

Question

I have a collection of PDF files that store information in the following format:

Line no 1     Line no. 11    
Line no 2     Line no. 12
.             .
.             .
.             .
Line no 10    Line no N

I am using the pdfplumber library to extract the PDFs' text content. Instead of reading lines 1 to 10 first and then moving on to line 11 (and so on), pdfplumber reads line 1 and line 11 together as a single line. Consider the output below:

Line no 1 Line no. 11    
Line no 2 Line no. 12
.             
.             
.

What I expect:

Line no. 1
Line no. 2
.
.
.
Line no. 11
.
.
.

Here is the link to the pdf which I am trying to read.

Glimpse of PDF:

I tried pdfplumber's extract_table() utility with custom table settings, but it didn't work (I referred to this answer: https://stackoverflow.com/a/63133876/10011503).

Do I need to pass specific table settings as an argument to pdfplumber.open('path_to_pdf').pages[0].extract_table(), or is there another utility or workaround?

score 0 · Answer 1 · answered Sep 11 '20 at 13:16

I don't see a table in your PDF section above. I suggest you use the

Page.extract_text(...)

call instead.

The pdfplumber README points to an example of extracting fixed-width text at https://github.com/jsvine/pdfplumber/blob/stable/examples/notebooks/san-jose-pd-firearm-report.ipynb, which is closer to the layout of your drug PDF.
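If extract_text() on the whole page still interleaves the two columns, one possible workaround is to crop the page into a left and a right region and extract each in turn. The following is only a minimal sketch: it assumes the column boundary sits at the horizontal midpoint of the page, which I have not checked against your PDF. The helper names (split_boxes, read_columns, read_pdf) and the split_ratio parameter are made up for illustration; Page.bbox, Page.crop() and extract_text() are pdfplumber calls.

```python
def split_boxes(bbox, split_ratio=0.5):
    """Split a page bbox (x0, top, x1, bottom) into left and right boxes."""
    x0, top, x1, bottom = bbox
    split_x = x0 + (x1 - x0) * split_ratio  # assumed column boundary
    return (x0, top, split_x, bottom), (split_x, top, x1, bottom)


def read_columns(page, split_ratio=0.5):
    """Return the left column's text followed by the right column's text."""
    parts = []
    for box in split_boxes(page.bbox, split_ratio):
        # crop() limits extraction to the objects inside the box
        text = page.crop(box).extract_text() or ""
        if text:
            parts.append(text)
    return "\n".join(parts)


def read_pdf(path, split_ratio=0.5):
    """Read every page of a two-column PDF, column by column."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        return "\n".join(read_columns(page, split_ratio) for page in pdf.pages)
```

Calling read_pdf('path_to_pdf') should then give the left column's lines before the right column's lines for each page. If the gutter is not centred, adjust split_ratio; a line that straddles the boundary may get cut, so it is worth checking the result on a page or two.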
