Extract tables from a multi-column PDF using Python

Question

I have a PDF with the following layout:

Lorem ipsum dolor sit amet, consectetur        |Table 2                                        | 
adipiscing elit. Praesent in tortor consequat, |+---------------------------------------------+|
rutrum dolor fringilla, gravida felis.         ||              |               |              ||
Suspendisse quis condimentum diam, ut congue   ||              |               |              ||
quam.                                          |+---------------------------------------------+|
                                               ||              |               |              ||
Table 1                                        ||              |               |              ||
+---------------------------------------------+|+---------------------------------------------+|
|              |               |              ||Lorem ipsum dolor sit amet, consectetur        |
|              |               |              ||adipiscing elit. Praesent in tortor consequat, |
|              |               |              ||rutrum dolor fringilla, gravida felis.         |
|              |               |              ||Suspendisse quis condimentum diam, ut congue   |
+---------------------------------------------+|quam.                                          |
                                               |                                               |
Lorem ipsum dolor sit amet, consectetur        |                                               |
                                               |                                               |

I am trying to extract the two tables labelled Table 1 and Table 2. This is the code I have right now:

import tabula

df = tabula.read_pdf("path_to_pdf")

However, it treats the whole page as a single table with two columns instead of returning the two tables, Table 1 and Table 2.

Current output: one table with two columns, the first being the left column of the page and the second being the right column of the page.

Desired output: two tables with three columns each, Table 1 and Table 2.

score 1 · Answer 1 · answered Dec 18 '20 at 17:13

Have you tried the "multiple_tables" argument?

df = tabula.read_pdf(file_path, multiple_tables=True)

As noted in the tabula-py docs:

https://tabula-py.readthedocs.io/en/latest/faq.html#i-want-to-extract-multiple-tables-from-a-document
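If multiple_tables=True still returns the whole page as one two-column table, another thing to try is restricting tabula-py to one page column at a time with the area argument of read_pdf, which takes [top, left, bottom, right] in PDF points. The sketch below is only an illustration under assumptions: column_areas and read_tables_per_column are hypothetical helper names, the 612 x 792 pt page size is a placeholder that has to be measured for the actual file, and lattice=True assumes the tables have ruled cell borders as in the drawing above.

```python
def column_areas(page_width, page_height, gutter=0.0):
    """Split a page into left/right halves as [top, left, bottom, right] boxes (PDF points)."""
    mid = page_width / 2.0
    left = [0.0, 0.0, page_height, mid - gutter / 2.0]
    right = [0.0, mid + gutter / 2.0, page_height, page_width]
    return left, right


def read_tables_per_column(pdf_path, page=1, page_width=612.0, page_height=792.0):
    """Run tabula-py once per page column and collect the DataFrames it finds."""
    # Imported here so column_areas() can be used without tabula-py/Java installed.
    import tabula

    tables = []
    for area in column_areas(page_width, page_height):
        # Assumption: ruled tables, hence lattice=True. With multiple_tables=True,
        # read_pdf returns a list of DataFrames, one per detected table.
        tables.extend(
            tabula.read_pdf(
                pdf_path,
                pages=page,
                area=area,
                lattice=True,
                multiple_tables=True,
            )
        )
    return tables
```

Calling read_tables_per_column("path_to_pdf") would then give one list with the tables found in the left half followed by those found in the right half. Whether Table 1 and Table 2 come back with three columns each depends on how the PDF draws its cell borders, so the result needs checking against the real file.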

BryanLikesToProgram

Yup, it gives the same output – Eagle Dec 18 '20 at 18:00
Does the output contain both tables (even if they are nested inside the parent df)? If so, you can just pull that subset out of the parent DataFrame. It's significantly easier if the tables have headings and labels. – BryanLikesToProgram Dec 18 '20 at 18:17
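
To illustrate the "pull the subset out of the parent dataframe" idea from the comment above, here is a rough pandas sketch. Everything in it is an assumption: rows_after_label is a hypothetical helper, the parent frame is a toy stand-in with made-up cell values (real tabula output will look different), and it presumes each table row arrives as one whitespace-separated string under its "Table N" label.

```python
import pandas as pd


def rows_after_label(parent, column, label, n_rows):
    """Return the n_rows cells that follow the row whose `column` cell equals `label`."""
    cells = parent[column].astype(str).str.strip()
    matches = cells[cells == label].index
    if len(matches) == 0:
        raise ValueError(f"label {label!r} not found in column {column!r}")
    start = parent.index.get_loc(matches[0]) + 1
    return parent[column].iloc[start:start + n_rows].reset_index(drop=True)


# Toy stand-in for the two-column frame described in the question (made-up values).
parent = pd.DataFrame({
    "left": ["Lorem ipsum", "Table 1", "a b c", "d e f", "Lorem ipsum"],
    "right": ["Table 2", "g h i", "j k l", "Lorem ipsum", "quam."],
})

# Slice out the rows under each label, then split each row string into cells.
table1 = rows_after_label(parent, "left", "Table 1", 2).str.split(expand=True)
table2 = rows_after_label(parent, "right", "Table 2", 2).str.split(expand=True)
```

This only works when the labels survive extraction and the number of rows per table is known, which is why headings and labels make the job easier.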
