How to extract table data from a PDF in Python?

Asked Jul 05 '19 at 07:14

Active Jul 05 '19 at 07:16

Viewed 62 times

I tried the PyPDF2, Tabula, Camelot, and pdfminer libraries, but I ran into data alignment issues and data loss. Is there any solution?

edited Jul 05 '19 at 07:16

eyllanesc

asked Jul 05 '19 at 07:14

Pavan Prayaga


Hi, and welcome! Can you give an example of the issues? Please read [ask] – betontalpfa Jul 05 '19 at 07:18
The data of each column is ending up in one cell – Pavan Prayaga Jul 05 '19 at 07:26
I mean, give a [mcve] – betontalpfa Jul 05 '19 at 07:27
For example, the values 1,305 1,239 24,004 should go into different rows of the CSV, but I am getting all three values in a single cell, one below the other. – Pavan Prayaga Jul 05 '19 at 07:29

Show us your code: We cannot tell you what goes wrong simply based on the result – nicolallias Jul 05 '19 at 07:31
I think all packages that extract data from a PDF source work from the PDF's page template. In that template, unlike in DOCX files, there are no cells as far as I know, so it would be difficult to extract your data based on that sample. – Kaies LAMIRI Jul 05 '19 at 08:01
import camelot
tables = camelot.read_pdf(h, pages="all")
table_count = len(tables)
for i in range(table_count):
    tables.export("pdf_test_csv", f="csv", compress=True)
    tables[i].parsing_report
    tables[i].to_csv(temp + key_string + "_" + str(i) + ".csv", index=False)
– Pavan Prayaga Jul 05 '19 at 10:55
@Pavan Prayaga: Try SLICEmyPDF in 1 of the answers at https://stackoverflow.com/questions/56017702/how-to-extract-table-from-pdf-in-python/72414309#72414309 – 123456 May 28 '22 at 09:35
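The symptom described in the comments above (several values stacked one below the other in a single cell) can sometimes be worked around after extraction rather than during it. The following is only a minimal sketch, assuming the stacked values are separated by newline characters and that every cell in a row stacks its values in the same order. The helper names `split_stacked_cells` and `unstack_table` and the sample data are hypothetical, not part of any of the libraries mentioned.

```python
from itertools import zip_longest


def split_stacked_cells(row, fill=""):
    """Expand one extracted row whose cells hold newline-separated values
    into several rows, one per stacked value (hypothetical helper)."""
    # Split every cell on newlines; a cell without newlines yields one value.
    columns = [str(cell).split("\n") for cell in row]
    # Re-pair the values position by position, padding shorter cells.
    return [list(values) for values in zip_longest(*columns, fillvalue=fill)]


def unstack_table(rows):
    """Apply split_stacked_cells to every row of an extracted table."""
    result = []
    for row in rows:
        result.extend(split_stacked_cells(row))
    return result


# Made-up sample mimicking the reported problem: three values in one cell.
extracted = [
    ["Item", "Amount"],
    ["A\nB\nC", "1,305\n1,239\n24,004"],
]
clean_rows = unstack_table(extracted)
```

With Camelot, the input rows could presumably be taken from a table's DataFrame, for example `tables[i].df.values.tolist()`, and the cleaned rows written out with the standard `csv` module. Whether this helps depends on how the PDF was parsed; if the extractor merged columns as well as rows, splitting on newlines alone will not recover the layout.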

0 Answers