I tried the PyPDF2, Tabula, Camelot, and pdfminer libraries to extract tables from a PDF, but I ran into data alignment issues and data loss. Is there a solution?
-
Hi, and welcome! Can you give an example of the issues? Please read [ask] – betontalpfa Jul 05 '19 at 07:18
-
All the data of each column ends up in one cell – Pavan Prayaga Jul 05 '19 at 07:26
-
I mean, give a [mcve] – betontalpfa Jul 05 '19 at 07:27
-
For example, the values 1,305, 1,239 and 24,004 should go into different rows of the CSV, but I am getting all three values in a single cell, one below the other. – Pavan Prayaga Jul 05 '19 at 07:29
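The symptom described here (several values stacked in one cell, separated by line breaks) can sometimes be repaired after extraction. The following is a minimal sketch, assuming the stacked values are separated by `\n` and that every column in a row is stacked the same way; `explode_stacked_cells` and the "Item A/B/C" labels are hypothetical names used only for illustration.

```python
from itertools import zip_longest


def explode_stacked_cells(rows, sep="\n"):
    """Expand rows whose cells hold several stacked values into one row per value."""
    out = []
    for row in rows:
        # Split every cell of the row on the separator.
        parts = [str(cell).split(sep) for cell in row]
        # zip_longest pads shorter columns with "" so ragged cells stay aligned.
        for values in zip_longest(*parts, fillvalue=""):
            out.append([v.strip() for v in values])
    return out


# One extracted row in which each cell holds three stacked values (made-up labels).
stacked = [["Item A\nItem B\nItem C", "1,305\n1,239\n24,004"]]
rows = explode_stacked_cells(stacked)
for r in rows:
    print(r)
```

This only realigns values that were already extracted; it cannot recover data the extractor dropped.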
-
Show us your code: we cannot tell you what goes wrong based on the result alone – nicolallias Jul 05 '19 at 07:31
-
I think all packages that extract data from a PDF source rely on the PDF's layout. Unlike DOCX files, a PDF has no cells as far as I know, so it would be difficult to extract your data based on that sample. – Kaies LAMIRI Jul 05 '19 at 08:01
-
import camelot
tables = camelot.read_pdf(h, pages="all")
table_count = len(tables)
for i in range(table_count):
    tables.export("pdf_test_csv", f="csv", compress=True)
    tables[i].parsing_report
    tables[i].to_csv(temp + key_string + "_" + str(i) + ".csv", index=False)
– Pavan Prayaga Jul 05 '19 at 10:55
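A tidier variant of the snippet above might look like the sketch below. It is an assumption, not a confirmed fix: `flavor="stream"` and `split_text=True` are Camelot options that sometimes help when a whole column lands in one cell, but whether they help depends on the PDF. The names `export_tables`, `csv_name`, `pdf_path`, `out_dir`, and `prefix` are hypothetical stand-ins for the undefined `h`, `temp`, and `key_string`.

```python
import os


def csv_name(out_dir, prefix, index):
    """Build the output path for the table at the given index."""
    return os.path.join(out_dir, f"{prefix}_{index}.csv")


def export_tables(pdf_path, out_dir, prefix):
    """Write each table Camelot finds in pdf_path to its own CSV file."""
    import camelot  # third-party package, imported lazily

    # Assumption: the stream flavor with split_text suits this PDF;
    # ruled tables may work better with the default lattice flavor.
    tables = camelot.read_pdf(
        pdf_path, pages="all", flavor="stream", split_text=True
    )
    for i, table in enumerate(tables):
        # parsing_report is a dict with accuracy and whitespace metrics.
        print(table.parsing_report)
        table.to_csv(csv_name(out_dir, prefix, i), index=False)
    return len(tables)
```

Calling `export_tables("input.pdf", "out", "report")` would write `report_0.csv`, `report_1.csv`, and so on into `out`, assuming that directory exists.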
-
@Pavan Prayaga: Try SLICEmyPDF, described in one of the answers at https://stackoverflow.com/questions/56017702/how-to-extract-table-from-pdf-in-python/72414309#72414309 – 123456 May 28 '22 at 09:35