I tried the PyPDF2, Tabula, Camelot, and pdfminer libraries, but I ran into data alignment issues and data loss. Is there a solution?

  • Hi, and welcome! Can you give an example of the issues? Please read [ask] – betontalpfa Jul 05 '19 at 07:18
  • The data of each column ends up in a single cell – Pavan Prayaga Jul 05 '19 at 07:26
  • I mean, give a [mcve] – betontalpfa Jul 05 '19 at 07:27
  • For example, the values 1,305, 1,239, and 24,004 should each land in a separate row of the CSV, but I am getting all three values in a single cell, one below the other. – Pavan Prayaga Jul 05 '19 at 07:29
  • Show us your code: we cannot tell you what goes wrong based on the result alone – nicolallias Jul 05 '19 at 07:31
  • I think all packages that extract data from a PDF source work from the PDF's page layout. Unlike DOCX files, a PDF has no cells as far as I know, so it would be difficult to extract your data based on that sample. – Kaies LAMIRI Jul 05 '19 at 08:01
  • import camelot
    tables = camelot.read_pdf(h, pages="all")
    table_count = len(tables)
    for i in range(table_count):
        tables.export("pdf_test_csv", f="csv", compress=True)
        tables[i].parsing_report
        tables[i].to_csv(temp + key_string + "_" + str(i) + ".csv", index=False)
    – Pavan Prayaga Jul 05 '19 at 10:55
  • @Pavan Prayaga: Try SLICEmyPDF, described in one of the answers at https://stackoverflow.com/questions/56017702/how-to-extract-table-from-pdf-in-python/72414309#72414309 – 123456 May 28 '22 at 09:35
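
The symptom described in the comments (several values stacked in one cell, one below the other) can sometimes be repaired after the CSV has been written, rather than inside the extraction library. The following is only a minimal sketch using the Python standard library, not part of Camelot or any other package mentioned here. It assumes the stacked values are separated by line breaks inside the cell and that stacked cells in the same row line up position by position. The names `split_stacked_cells` and `rewrite_csv` are hypothetical helpers introduced for illustration.

```python
import csv
import io
from itertools import zip_longest


def split_stacked_cells(rows):
    """Expand rows whose cells hold several line-break-separated values.

    Each input row is a list of strings. A cell such as
    "1,305\n1,239\n24,004" becomes three output rows; shorter cells in
    the same row are padded with empty strings.
    """
    expanded = []
    for row in rows:
        # Split every cell on line breaks; an empty cell yields one empty piece.
        pieces = [cell.splitlines() or [""] for cell in row]
        # zip_longest walks the columns in parallel and pads the short ones.
        for parts in zip_longest(*pieces, fillvalue=""):
            expanded.append([part.strip() for part in parts])
    return expanded


def rewrite_csv(text):
    """Read CSV text, expand stacked cells, and return the new CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(split_stacked_cells(rows))
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample row that mimics the reported symptom.
    sample = [["Item", "Value"], ["A\nB\nC", "1,305\n1,239\n24,004"]]
    for line in split_stacked_cells(sample):
        print(line)
```

This only helps when the extractor kept all the values and merely merged them into one cell. It cannot recover values that were dropped during extraction, and it will misalign rows if the stacked cells in one row do not correspond to each other positionally.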

0 Answers