I am trying to extract data from invoices (PDF), write that data to a CSV file, and show the needed information in a GUI (for example, how many units of a product were sold in a given week).

I can't use pypdf because the "Print to PDF" feature in Windows apparently stores the PDFs it creates as some kind of image. For reference: Pypdf extracts code from one PDF, but not from another?

My problem:

I am using this code to extract the data (a nice person on this site already helped me with that):

from tika import parser
raw = parser.from_file('2.pdf')
print(raw['content'])

That gives me:

Produktliste Schickmaier Excel.xlsx

LIEFERSCHEIN

Kunde Customer Adresse Adress

Adress Data Data

K/DB-Nr. 211 Contact

Preis/N M Gesamtpreis

Bio Erdbeer-Chilischokolade 3,05 € 20 61,09 € Bio Beuscherl 5,23 € 6 31,36 € Bio ChiliconCarne 5,98 € 15 89,77 € Bio Geschnetzeltes 5,23 € 15 78,41 €

Versand Brutto Versand Netto - €

Warenwert netto 10% 260,64 € Umsatzsteuer 10% 26,06 €

RECHNUNGSBETRAG BRUTTO 286,70 € Seite 1/1

2019/

Data

I have tried numerous times to use that data, either cleaning it up in memory or writing it to a txt or csv file and cleaning it up afterwards, but nothing works. It would already help a lot if I could at least write it to a txt file and go from there. That is not nice at all, but I am new and have limited possibilities. Best would be to write it to a CSV in cleaned-up form, add all the other invoices, and then use the data, which is what I am planning to do, but programming is hard. I have already started working on the GUI, but this data issue is holding me back.

I also spent hours watching videos and trying to find a solution, but I couldn't get anything to run that would give me even approximately what I need. I promise I am not taking up your time without having searched myself first.

Ideally I would get one CSV line for every invoice, with the words in separate cells, so I can add them up and make the development of our new small company more visible while teaching myself how to program. Many thanks!

Schicki

1 Answer

If you're just looking to get each word into a different cell, run a find and replace over the text string, replacing every line break or space with a comma. Add an exception in the find/replace to escape existing commas by surrounding the value with double quotes (i.e. 23,456 -> "23,456"). Once spaces and breaks are replaced with commas, you can save the string as a .csv file. If you're only looking to extract certain values, I think it would be helpful to become familiar with regular expressions.
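As a rough illustration of the regular-expression route, here is a minimal sketch that pulls the item rows out of the text shown in the question. It assumes every item looks like "name, unit price in €, quantity, total in €" with German decimal commas; the pattern and the names `ITEM_RE`, `parse_items` and `items_to_csv` are made up for this example, and other invoice layouts would need a different pattern.

```python
import csv
import io
import re

# Assumed item layout: "<name> <unit price> € <quantity> <total> €"
ITEM_RE = re.compile(
    r"(?P<name>\S.*?)\s+"                   # product name (shortest possible match)
    r"(?P<price>\d[\d.]*,\d{2})\s*€\s+"     # unit price, e.g. 3,05 €
    r"(?P<qty>\d+)\s+"                      # quantity
    r"(?P<total>\d[\d.]*,\d{2})\s*€"        # line total, e.g. 61,09 €
)


def to_float(number):
    """Convert a German-formatted number such as '1.234,56' to a float."""
    return float(number.replace(".", "").replace(",", "."))


def parse_items(text):
    """Return one dict per invoice item found in the text."""
    return [
        {
            "name": m.group("name"),
            "unit_price": to_float(m.group("price")),
            "quantity": int(m.group("qty")),
            "total": to_float(m.group("total")),
        }
        for m in ITEM_RE.finditer(text)
    ]


def items_to_csv(items):
    """Render the items as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "unit_price", "quantity", "total"]
    )
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


# The item line copied from the output in the question
sample = (
    "Bio Erdbeer-Chilischokolade 3,05 € 20 61,09 € Bio Beuscherl 5,23 € 6 31,36 € "
    "Bio ChiliconCarne 5,98 € 15 89,77 € Bio Geschnetzeltes 5,23 € 15 78,41 €"
)
items = parse_items(sample)
csv_text = items_to_csv(items)
print(csv_text)
```

This produces one CSV row per item rather than one per invoice; to get a single row per invoice you could add up the quantities or totals in `items` before writing.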

Here's some information to get line breaks in .csv files

Zach Fey
  • The thing is, it doesn't seem to be a string; I can't interact with it for some reason. I tried writing it to a txt file just to test, but that gives me: TypeError: write() argument must be str, not None – Schicki Nov 21 '19 at 17:39
  • The answer here might help with the write() error: https://stackoverflow.com/questions/42424379/how-to-fix-typeerror-write-argument-must-be-str-not-none – Zach Fey Nov 21 '19 at 18:05
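
The TypeError in the comment means that `None` was passed to `write()`. One possible cause, assuming the dictionary returned by `parser.from_file` holds `None` under `'content'` when no text could be extracted, is a PDF without a text layer. A minimal sketch of a guard follows; the helper names `content_or_empty` and `save_text` are made up for this example.

```python
def content_or_empty(parsed):
    """Return the extracted text, or an empty string if there is none."""
    content = parsed.get("content")
    return content.strip() if content else ""


def save_text(parsed, path):
    """Write the extracted text to a file, failing loudly if nothing was extracted."""
    text = content_or_empty(parsed)
    if not text:
        raise ValueError("No text was extracted; the PDF may contain only images.")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Hypothetical usage with the code from the question:
#   from tika import parser
#   save_text(parser.from_file('2.pdf'), '2.txt')
```

If the `ValueError` is raised for a particular file, the problem lies in the extraction step for that PDF and not in the file writing.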