I am trying to extract data from invoices (PDF), write that data to a CSV file, and then pull the information I need into a GUI (for example, how many units of a product were sold in a given week).
I can't use pypdf because the "Print to PDF" feature in Windows apparently stores the PDFs it creates as some kind of image. For reference: Pypdf extracts code from one PDF, but not from another?
My problem:
I am using this code to extract the data (a nice person on this site already helped me with that):
from tika import parser

# Parse the PDF with Tika; the result is a dict
raw = parser.from_file('2.pdf')
# The 'content' key holds the extracted plain text
print(raw['content'])
That gives me:
Produktliste Schickmaier Excel.xlsx
LIEFERSCHEIN
Kunde Customer Adresse Adress
Adress Data Data
K/DB-Nr. 211 Contact
Preis/N M Gesamtpreis
Bio Erdbeer-Chilischokolade 3,05 € 20 61,09 € Bio Beuscherl 5,23 € 6 31,36 € Bio ChiliconCarne 5,98 € 15 89,77 € Bio Geschnetzeltes 5,23 € 15 78,41 €
Versand Brutto Versand Netto - €
Warenwert netto 10% 260,64 € Umsatzsteuer 10% 26,06 €
RECHNUNGSBETRAG BRUTTO 286,70 € Seite 1/1
2019/
Data
I have tried numerous times to use that data, either cleaning it up in memory or writing it to a TXT or CSV file and cleaning it up afterwards, but nothing works. It would already help a lot if I could at least write it to a TXT file and go from there, which is not elegant, but I am new to this and my options are limited. Ideally I would write it to a CSV file in cleaned-up form, add all the other invoices, and then use the data. That is what I am planning to do, but programming is hard. I have already started working on the GUI, but this data issue is holding me back.
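To make the cleanup step concrete, here is a minimal sketch of one possible approach: a regular expression that splits the product line from the output above into name, unit price, quantity, and line total, then writes the result as CSV. The names `ITEM_RE`, `to_float`, and `parse_items` are made up for illustration, and the pattern assumes every amount looks like `3,05 €` (decimal comma, no thousands separator) and that the layout matches the sample; other invoices may need adjustments.

```python
import csv
import io
import re

# Hypothetical pattern: product name, unit price, quantity, line total.
# Assumes amounts look like "3,05 €" (decimal comma, no thousands separator).
ITEM_RE = re.compile(
    r"(?P<name>\S.*?)\s+(?P<price>\d+,\d{2})\s*€\s+"
    r"(?P<qty>\d+)\s+(?P<total>\d+,\d{2})\s*€"
)


def to_float(amount):
    """Convert a German-style amount such as '61,09' to a float."""
    return float(amount.replace(",", "."))


def parse_items(text):
    """Return one dict per product entry found in the extracted text."""
    return [
        {
            "name": m.group("name"),
            "price": to_float(m.group("price")),
            "qty": int(m.group("qty")),
            "total": to_float(m.group("total")),
        }
        for m in ITEM_RE.finditer(text)
    ]


# Item line copied from the Tika output above.
sample = (
    "Bio Erdbeer-Chilischokolade 3,05 € 20 61,09 € Bio Beuscherl 5,23 € 6 31,36 € "
    "Bio ChiliconCarne 5,98 € 15 89,77 € Bio Geschnetzeltes 5,23 € 15 78,41 €"
)
items = parse_items(sample)

# Write to an in-memory buffer here; for a real file use
# open("items.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price", "qty", "total"])
writer.writeheader()
writer.writerows(items)
print(buffer.getvalue())
```

In practice, `sample` would be replaced by `raw['content']` from the Tika call. Converting the amounts to floats up front means the CSV columns can be summed later without further string handling.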
Also, I spent hours watching videos and trying to find a solution, but I couldn't get anything to run that would give me even approximately what I need. I promise I am not taking up your time without having searched myself first.
Ideally I would get one line of CSV for every invoice, with the values in separate cells, so I can add them up and make the development of our new small company more visible while teaching myself how to program. Many thanks!
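As a rough illustration of the "one line per invoice" idea, the sketch below pulls the customer number and the net, VAT, and gross totals out of the extracted text and writes them as a single CSV row. The function names `find_amount` and `invoice_row` and the column names are hypothetical, and the patterns assume the labels appear exactly as in the sample output (`K/DB-Nr.`, `Warenwert netto`, `Umsatzsteuer`, `RECHNUNGSBETRAG BRUTTO`).

```python
import csv
import io
import re

FIELDS = ["file", "customer_no", "net", "vat", "gross"]


def find_amount(pattern, text):
    """Return the first captured amount as a float, or None if not found."""
    m = re.search(pattern, text)
    return float(m.group(1).replace(",", ".")) if m else None


def invoice_row(text, source):
    """Build one summary row (a dict) for a single invoice."""
    customer = re.search(r"K/DB-Nr\.\s*(\d+)", text)
    return {
        "file": source,
        "customer_no": customer.group(1) if customer else None,
        "net": find_amount(r"Warenwert netto\s+\d+%\s+(\d+,\d{2})\s*€", text),
        "vat": find_amount(r"Umsatzsteuer\s+\d+%\s+(\d+,\d{2})\s*€", text),
        "gross": find_amount(r"RECHNUNGSBETRAG BRUTTO\s+(\d+,\d{2})\s*€", text),
    }


# Lines copied from the Tika output above.
sample_text = """K/DB-Nr. 211 Contact
Warenwert netto 10% 260,64 € Umsatzsteuer 10% 26,06 €
RECHNUNGSBETRAG BRUTTO 286,70 € Seite 1/1"""

row = invoice_row(sample_text, "2.pdf")

# In-memory buffer for the demo; to collect many invoices in one file,
# open it once and call writer.writerow() for each parsed PDF.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
print(out.getvalue())
```

Looping over all invoice PDFs and calling `invoice_row(raw['content'], filename)` for each would yield one row per invoice, which a GUI could then read and aggregate by week.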