I am reading a PDF that has a lot of tables, and I want to retrieve all the data in that file and recreate a new PDF from it. Each table has a name, so I know exactly what to take. The only problem is that there is sometimes a bunch of data I do not need between the tables. Here is what the file looks like when I am already looping through the lines:
bunchofdata bunchofdata bunchofdata bunchofdata
bunchofdata bunchofdata bunchofdata bunch of data
bunchofdata bunchofdata bunchofdata bunchofdata
Orders
Online orders cards $3450
Shop orders cash $2108.93
Shop cash orders 818.23
total orders amount ----
bunchofdata bunchofdata bunchofdata bunchofdata
bunchofdata bunchofdata bunchofdata bunchofdata
bunchofdata bunchofdata bunchofdata bunchofdata
Transactions
cus_id order number date total
data data data data 345
data data data data 873
data data data data 823
end - of - transaction
How do I loop by literally saying:
- take everything below Orders
- stop at startswith('total orders')
- take everything below Transactions
- stop at startswith('end - of - transaction')
I have something similar to this:
# retrieving from the order table
revenue = {}
for line in pdf_parser.split('\n'):
    if line.startswith('Online orders'):
        parts = line.split('$')  # e.g. ['Online orders cards ', '3450']
        if len(parts) == 2:
            revenue['Online orders'] = parts[1]
But with this, I'm only taking one line (the line I am referring to)
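A minimal sketch of the "take everything between the table name and its last line" idea, assuming each table starts on a line equal to its name and stops at a line beginning with a known prefix. The function name `extract_sections`, the `markers` mapping, and the shortened `sample` text are hypothetical and only there for illustration:

```python
def extract_sections(text, markers):
    """Collect the lines between each start marker and its stop marker.

    markers maps a table name (the start line) to the prefix of the
    line that closes that table.
    """
    sections = {}
    current = None  # name of the table being read, None between tables
    for raw in text.split('\n'):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if current is None:
            if line in markers:
                # a table name: start collecting from the next line
                current = line
                sections[current] = []
        elif line.startswith(markers[current]):
            # stop marker reached: go back to ignoring lines
            current = None
        else:
            sections[current].append(line)
    return sections


# Shortened stand-in for the text extracted from the PDF
sample = """bunchofdata bunchofdata
Orders
Online orders cards $3450
Shop orders cash $2108.93
total orders amount ----
bunchofdata bunchofdata
Transactions
cus_id order number date total
data data data data 345
end - of - transaction
"""

markers = {'Orders': 'total orders', 'Transactions': 'end - of - transaction'}
tables = extract_sections(sample, markers)
print(tables['Orders'])
```

The single `current` variable acts as a state flag, so the unwanted data between tables is skipped without needing `next()` at all.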
Here is how I am reading the file:
from tika import parser

pdf = parser.from_file('myfile.pdf')
pdf_parser = pdf['content']
The line above does not return an iterator but one massive string. I have tried putting it in a list, but I still get the "list object is not an iterator" error. What I mean is that, instead of iterating through pdf_parser as above, I appended all the elements of pdf_parser to a list called container, and I still cannot use next().
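For reference, a list is iterable but is not itself an iterator, which is why `next()` rejects it; wrapping the list in `iter()` gives an object that `next()` accepts. A small sketch, where the contents of `container` are made up to mirror the Orders table above:

```python
# Hypothetical contents, standing in for the lines appended from pdf_parser
container = ['Orders', 'Online orders cards $3450', 'total orders amount ----']

lines = iter(container)  # next(container) would raise TypeError
collected = []
for line in lines:
    if line == 'Orders':
        # pull the following lines from the same iterator until the stop line
        nxt = next(lines, None)
        while nxt is not None and not nxt.startswith('total orders'):
            collected.append(nxt)
            nxt = next(lines, None)

print(collected)
```

Passing `None` as the second argument to `next()` avoids a `StopIteration` if the stop line is missing.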