
I am reading a PDF that contains a lot of tables, and I want to retrieve all the data in that file and create a new PDF from it. Each table has a name, so I know exactly what to take. The only problem is that there is sometimes a bunch of data I do not need between the tables. Here is what the file looks like when I am looping through the lines:

bunchofdata bunchofdata bunchofdata bunchofdata
bunchofdata bunchofdata bunchofdata bunch of data
bunchofdata bunchofdata bunchofdata bunchofdata
Orders
Online orders cards $3450
Shop orders cash $2108.93
Shop cash orders  818.23
total orders amount ----             

bunchofdata bunchofdata bunchofdata bunchofdata
bunchofdata bunchofdata bunchofdata bunchofdata
bunchofdata bunchofdata bunchofdata bunchofdata

Transactions
cus_id order number date total
data   data  data   data 345
data   data  data   data 873
data   data  data   data 823
end - of - transaction

How do I loop so that I literally:

- take everything below 'Orders'
- stop at the 'total orders' line
- take everything below 'Transactions'
- stop at the 'end - of - transaction' line

I have something similar to this:

# retrieving from the order table
revenue = {}
for line in pdf_parser.split('\n'):
    if line.startswith('Online orders'):
        # split off the amount that follows the '$' sign
        parts = line.split('$')
        if len(parts) == 2:
            revenue['Online orders'] = parts[1]

But with this, I'm only taking one line (the one I am referring to).

Here is how I am reading the file:

from tika import parser

pdf = parser.from_file('myfile.pdf')
pdf_parser = pdf['content']

The line above does not return an iterator but one massive string. I tried putting the lines in a list, but I still get an error saying that a list object is not an iterator. What I mean is: instead of iterating through pdf_parser as above, I appended all of its elements to a list called container, and I still can't use next() on it.

Herc01

1 Answer


This is a solution I use for such issues, demonstrated on a text file rather than a PDF:

data = []
with open('test.txt', 'r') as fin:
    for line in fin:
        if 'Orders' in line:
            # skip the table name, then collect until the end marker
            line = next(fin)
            while 'total orders' not in line:
                data.append(line)
                line = next(fin)
        if 'Transactions' in line:
            # same pattern for the second table
            line = next(fin)
            while 'end - of - transaction' not in line:
                data.append(line)
                line = next(fin)

I assume you can apply it as is to your PDF. The idea is that you read the file line by line, and when you encounter Orders or Transactions you collect everything until you reach the line that marks the end of the block you want to store. You achieve this with a while loop that advances to the following lines using next(). Note that next() raises StopIteration if the file ends before the end marker is found.

This means your blocks need to start and end in a hardcoded way. It also doesn't save the line at the block end, but that can be easily adjusted.
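
As a rough sketch of how the hardcoded markers and the dropped end line could be adjusted, the same logic can be wrapped in a function that takes the markers as arguments. The name collect_block and the include_end flag are made up for this illustration and are not part of the code above; the sample lines are taken from the question.

```python
def collect_block(lines, start, end, include_end=False):
    """Collect the lines after the first one containing `start`,
    up to the next one containing `end`."""
    it = iter(lines)
    block = []
    for line in it:
        if start in line:
            # keep consuming the same iterator until the end marker shows up
            for line in it:
                if end in line:
                    if include_end:
                        block.append(line)
                    return block
                block.append(line)
    # start marker not found, or input ended before the end marker
    return block


sample = [
    "bunchofdata bunchofdata",
    "Orders",
    "Online orders cards $3450",
    "Shop orders cash $2108.93",
    "total orders amount ----",
    "bunchofdata bunchofdata",
]
orders = collect_block(sample, "Orders", "total orders")
orders_with_end = collect_block(sample, "Orders", "total orders", include_end=True)
```

Because the inner loop is a plain for loop rather than repeated next() calls, this version simply returns what it has collected if the end marker never appears.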

Another way would be to load the whole file at once, then loop through the file and collect the indices of lines that have the block boundaries (e.g. Orders and total orders). Then you could just fetch the lines between the collected indices. This is a good solution if your file is not too large for your memory.
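
A minimal sketch of this index-based approach, assuming the whole text fits in memory as a list of lines. The helper name slice_between is hypothetical, and the sample text is a shortened version of the one in the question.

```python
def slice_between(lines, start, end):
    """Return the lines strictly between the first line containing `start`
    and the next line containing `end`."""
    # index of the line that opens the block
    start_idx = next((i for i, line in enumerate(lines) if start in line), None)
    if start_idx is None:
        return []
    # index of the first end marker after the start; fall back to end of input
    end_idx = next(
        (i for i in range(start_idx + 1, len(lines)) if end in lines[i]),
        len(lines),
    )
    return lines[start_idx + 1:end_idx]


text = """bunchofdata bunchofdata
Transactions
cus_id order number date total
data   data  data   data 345
end - of - transaction
bunchofdata bunchofdata"""

all_lines = text.splitlines()
transactions = slice_between(all_lines, "Transactions", "end - of - transaction")
```

Here the matching is case-sensitive, so the start marker 'Transactions' does not accidentally match the 'end - of - transaction' line.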


To use the parser approach you need to turn the rawList you get from splitting into an iterator:

from tika import parser

rawText = parser.from_file('test.pdf')
rawList = rawText['content'].splitlines()
# a list is iterable but is not an iterator; iter() makes next() usable
iter_list = iter(rawList)

data = []
for line in iter_list:
    if 'Orders' in line:
        line = next(iter_list)
        while 'total orders' not in line:
            data.append(line)
            line = next(iter_list)
    if 'Transactions' in line:
        line = next(iter_list)
        while 'end - of - transaction' not in line:
            data.append(line)
            line = next(iter_list)

The rest of the code follows the same logic as before. The parsing snippet is from here and the iterator details can be found here.
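
To illustrate why the list from the question could not be used with next() directly, here is a small sketch using made-up sample lines. The exact wording of the error message may differ between Python versions.

```python
lines = ["Orders", "Online orders cards $3450"]

# A list is iterable, but it is not itself an iterator
try:
    next(lines)
    error = None
except TypeError as exc:
    error = str(exc)

# iter() returns an iterator over the list, which next() accepts
it = iter(lines)
first = next(it)
second = next(it)
# a default value avoids StopIteration once the iterator is exhausted
last = next(it, None)
```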

atru