looping through pdf files with tabulizer in python

Question

I'm having a hard time getting a piece of code to work. I want to loop through the PDF files in a folder, extract whatever the tabula package identifies as tables into dataframes, and write all the tables from each PDF into one CSV file.

I looked at this post (and several others) but I still have problems getting it to work. The script seems to loop through the files and extract some tables, but it doesn't seem to iterate over the files properly, and I can't get it to write all the dataframes into a CSV file. It just writes the last one to the CSV.

This is what I have so far. Any help would be greatly appreciated, specifically, how to loop correctly through the files and to write all tables from one pdf into one csv file. I'm pretty stuck...

import os
import tabula

pdf_folder = 'C:\\PDF extract\\pdf\\'
csv_folder = 'C:\\PDF extract\\csv\\'

paths = [pdf_folder + fn for fn in os.listdir(pdf_folder) if fn.endswith('.pdf')]
for path in paths:
    listdf = tabula.read_pdf(path, encoding='latin1', pages='all', nospreadsheet=True, multiple_tables=True)
    path = path.replace('pdf', 'csv')
    for df in listdf:
        df.to_csv(path, index=False)

Shouldn't you be using `csv_folder` somewhere? – Scott Hunter, Jun 09 '17 at 18:24

SeaMonkey · Answer 1 · 2017-06-09T18:40:54.190

1

As @Scott Hunter mentioned, you are not using `csv_folder`.

Also, I think you are overwriting the created .csv files:

for df in listdf: (df.to_csv(path, index = False))

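The path variable stays the same on every iteration of the for-loop, so each call to to_csv overwrites the file written by the previous one and only the last table survives.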
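One way around the overwrite, as a minimal sketch: open the output file once per PDF and let every table write into that same handle. `write_tables` is a hypothetical helper name, not part of tabula or pandas. It only assumes that each table has a `to_csv` method accepting an open file object, which pandas DataFrames do.

```python
def write_tables(tables, out_path):
    """Write every table into one CSV file, separated by blank lines.

    `tables` is assumed to be a list of pandas DataFrames, such as the
    list that tabula.read_pdf(..., multiple_tables=True) returns.
    """
    # Open the file once, so later tables are added after earlier ones
    # instead of replacing them.
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        for i, table in enumerate(tables):
            if i > 0:
                handle.write("\n")  # blank line between tables
            table.to_csv(handle, index=False)
    return len(tables)
```

Unlike concatenating the dataframes, this keeps each table's own header row, which may be easier to read when the tables in one PDF have different columns.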

Edit: You should probably try to do something like this:

import os
import pandas as pd
import tabula

pdf_folder = 'C:\\PDF extract\\pdf\\'
paths = [pdf_folder + fn for fn in os.listdir(pdf_folder) if fn.endswith('.pdf')]

for path in paths:
    # One DataFrame per table that tabula detects in this PDF
    listdf = tabula.read_pdf(path, encoding='latin1', pages='all', nospreadsheet=True, multiple_tables=True)
    path = path.replace('pdf', 'csv')
    # Stack all tables and write them to the CSV in a single call
    df_concat = pd.concat(listdf)
    df_concat.to_csv(path, index=False)
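Note that `path.replace('pdf', 'csv')` changes every lowercase `pdf` in the path, not just the folder name and the extension, so a file called `my_pdf_report.pdf` would end up as `my_csv_report.csv`. A hedged alternative that actually uses `csv_folder`, as the comment on the question suggests, is to build the output path from the file name. `csv_path_for` and `pdf_files` are hypothetical helper names used only for illustration.

```python
from pathlib import Path


def csv_path_for(pdf_path, csv_folder):
    """Return <csv_folder>/<pdf file name with a .csv suffix>."""
    pdf_path = Path(pdf_path)
    # with_suffix swaps only the final extension; the rest of the name
    # (even if it contains "pdf") is left alone.
    return Path(csv_folder) / pdf_path.with_suffix(".csv").name


def pdf_files(pdf_folder):
    """List the PDF files in a folder, in sorted order."""
    return sorted(p for p in Path(pdf_folder).iterdir()
                  if p.suffix.lower() == ".pdf")
```

In the loop, `path = csv_path_for(path, csv_folder)` would then take the place of `path = path.replace('pdf', 'csv')`.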

edited Jun 09 '17 at 18:40

answered Jun 09 '17 at 18:33


Because they all come from the same PDF; probably means to *add* each to the file, but as you say, is overwriting instead. – Scott Hunter Jun 09 '17 at 18:34
Thanks, this indeed seems to work. Unfortunately, Tabula isn't very accurate at (or capable of) telling tables from non-tables in scientific papers. A lot of regular text got put into tables by Tabula, but it doesn't seem to miss any tables. So manual editing is still needed. Therefore, instead of CSV I'm using Excel now, which is easier to manipulate. – CMorgan Jun 27 '17 at 20:04
