
I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
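For reference, convert_pdf_to_txt is the function from the linked answer and is not reproduced here; a minimal stand-in using pdfminer.six's high-level API could look like this (the linked implementation may differ):

from pdfminer.high_level import extract_text

def convert_pdf_to_txt(path):
    # Minimal stand-in for the linked function; returns the PDF's full text as a single string.
    return extract_text(path)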

Here is my code:

import os
pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")] #get all files in directory    
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))] #add directory path

import pandas as pd
df = pd.DataFrame(columns=['FileName','Text'])

for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({ 'FileName': pdf_files[i], 'Text': scraped_text[i]},ignore_index=True)

df.to_csv('output.csv')

The variables have the following values:

pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']

pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]

df:
Empty DataFrame
Columns: [FileName, Text]
Index: []

Update: based on a suggestion from @AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters of the first PDF file rather than through each file in the directory. Also, the contents of the loop are not getting written to the DataFrame or the CSV. (A corrected version of the loop is sketched after the output below.)

12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf  
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
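
Both symptoms have straightforward causes: scraped_text[i] takes the i-th character of the extracted string rather than the whole text, and DataFrame.append returns a new DataFrame instead of modifying df in place, so its result is discarded. A minimal corrected sketch of the loop, keeping the original names and building the DataFrame from a list of rows:

import pandas as pd

rows = []
for name, path in zip(pdf_files, pdf_files_path):
    scraped_text = convert_pdf_to_txt(path)  # full text of this PDF, not a single character
    rows.append({'FileName': name, 'Text': scraped_text})

# Build the DataFrame once at the end; DataFrame.append returned a new object
# (and is removed in pandas 2.x), so collecting rows in a list is simpler.
df = pd.DataFrame(rows, columns=['FileName', 'Text'])
df.to_csv('output.csv', index=False)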
b00kgrrl

1 Answer


I guess you don't need pandas for that. You can make it simpler by using the csv module from the standard library.

Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.

Here is an almost complete example:

import csv
from pathlib import Path


folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')

# newline='' is recommended by the csv module, and encoding='utf-8' avoids
# UnicodeEncodeError on Windows, where cp1252 is the default.
with csv_file.open('w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)

    # Header row
    writer.writerow(['FileName', 'Text'])

    for pdf_file in folder.glob('*.pdf'):
        # Replace newlines so each record stays on a single line.
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])

Another thing to bear in mind is to make sure pdf_text is a single line, or else your CSV file will be awkward to work with. One way to work around that is to pick an arbitrary character to use in place of the newline characters. If you pick the pipe character, for example, then you can do something like this prior to writer.writerow (note that str.replace returns a new string, so the result has to be assigned back):

pdf_text = pdf_text.replace('\n', '|')

It is not meant to be a complete example but a starting point. I hope it helps.

accdias
  • When I try that, I get a permissions error. [Errno 13] Permission denied – b00kgrrl Feb 29 '20 at 13:00
  • @b00kgrrl most permission denied issues can be solved by running the command with `sudo` – nz_21 Feb 29 '20 at 13:06
  • I'm not sure what's going on with the directory permissions, but you were right about the newline characters. That's what seems to be breaking the CSV. – b00kgrrl Feb 29 '20 at 13:15
  • I double checked the directory and it's correct. When I created an abbreviated version of the code for the purpose of posting a question, I inadvertently introduced an inconsistency. – b00kgrrl Feb 29 '20 at 13:21
  • As you requested, I've posted the full error message – b00kgrrl Feb 29 '20 at 13:33
  • Answer updated to work around `UnicodeEncodeError`. Seems like Windows uses cp1252 by default when opening a file for writing. The workaround is to explicitly open the file with `encoding='utf-8'` instead. – accdias Feb 29 '20 at 14:18
  • That worked, thanks! I'm still having issues with commas and newlines, but I can figure that out. – b00kgrrl Feb 29 '20 at 14:22
  • I'm glad to help. – accdias Feb 29 '20 at 14:22