
I want to loop through all the PDFs in a directory, extract the text from each one using PDFminer, and then write the output to a single CSV file. I am able to extract the text from each PDF individually by passing it to the function defined here. I am also able to get a list of all the PDF filenames in a given directory. But when I try to put the two together and write the results to a single CSV, I get a CSV with headers but no data.
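For reference, convert_pdf_to_txt is the function from the linked answer and is not reproduced here; a minimal stand-in using pdfminer.six's high-level API could look like this (the linked implementation may differ):

from pdfminer.high_level import extract_text

def convert_pdf_to_txt(path):
    # Minimal stand-in for the linked function; returns the PDF's full text as a single string.
    return extract_text(path)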

Here is my code:

import os
pdf_files = [name for name in os.listdir("C:\\My\\Directory\\Path") if name.endswith(".pdf")] #get all files in directory    
pdf_files_path = ["C:\\My\\Directory\\Path\\" + pdf_files[i] for i in range(len(pdf_files))] #add directory path

import pandas as pd
df = pd.DataFrame(columns=['FileName','Text'])

for i in range(len(pdf_files)):
    scraped_text = convert_pdf_to_txt(pdf_files_path[i])
    df.append({ 'FileName': pdf_files[i], 'Text': scraped_text[i]},ignore_index=True)

df.to_csv('output.csv')

The variables have the following values:

pdf_files: ['12280_2007_Article_9000.pdf', '12280_2007_Article_9001.pdf', '12280_2007_Article_9002.pdf', '12280_2007_Article_9003.pdf', '12280_2007_Article_9004.pdf', '12280_2007_Article_9005.pdf', '12280_2007_Article_9006.pdf', '12280_2007_Article_9007.pdf', '12280_2007_Article_9008.pdf', '12280_2007_Article_9009.pdf']

pdf_files_path: ['C:\\My\\Directory Path\\12280_2007_Article_9000.pdf', etc...]

df:
Empty DataFrame
Columns: [FileName, Text]
Index: []

Update: based on a suggestion from @AMC I checked the contents of scraped_text in the loop. For the Text column, it seems that I'm looping through the characters of the first PDF file rather than through each file in the directory. Also, the contents of the loop are not getting written to the DataFrame or the CSV. (A corrected version of the loop is sketched after the output below.)

12280_2007_Article_9000.pdf E
12280_2007_Article_9001.pdf a
12280_2007_Article_9002.pdf s
12280_2007_Article_9003.pdf t
12280_2007_Article_9004.pdf  
12280_2007_Article_9005.pdf A
12280_2007_Article_9006.pdf s
12280_2007_Article_9007.pdf i
12280_2007_Article_9008.pdf a
12280_2007_Article_9009.pdf n
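
Both symptoms have straightforward causes: scraped_text[i] takes the i-th character of the extracted string rather than the whole text, and DataFrame.append returns a new DataFrame instead of modifying df in place, so its result is discarded. A minimal corrected sketch of the loop, keeping the original names and building the DataFrame from a list of rows:

import pandas as pd

rows = []
for name, path in zip(pdf_files, pdf_files_path):
    scraped_text = convert_pdf_to_txt(path)  # full text of this PDF, not a single character
    rows.append({'FileName': name, 'Text': scraped_text})

# Build the DataFrame once at the end; DataFrame.append returned a new object
# (and is removed in pandas 2.x), so collecting rows in a list is simpler.
df = pd.DataFrame(rows, columns=['FileName', 'Text'])
df.to_csv('output.csv', index=False)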
b00kgrrl

1 Answer


I guess you don't need pandas for that. You can make it simpler by using the csv module from the standard library.

Another thing that can be improved, if you are using Python 3.4+, is to replace os with pathlib.

Here is an almost complete example:

import csv
from pathlib import Path


folder = Path('c:/My/Directory/Path')
csv_file = Path('c:/path/to/output.csv')

# newline='' is recommended by the csv module, and encoding='utf-8' avoids
# UnicodeEncodeError on Windows, where cp1252 is the default.
with csv_file.open('w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)

    # Header row
    writer.writerow(['FileName', 'Text'])

    for pdf_file in folder.glob('*.pdf'):
        # Replace newlines so each record stays on a single line.
        pdf_text = convert_pdf_to_txt(pdf_file).replace('\n', '|')
        writer.writerow([pdf_file.name, pdf_text])

Another thing to bear in mind is to make sure pdf_text is a single line, or else your CSV file will be awkward to work with. One way to work around that is to pick an arbitrary character to use in place of the newline characters. If you pick the pipe character, for example, then you can do something like this prior to writer.writerow (note that str.replace returns a new string, so the result has to be assigned back):

pdf_text = pdf_text.replace('\n', '|')

It is not meant to be a complete example but a starting point. I hope it helps.

accdias
  • When I try that, I get a permissions error. [Errno 13] Permission denied – b00kgrrl Feb 29 '20 at 13:00
  • @b00kgrrl most permission denied issues can be solved by running the command with `sudo` – nz_21 Feb 29 '20 at 13:06
  • I'm not sure what's going on with the directory permissions, but you were right about the newline characters. That's what seems to be breaking the CSV. – b00kgrrl Feb 29 '20 at 13:15
  • I double checked the directory and it's correct. When I created an abbreviated version of the code for the purpose of posting a question, I inadvertently introduced an inconsistency. – b00kgrrl Feb 29 '20 at 13:21
  • As you requested, I've posted the full error message – b00kgrrl Feb 29 '20 at 13:33
  • Answer updated to work around `UnicodeEncodeError`. Seems like Windows uses cp1252 by default when opening a file for writing. The workaround is to explicitly open the file with `encoding='utf-8'` instead. – accdias Feb 29 '20 at 14:18
  • That worked, thanks! I'm still having issues with commas and newlines, but I can figure that out. – b00kgrrl Feb 29 '20 at 14:22
  • I'm glad to help. – accdias Feb 29 '20 at 14:22