Python: How to extract text from multiple PDFs into an Excel file

Question

I'm a complete beginner in Python. I'm trying to extract the text from multiple PDFs (stored in different subfolders of one parent folder) and paste it into an Excel file with:

A1: Name of the file
A2: text of the file contained in ONE cell

I've tried a solution like this one:


import pdfplumber
import pandas as pd
import os

def extract_pdf(pdf_path):
    linesOfFile = []
    with pdfplumber.open(pdf_path) as pdf:
        for pdf_page in pdf.pages:
            single_page_text = pdf_page.extract_text()
            for linesOfFile in single_page_text.split('\n'):
                linesOfFile.append(line)
                #print(linesOfFile)
    return linesOfFile


folder_with_pdfs = 'folder_path'
linesOfFiles = []
for pdf_file in os.listdir(folder_with_pdfs):
    if pdf_file.endswith('.pdf'):
        pdf_file_path = os.path.join(folder_with_pdfs, pdf_file)
        linesOfFile = extract_pdf(pdf_file_path)
        linesOfFiles.append(linesOfFile)
        
df = pd.DataFrame(linesOfFiles)
df.to_csv('test.csv')

Any help is appreciated

There are at least two errors: (1) it only handles one folder at a time; I have many subfolders, but with this code I can open only one. (2) I'm not sure the append method is the right way to build a single Excel cell with all the text; it also raises the error `'str' object has no attribute 'append'`. — Gabry, Jun 13 '22 at 09:17
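The `'str' object has no attribute 'append'` error most likely comes from the inner loop of `extract_pdf`: it reuses `linesOfFile` as the loop variable, so the list is replaced by a string, and `line` is never defined. Since the goal is one cell per file, a list is not needed at all. Below is a minimal sketch, assuming the page texts can simply be joined with newlines; the names `join_page_texts` and `extract_pdf_text` are hypothetical, and the `if text` guard is there because `extract_text()` may give nothing back for pages without a text layer (for example scanned pages).

```python
def join_page_texts(page_texts):
    """Join per-page strings into one string, skipping empty or missing pages."""
    return "\n".join(text for text in page_texts if text)


def extract_pdf_text(pdf_path):
    """Return the text of every page of one PDF as a single string."""
    # Imported here so join_page_texts can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # The generator is fully consumed by join_page_texts while the file is open.
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

With this, `extract_pdf_text(pdf_file_path)` gives a single string that can go into one cell, instead of a list of lines spread over many columns.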
Best to break down your two separate issues into narrower questions. In any case, for (1), I would check out `os.walk` [here](https://stackoverflow.com/questions/2212643/python-recursive-folder-read) — error404, Jun 13 '22 at 09:33
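Following the `os.walk` suggestion, here is a rough sketch of how the subfolder traversal could look. It is not a definitive implementation: `find_pdfs`, `build_rows`, `write_excel`, the `extract_text` parameter and the column names `file` and `text` are all hypothetical names chosen for illustration. `extract_text` stands for any function that takes a PDF path and returns its text as one string.

```python
import os


def find_pdfs(root_folder):
    """Yield the path of every .pdf file under root_folder, subfolders included."""
    for current_dir, _subdirs, file_names in os.walk(root_folder):
        for file_name in sorted(file_names):
            # lower() so that files named .PDF are picked up as well
            if file_name.lower().endswith(".pdf"):
                yield os.path.join(current_dir, file_name)


def build_rows(root_folder, extract_text):
    """Return one dict per PDF: the file name and its whole text as one string.

    extract_text is any callable mapping a PDF path to a string
    (hypothetical parameter; plug in a pdfplumber-based function).
    """
    return [
        {"file": os.path.basename(path), "text": extract_text(path)}
        for path in find_pdfs(root_folder)
    ]


def write_excel(rows, output_path):
    """Write the rows to an .xlsx file, one PDF per row."""
    # Imported here so the traversal helpers work without pandas installed.
    # to_excel also needs an Excel engine such as openpyxl.
    import pandas as pd

    pd.DataFrame(rows, columns=["file", "text"]).to_excel(output_path, index=False)
```

Something like `write_excel(build_rows('folder_path', my_extractor), 'test.xlsx')` would then put the file name in column A and the full text in column B, one row per PDF. Note that Excel limits a cell to 32,767 characters, so very long PDFs may not fit in a single cell.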


0 Answers