Extract text from a folder with many PDFs with Python, pandas and Jupyter

Question

I have multiple directories containing many PDF documents. I would like to use Python to convert them to plain text, all in one file, so that I can search the text in the resulting text file, with a second column holding the reference link to the specific PDF file each text came from.

For a few PDFs in a single folder, I use the code from this answer: https://stackoverflow.com/a/66226629/7110233

import os, glob
from tika import parser 
from pandas import DataFrame

# What file extension to find, and where to look from
ext = "*.pdf"
PATH = "."

# Find all the files with that extension
files = []
for dirpath, dirnames, filenames in os.walk(PATH):
    files += glob.glob(os.path.join(dirpath, ext))

# Create a Pandas Dataframe to hold the filenames and the text
df = DataFrame(columns=("filename","text"))

# Process each file in turn, parsing with Tika and storing in the dataframe
for idx, filename in enumerate(files):
    data = parser.from_file(filename)
    text = data["content"]
    df.loc[idx] = [filename, text]

# For debugging, print what we found
print(df)

Unfortunately, when I run this on many more files, each with many pages, the separator enclosed in quotation marks is often misrecognized and columns are lost. How can I solve this problem? Thanks to whoever answers!
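
To make the problem concrete, here is a minimal sketch of what I think a robust export would look like. It assumes the column loss comes from commas, quotation marks and line breaks inside the extracted text when the table is written to a delimited file. It uses only the standard csv module with every field quoted. The helper names write_rows and read_rows and the sample rows are hypothetical, and treating a missing text as an empty string is also my assumption.

```python
import csv
import io


def write_rows(rows, handle):
    """Write (filename, text) rows, quoting every field."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
    writer.writerow(["filename", "text"])
    for filename, text in rows:
        # Assumption: a missing text (None) is stored as an empty string.
        writer.writerow([filename, text or ""])


def read_rows(handle):
    """Read the rows back, skipping the header line."""
    reader = csv.reader(handle)
    next(reader)  # skip the header
    return [row for row in reader]


# Hypothetical sample data standing in for the extracted PDF text.
rows = [
    ("docs/a.pdf", 'Line one\nLine two, with a comma and a "quote"'),
    ("docs/b.pdf", None),
]

# newline="" leaves line endings untouched, as the csv module expects.
buffer = io.StringIO(newline="")
write_rows(rows, buffer)
buffer.seek(0)
restored = read_rows(buffer)
```

With a real file, open it with newline="" and an explicit encoding in place of the in-memory buffer. As far as I know, pandas exposes the same quoting option through df.to_csv(..., quoting=csv.QUOTE_ALL), but I have not confirmed that this alone fixes my case.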
