I am doing a personal text-mining project in a data-mining tool (Dataiku) in which I can run Python code.
I have several PDF documents in French from which I extract all the text; you can see my code below:
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import fitz  # PyMuPDF

# Recipe input
folder = dataiku.Folder("pdfdocuments")   # the managed folder name
paths = folder.list_paths_in_partition() # list all the PDF documents

# Core recipe
df = pd.DataFrame(columns=['path', 'text'])
for i, path in enumerate(paths):
    with folder.get_download_stream(path[1:]) as stream:  # strip the leading "/" and stream the PDF
        data = stream.read()                               # raw bytes of the document
    doc = fitz.open("pdf", data)                           # open the bytes as a PDF
    text = ""
    for page in doc:
        text += page.get_text()                            # page.getText() in older PyMuPDF versions
    df.loc[i] = [path[1:], text]

# Recipe output
output = dataiku.Dataset("extractedTexts")
output.write_with_schema(df)
The problem is that since I am extracting all the text contained in the PDF, I also get the header and footer text. On a single document, I can clean it up with a regex in the tool, roughly like the snippet below. But with several PDF documents that all have different styles, it becomes difficult to clean this up with regex alone...
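For one document the header and footer lines are predictable, so something like this works. The patterns here are only placeholders standing in for what the real header and footer look like in that document, and this Python version is just the equivalent of what I do in the tool:

import re

# Placeholder patterns: in reality they match the actual header/footer text of that one document
header_pattern = re.compile(r"^Rapport annuel 2021.*$\n?", flags=re.MULTILINE)
footer_pattern = re.compile(r"^Page \d+ (sur|/) \d+.*$\n?", flags=re.MULTILINE)

clean_text = footer_pattern.sub("", header_pattern.sub("", text))

This obviously breaks as soon as the next document uses a different header or a different page-numbering style.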
So I am looking for a way to remove the header and footer at the moment I extract the text from the PDF documents, maybe something along the lines of the position-based sketch below, but I do not know whether that is the right approach.
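For instance, I could imagine using PyMuPDF's block extraction and dropping blocks that sit near the top or bottom of each page instead of extracting plain text. This is only a sketch of the idea; the 10% top/bottom margin is just a guess and I do not know if it is reliable across documents with different layouts:

# Sketch: skip blocks that lie entirely in the top or bottom margin of the page
def page_text_without_margins(page, margin_ratio=0.10):
    height = page.rect.height
    top_limit = height * margin_ratio
    bottom_limit = height * (1 - margin_ratio)
    kept = []
    for x0, y0, x1, y1, block_text, block_no, block_type in page.get_text("blocks"):
        if y1 < top_limit or y0 > bottom_limit:
            continue  # likely a header or footer block
        kept.append(block_text)
    return "".join(kept)

Is there a more robust way to detect and remove headers and footers when the documents all have different styles?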