
I am working on a personal text mining project in a data mining tool that lets me run Python code.

I have several PDF documents in French, and I extract all of their text with the code below:

import dataiku
import pandas as pd
import fitz  # PyMuPDF

# Recipe input

folder = dataiku.Folder("pdfdocuments")  # the folder name
paths = folder.list_paths_in_partition()  # list all the PDF documents

# Core recipe
rows = []

for path in paths:
    relative_path = path[1:]  # drop the first character of the listed path
    with folder.get_download_stream(relative_path) as stream:
        data = stream.read()  # raw PDF bytes, held in memory
    doc = fitz.open(stream=data, filetype="pdf")  # open the PDF from memory
    # get_text() is the name in recent PyMuPDF releases; older ones call it getText()
    text = "".join(page.get_text() for page in doc)
    doc.close()
    rows.append({"path": relative_path, "text": text})

df = pd.DataFrame(rows, columns=["path", "text"])

# Recipe output

output = dataiku.Dataset("extractedTexts")
output.write_with_schema(df)

The problem is that, because I extract all the text in each PDF, the output also includes the header and footer text. For a single document, I can clean that up with a regex in the tool. But with several PDF documents in different styles, it becomes difficult to clean up with regex alone.

So I am looking for a way to remove the header and footer when I extract all the text from the pdf documents.
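
To make the goal concrete, here is a rough sketch of the kind of position-based filtering I have in mind, as opposed to regex. It assumes the block tuples that PyMuPDF returns from `page.get_text("blocks")`, in the form `(x0, y0, x1, y1, text, block_no, block_type)`. The function name `strip_header_footer` and the 50-point margins are placeholders I made up, and the margins would probably need tuning per document style.

```python
def strip_header_footer(blocks, page_height, top_margin=50.0, bottom_margin=50.0):
    """Keep only text blocks that lie outside the header and footer bands.

    blocks: iterable of (x0, y0, x1, y1, text, block_no, block_type) tuples,
            the shape PyMuPDF returns from page.get_text("blocks").
    page_height: page height in points (page.rect.height in PyMuPDF).
    top_margin / bottom_margin: assumed band heights in points (hypothetical
            defaults, not values taken from any particular document).
    """
    kept = []
    footer_start = page_height - bottom_margin
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        if block_type != 0:
            continue  # skip image blocks, keep text blocks only
        if y1 <= top_margin:
            continue  # block sits entirely inside the header band
        if y0 >= footer_start:
            continue  # block sits entirely inside the footer band
        kept.append(text)
    return "".join(kept)


# Small synthetic page (A4 height, 842 points) to show the behaviour.
sample_blocks = [
    (40, 10, 500, 30, "Company header\n", 0, 0),
    (40, 100, 500, 400, "Body paragraph\n", 1, 0),
    (40, 810, 500, 830, "Page 1\n", 2, 0),
]
body_text = strip_header_footer(sample_blocks, page_height=842)
```

In the loop above, this would be called once per page with `page.get_text("blocks")` and `page.rect.height`. The weakness is obvious: fixed margins will not fit every layout, which is exactly the part I am unsure how to handle.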

JBE
talohsa
