I am working on a personal text mining project in a data mining tool (Dataiku) that lets me run Python code.
I have several PDF documents in French from which I extract all the text; you can see my code below:
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
import fitz
# Recipe input
folder = dataiku.Folder("pdfdocuments") # the folder name
paths = folder.list_paths_in_partition()  # list all the PDF documents in the folder
# Core recipe
df = pd.DataFrame(columns=['path', 'text'])
for i, path in enumerate(paths):
    with folder.get_download_stream(path[1:]) as stream:  # strip the leading '/' and stream the PDF
        data = stream.read()                               # raw bytes of the PDF
    doc = fitz.open("pdf", data)                           # open the in-memory PDF
    text = ""
    for page in doc:
        text += page.get_text()                            # extract the text of each page
    df.loc[i] = [path[1:], text]
# Recipe output
output = dataiku.Dataset("extractedTexts")
output.write_with_schema(df)
I would like to know whether it is possible to extract precisely all the text of one chapter. For example, I would like to extract the whole Introduction chapter, including its sub-chapters, from all the PDF documents; or, as another example, only the text of chapter 2 from all the PDFs. The difficulty is that the chapter I am targeting does not have the same title in every document.
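To make my question more concrete, here is the kind of sketch I have in mind, based on the PDF's table of contents (PyMuPDF's get_toc()). It assumes every PDF actually has bookmarks and that the chapter headings are the level-1 entries; the function name extract_chapter and the keyword list are just placeholders I made up, not a working solution:

def extract_chapter(pdf_bytes, title_keywords=("introduction",)):
    """Return the text of the first level-1 chapter whose bookmark title
    contains one of title_keywords (keywords are an assumption, to adapt per document)."""
    doc = fitz.open("pdf", pdf_bytes)
    toc = doc.get_toc()  # list of [level, title, page number (1-based)]
    # keep only level-1 bookmarks, assumed to be the chapter headings
    chapters = [(i, title) for i, (level, title, page) in enumerate(toc) if level == 1]
    for pos, (toc_idx, title) in enumerate(chapters):
        if not any(k.lower() in title.lower() for k in title_keywords):
            continue
        start = toc[toc_idx][2] - 1           # 0-based index of the chapter's first page
        if pos + 1 < len(chapters):
            next_idx = chapters[pos + 1][0]
            end = toc[next_idx][2] - 1        # first page of the next chapter
        else:
            end = doc.page_count              # last chapter runs to the end of the document
        end = max(end, start + 1)             # guard: chapter contained in a single page
        # page-level granularity: text before/after the headings on boundary pages is included
        return "".join(doc[p].get_text() for p in range(start, end))
    return ""                                 # no matching chapter found

Inside my loop I would then call something like extract_chapter(data, ("introduction",)) instead of concatenating every page, but I do not know if this is the right approach, especially when the chapter titles differ between documents.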
I am doing text classification and I want to test a new classification on one particular chapter of all the PDFs.