
I am working on a personal text-mining project with a data-mining tool that lets me run Python code.

I have several PDF documents in French from which I extract all the text. My code is below:

import dataiku
import pandas as pd
import fitz  # PyMuPDF

# Recipe input

folder = dataiku.Folder("pdfdocuments")  # the folder name
paths = folder.list_paths_in_partition()  # list all the PDF documents

# Core recipe
rows = []

for path in paths:
    # Download the PDF document into memory
    with folder.get_download_stream(path[1:]) as stream:
        data = stream.read()

    # Open the in-memory bytes as a PDF and concatenate the text of every page
    doc = fitz.open(stream=data, filetype="pdf")
    # get_text() is the current method name; older PyMuPDF releases spell it getText()
    text = "".join(page.get_text() for page in doc)

    rows.append({"path": path[1:], "text": text})

df = pd.DataFrame(rows, columns=["path", "text"])

# Recipe output

output = dataiku.Dataset("extractedTexts")
output.write_with_schema(df)

I would like to know whether it is possible to extract precisely the text of one chapter. For example, I want to extract the whole Introduction chapter, including its sub-chapters, from every PDF document. As another example, I would like to extract only the text of chapter 2 from every PDF. The difficulty is that the chapter I am targeting does not have the same name in all the PDF documents.

I am doing text classification, and I want to test a new classification on one particular chapter of all the PDFs.

JBE
talohsa
    `PDF` is a very complex and strange file format and, as far as I know, it does not store information about where chapters start. You will have to write your own code to recognize them by words, by font sizes (if your tool can report the font size), by the distance between lines (if your tool can report `(x,y)` positions), etc. – furas Jun 17 '21 at 22:19
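
The font-size idea in the comment above could be sketched as follows. This is only a heuristic under stated assumptions: the body text is taken to be the font size that covers the most characters, and anything noticeably larger is treated as a heading candidate. The names `find_heading_candidates`, `iter_spans` and the `min_ratio` threshold are hypothetical, chosen for illustration; `page.get_text("dict")` is the PyMuPDF call that exposes spans with their `size`.

```python
from collections import Counter


def find_heading_candidates(spans, min_ratio=1.15):
    """Return the texts whose font size is noticeably larger than the body size.

    spans: iterable of (text, font_size) pairs.
    min_ratio: assumed threshold; a span counts as a heading candidate when
    its size is at least min_ratio times the dominant (body) font size.
    """
    spans = list(spans)
    # Weight each font size by the number of characters set in it,
    # so the dominant size is the one used for running text.
    weight_by_size = Counter()
    for text, size in spans:
        weight_by_size[round(size, 1)] += len(text)
    if not weight_by_size:
        return []
    body_size = weight_by_size.most_common(1)[0][0]
    return [
        text.strip()
        for text, size in spans
        if text.strip() and size >= body_size * min_ratio
    ]


def iter_spans(page):
    """Yield (text, font_size) for every text span of a PyMuPDF page."""
    for block in page.get_text("dict")["blocks"]:
        # Image blocks have no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                yield span["text"], span["size"]
```

With a PyMuPDF document `doc`, `find_heading_candidates(iter_spans(doc[0]))` would list the candidates on the first page. Documents that set headings in bold at the body size, or that use large fonts for captions, will need extra rules.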

1 Answer


When the chapters are not defined by a structured outline (bookmarks) inside the PDF, your task is almost impossible without an extensive AI model built through extensive training.
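
For the favorable case where a PDF does carry a structured outline, a minimal sketch of chapter extraction might look like this. It assumes PyMuPDF, whose `Document.get_toc()` returns entries of the form `[level, title, page]` with 1-based page numbers; the function names and the regular expression argument are hypothetical. The range logic is kept separate from the PDF access so it can be checked on a plain list.

```python
import re


def chapter_page_ranges(toc, page_count, title_pattern):
    """Return (start, end) 1-based inclusive page ranges for matching entries.

    toc: list of [level, title, page, ...] outline entries.
    A chapter is assumed to end on the page before the next entry of the
    same or a higher level, so its sub-chapters are included.
    """
    pattern = re.compile(title_pattern, re.IGNORECASE)
    ranges = []
    for idx, entry in enumerate(toc):
        level, title, start = entry[:3]
        if not pattern.search(title):
            continue
        end = page_count  # default: the chapter runs to the end of the file
        for following in toc[idx + 1:]:
            if following[0] <= level:
                end = max(start, following[2] - 1)
                break
        ranges.append((start, end))
    return ranges


def extract_chapter_text(pdf_bytes, title_pattern):
    """Concatenate the text of every page belonging to the matching chapters."""
    import fitz  # PyMuPDF, imported here so the range logic works without it

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    parts = []
    for start, end in chapter_page_ranges(doc.get_toc(), doc.page_count, title_pattern):
        for page_index in range(start - 1, end):  # PyMuPDF pages are 0-based
            parts.append(doc[page_index].get_text())
    return "".join(parts)
```

Because chapter names differ between documents, a pattern such as `r"introduction|avant-propos"` could cover several spellings. The ranges are page-granular: when the next chapter starts in the middle of a page, the tail of the current chapter on that page is missed, and the first page may contain the end of the previous chapter.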

By coincidence, here is the latest PDF I have been looking at. I can see that it has 203 pages comprising 10 chapters (the word "chapter" is not used, but from experience I can presume that items 1-10 are chapters).

I can easily search for the "Table of Contents" and instantly see that the first 4 "chapters" all start on page 1. However, in real-world terms that is actually page 11 of the file, so I can very easily (mentally) add 10 to each page number for navigation.
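
The mental arithmetic described above can be written down as a small helper. This is a sketch, assuming the offset between printed and physical page numbers (10 in this example) has already been determined by hand for each document; both function names are hypothetical.

```python
def printed_to_index(printed_page, front_matter_pages):
    """Convert a page number printed in the table of contents to a 0-based page index.

    front_matter_pages: number of physical pages before printed page 1
    (10 in the example above, where printed page 1 is physical page 11).
    """
    if printed_page < 1:
        raise ValueError("printed page numbers start at 1")
    return printed_page + front_matter_pages - 1


def chapter_index_range(printed_start, printed_next_start, front_matter_pages):
    """Return the 0-based page indices to read for one chapter.

    The page on which the next chapter starts is included, because several
    chapters can share a page.
    """
    start = printed_to_index(printed_start, front_matter_pages)
    end = printed_to_index(printed_next_start, front_matter_pages)
    return range(start, max(start, end) + 1)
```

For example, `chapter_index_range(1, 5, 10)` covers physical pages 11 to 15. The text read from the shared first and last pages still has to be trimmed at the chapter headings.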


K J