I am working on topic modeling tasks in Python and would like to extract text from annual/sustainability reports. My problem is that when a page is laid out in two columns, the extracted lines run across both columns, i.e., each extracted line joins a line from one column with the line beside it in the neighbouring column. How do I extract the lines exactly as they appear in the report? I have attached a snippet of the report and the lines extracted by the function.

Below is the function that I use:

# Function to download a PDF from a URL and extract its text:

from io import BytesIO

import pdfplumber
import requests

def converter(url):
    text = []
    req = requests.get(url)
    with pdfplumber.open(BytesIO(req.content)) as pdf:
        for page in pdf.pages:
            # Fall back to "" in case extract_text() yields None for a page without text
            text.append(page.extract_text() or "")
    return "\n".join(text)

The image is a snippet from the report I am extracting. The text in the report is divided into two columns, and extract_text mixes them up: it joins the lines at the same height in both columns and presents them as a single line.

Here are the first lines of the report as extracted (the start of the first column and the start of the second column merged together by the function):

\nOne of my first duties in 2019 was an interview When we embarked on a new strategy period \non the “Good Morning Norway” show to talk in 2016, I expressed a hope that AF would feel \nabout AF’s goal of doubling the percentage of just as close-knit when we hopefully surpass \nwomen

It would be helpful if I could extract the sentences exactly as they appear in the report.

pppery

1 Answer

This is based on a response by samkit-jain to an issue on the package.

The key is page.crop, which returns a version of the page restricted to a bounding box given as (x0, top, x1, bottom).

Assuming there is no header information, crop the page into two halves:

left = page.crop((0, 0, 0.5 * page.width, 0.9 * page.height))
right = page.crop((0.5 * page.width, 0, page.width, page.height))

Then extract text and concatenate:

l_text = left.extract_text()
r_text = right.extract_text()
text = l_text + " " + r_text

Of course, if a page on your report has a figure that spans both columns, that will be messed up by this approach, so you may have to customize this per page.
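The cropping steps above can be put together into a per-page helper and plugged into the question's download loop. This is only a sketch under a few assumptions: every page has equal-width columns, the page's bounding box starts at (0, 0), and there is no header or footer to trim. The names column_boxes, extract_page_by_columns and converter_by_columns are hypothetical helpers made up for illustration, not part of pdfplumber. Columns are joined with a newline here rather than a space, so the last line of one column does not run into the first line of the next.

```python
from io import BytesIO


def column_boxes(width, height, n_columns=2):
    """Split a page into equal-width vertical strips as (x0, top, x1, bottom) boxes."""
    # Use the exact page width for the last edge so no box can overshoot the page.
    edges = [width * i / n_columns for i in range(n_columns)] + [width]
    return [(edges[i], 0, edges[i + 1], height) for i in range(n_columns)]


def extract_page_by_columns(page, n_columns=2):
    """Extract text column by column, left to right, from a pdfplumber-style page."""
    parts = []
    for box in column_boxes(page.width, page.height, n_columns):
        # Fall back to "" in case extract_text() yields None for an empty region.
        parts.append(page.crop(box).extract_text() or "")
    return "\n".join(parts)


def converter_by_columns(url, n_columns=2):
    """Download a PDF and extract every page column by column."""
    # Imported here so the helpers above can be used without these packages installed.
    import pdfplumber
    import requests

    req = requests.get(url)
    with pdfplumber.open(BytesIO(req.content)) as pdf:
        return "\n".join(extract_page_by_columns(p, n_columns) for p in pdf.pages)
```

Pages where a figure or heading spans both columns would still need their own bounding boxes, so it may be worth keeping column_boxes separate and swapping in hand-picked boxes for those pages.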

Kaia