I am working on a topic modeling task in Python and need to extract text from annual/sustainability reports. My problem is that when I extract a report whose pages have two columns, each extracted line runs across both columns, i.e., it joins a line from one column with the line at the same height in the neighbouring column to make a single sentence. How do I extract the lines exactly as they appear in the report? I have attached the version of the report and the lines extracted by the function.
Below is the function that I use:
import requests
import pdfplumber
from io import BytesIO

# Download the PDF from a URL and return its text, page by page
def converter(url):
    text = []
    req = requests.get(url)
    with pdfplumber.open(BytesIO(req.content)) as pdf:
        for page in pdf.pages:
            # Guard against None for pages with no extractable text
            text.append(page.extract_text() or "")
    return "\n".join(text)
Here is the start of the extracted text, where the function has merged the beginning of the first column with the beginning of the second column:
\nOne of my first duties in 2019 was an interview When we embarked on a new strategy period \non the “Good Morning Norway” show to talk in 2016, I expressed a hope that AF would feel \nabout AF’s goal of doubling the percentage of just as close-knit when we hopefully surpass \nwomen
It would be helpful if I could extract the sentences exactly as they are given in the report.
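To make the desired behaviour concrete, here is a minimal sketch of one possible direction, not a verified solution. It assumes every page has exactly two columns separated at the horizontal midpoint of the page, and that word boxes are available as dicts with "text", "x0", "x1" and "top" keys (the shape pdfplumber's page.extract_words() returns). The function name words_to_column_text and the line_tolerance parameter are hypothetical names introduced for illustration.

```python
def words_to_column_text(words, page_width, line_tolerance=3.0):
    """Rebuild text column by column from word boxes.

    Assumption: exactly two columns, split at the page's horizontal midpoint.
    `words` is a list of dicts with "text", "x0", "x1" and "top" keys.
    """
    midpoint = page_width / 2
    columns = {"left": [], "right": []}

    # Assign each word to a column based on the centre of its box
    for word in words:
        centre = (word["x0"] + word["x1"]) / 2
        side = "left" if centre < midpoint else "right"
        columns[side].append(word)

    column_texts = []
    for side in ("left", "right"):
        # Group words into lines: a word whose "top" is within
        # line_tolerance of the current line's first word joins that line
        lines = []
        for word in sorted(columns[side], key=lambda w: (w["top"], w["x0"])):
            if lines and abs(word["top"] - lines[-1][0]["top"]) <= line_tolerance:
                lines[-1].append(word)
            else:
                lines.append([word])

        # Within each line, order words left to right and join them
        text_lines = [
            " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
            for line in lines
        ]
        if text_lines:
            column_texts.append("\n".join(text_lines))

    # Left column first, then right column
    return "\n".join(column_texts)
```

Inside converter, this would replace the extract_text() call with something like words_to_column_text(page.extract_words(), page.width). Pages with full-width headings, tables, or three columns would break the midpoint assumption and need separate handling.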