How can I extract narrative sections of PDF annual reports for text analysis purposes?

Question

I would like to compare SFCR reports and IFRS reports of insurance companies using the FOG index as part of my bachelor thesis. The reports are provided in PDF format.

I want to work with the Fathom package in Perl, but for this I need the narrative areas of the financial statements in txt format. Do you have an idea how this could work without me having to copy everything over manually?

Thanks in advance!

Can you give an example of the reports, e.g. provide a link to a PDF file? — Håkon Hægland, Nov 10 '20 at 21:55
Of course! You can find the integrated Annual Report as well as the SFCR Report at the following link: https://www.aegon.com/investors/annual-reports/ — kinku, Nov 10 '20 at 22:54
Thanks, and which part of the report do you want to extract? For example, for the linked `aegon-integrated-annual-report-2019.pdf`, which pages? — Håkon Hægland, Nov 10 '20 at 23:03
Hey Håkon, I want to extract all narrative areas as raw text. In other words, all tables, figures, headings and images should be removed. — kinku, Nov 10 '20 at 23:12
https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf — brian d foy, Nov 11 '20 at 05:38

score 1 · Answer 1 · answered Nov 10 '20 at 23:27

The Python module pdfminer can be used to extract all the text, including the text inside figures and tables:

$ pip install pdfminer
$ qpdf --decrypt --password='' report.pdf report2.pdf
$ pdf2txt.py -o report2.txt report2.pdf

This saves the extracted text to report2.txt. Note that I used the sample PDF file aegon-integrated-annual-report-2019.pdf. This file turned out to be encrypted and pdf2txt.py refused to process it, but luckily qpdf was able to decrypt it as shown above.
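
Since pdf2txt.py also outputs table cells, headings and figure labels, the text still needs filtering before it goes to Fathom. The following is only a rough sketch of one possible heuristic, not a tested solution for these reports: it assumes the extracted text separates layout blocks with blank lines, and it keeps a block only if it is reasonably long, contains few digits, and ends with sentence punctuation. The function names and the thresholds `min_words` and `max_digit_ratio` are made up for illustration and would need tuning against the actual files.

```python
import re


def is_narrative(block, min_words=8, max_digit_ratio=0.2):
    """Heuristic guess whether a text block is running prose.

    The thresholds are illustrative assumptions, not validated values.
    """
    text = " ".join(block.split())  # collapse line breaks and extra spaces
    words = text.split()
    if len(words) < min_words:
        return False  # likely a heading, caption or page number
    chars = [c for c in text if not c.isspace()]
    digit_ratio = sum(c.isdigit() for c in chars) / len(chars)
    if digit_ratio > max_digit_ratio:
        return False  # likely a table row
    return text.endswith((".", "!", "?"))  # prose usually ends a sentence


def extract_narrative(raw_text):
    """Keep only the blocks that look like prose, one paragraph per block."""
    # Assumption: blank lines separate layout blocks in the extracted text.
    blocks = re.split(r"\n\s*\n", raw_text)
    kept = [" ".join(b.split()) for b in blocks if is_narrative(b)]
    return "\n\n".join(kept)
```

Reading report2.txt, passing its contents to `extract_narrative`, and writing the result to a new .txt file would give a plain-text input for Fathom. Paragraphs that are split across pages or columns may not end with punctuation and would be dropped by this filter, so it is worth checking the output against a few pages by hand.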
