I would like to compare SFCR reports and IFRS reports of insurance companies using the FOG index as part of my bachelor thesis. The reports are provided in PDF format.

I want to work with the Fathom package in Perl, but for this I need the narrative sections of the financial statements as plain text (.txt). Is there a way to extract them without copying everything over manually?

Thanks in advance!

brian d foy
kinku
  • Can you give an example of the reports, e.g. provide a link to a PDF file? – Håkon Hægland Nov 10 '20 at 21:55
  • Of course! Under the following link you will find the integrated Annual Report as well as the SFCR report. https://www.aegon.com/investors/annual-reports/ – kinku Nov 10 '20 at 22:54
  • Thanks, and which part of the report do you want to extract? For example, for the above linked-to `aegon-integrated-annual-report-2019.pdf`, which pages? – Håkon Hægland Nov 10 '20 at 23:03
  • Hey Håkon, I want to extract all narrative areas as raw text. In other words, all tables, figures, headings and images should be removed. – kinku Nov 10 '20 at 23:12
  • https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf – brian d foy Nov 11 '20 at 05:38

1 Answer

The Python module pdfminer can be used to extract all the text, including text that sits inside figures and tables:

$ pip install pdfminer
$ qpdf --decrypt --password='' report.pdf report2.pdf
$ pdf2txt.py -o report2.txt report2.pdf

This saves the extracted text to report2.txt. Note that I used the sample PDF file aegon-integrated-annual-report-2019.pdf. This file turned out to be encrypted and pdf2txt.py refused to process it, but luckily qpdf was able to decrypt it as shown above.
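Since the extracted text still contains headings, table rows and figure labels, a post-processing step may help to get closer to the narrative parts only. The following is a minimal sketch and not part of pdfminer: the function names `is_narrative` and `narrative_text` and both thresholds are assumptions made for illustration. It also assumes that the text blocks in report2.txt are separated by blank lines. It keeps only blocks that look like running prose and will need tuning for real reports.

```python
import re


def is_narrative(block: str, min_words: int = 8, max_digit_ratio: float = 0.2) -> bool:
    """Heuristic check whether a text block looks like running prose.

    Both thresholds are illustrative guesses, not values from any library.
    """
    words = block.split()
    if len(words) < min_words:
        # Very short blocks are usually headings, captions or page numbers.
        return False
    chars = [c for c in block if not c.isspace()]
    digit_ratio = sum(c.isdigit() for c in chars) / len(chars)
    if digit_ratio > max_digit_ratio:
        # Blocks dominated by digits are usually table rows.
        return False
    # Prose should contain at least one sentence terminator.
    return re.search(r"[.!?](\s|$)", block) is not None


def narrative_text(raw: str) -> str:
    """Keep only the prose-like blocks of text extracted from a PDF."""
    # Assumption: blocks are separated by one or more blank lines.
    blocks = re.split(r"\n\s*\n", raw)
    # Collapse the line breaks inside each kept block into single spaces.
    kept = [" ".join(b.split()) for b in blocks if is_narrative(b)]
    return "\n\n".join(kept)
```

To use it, read report2.txt into a string, pass it to `narrative_text`, and write the result to a new file that can then be given to Fathom. Because this is only a heuristic, long headings or text-heavy table cells can slip through, so it is worth checking a few pages of the output by hand.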

Håkon Hægland