
I want to extract the text content of this PDF: https://www.welivesecurity.com/wp-content/uploads/2019/07/ESET_Okrum_and_Ketrican.pdf

Here is my code:

import re
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

def get_pdf_text(path):
    rsrcmgr = PDFResourceManager()
    with StringIO() as outfp, open(path, 'rb') as fp:
        device = TextConverter(rsrcmgr, outfp)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp, check_extractable=True):
            interpreter.process_page(page)
        device.close()
        # Collapse all whitespace runs into single spaces.
        text = re.sub(r'\s+', ' ', outfp.getvalue())
    return text

if __name__ == '__main__':
    path = './ESET_Okrum_and_Ketrican.pdf'
    print(get_pdf_text(path))

But in the extracted text, some period characters are missing:

is a threat group believed to be operating out of China Its attacks were first reported in 2012, when the group used a remote access trojan (RAT) known as Mirage to attack high-profile targets around the world However, the group’s activities were traced back to at least 2010 in FireEye’s 2013 report on operation Ke3chang – a cyberespionage campaign directed at diplomatic organizations and missions in Europe The attackers resurfaced

It really annoys me, because I'm doing natural language processing on the extracted text, and without the periods the whole document is considered as one big sentence.

I strongly suspect that the /ToUnicode map of the PDF contains bad data, since I had the same problem with PDF.js. I have read this answer, which says that whenever the /ToUnicode map of a PDF is bad, there is no way to extract its text correctly without doing OCR.

But I have also been using pdf2htmlEX and PDFium (the PDF renderer of Chrome), and they both extract all the characters of a PDF very well (at least for this PDF, that is).

For instance, when I give this PDF to pdf2htmlEX, it detects that the /ToUnicode data is bad and drops the font for a new one:

[screenshot: pdf2htmlEX log showing the bad ToUnicode map being dropped]

So my question is: is it possible for PDFMiner to use the same feature as pdf2htmlEX and PDFium, the one that allows all the characters of a PDF to be extracted correctly even with bad /ToUnicode data?

Thank you for your help.

JacopoStanchi

2 Answers


I don't think this is fixable, because the tool does nothing wrong. After investigation: the PDF writes out a real period; the instruction used is:

(.) Tj

The (.) stands for character 0x2E (which is the correct character for a period (or "full stop") in Unicode as well).

However, the font used has a ToUnicode map (yay!), but it appears to be mapping the period to the wrong character (boo!):

<2E> <0020>

So the period character is mapped to the 0x0020 character, which is, wait for it, a space.

So your options are to find a tool that can fix this mapping in the ToUnicode map for this font (I don't know of any), or to use something like OCR instead.
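
For what it's worth, such a fix could be scripted. Below is a minimal sketch using pikepdf (my pick, not a tool mentioned anywhere in this thread) that rewrites the bad bfchar entry shown above. The exact byte pattern is an assumption for this particular PDF; a robust tool would parse the CMap instead of string-matching it:

import pikepdf

# Sketch: patch the ToUnicode CMap so code 0x2E maps to U+002E again.
# Assumes every page has /Resources with a /Font dictionary; add guards
# for real-world use. Shared font objects are simply patched twice, which
# is harmless because the replacement is idempotent.
with pikepdf.open('ESET_Okrum_and_Ketrican.pdf') as pdf:
    for page in pdf.pages:
        for name, font in page.Resources.Font.items():
            if '/ToUnicode' not in font:
                continue
            cmap = font.ToUnicode.read_bytes()
            fixed = cmap.replace(b'<2E> <0020>', b'<2E> <002E>')
            if fixed != cmap:
                font.ToUnicode.write(fixed)  # replace the stream contents
    pdf.save('ESET_Okrum_and_Ketrican_fixed.pdf')

After saving, a ToUnicode-based extractor such as pdfminer should see the corrected map and extract the periods.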

David van Driessche
  • Thank you for your answer; as I thought, it's the same problem as with PDF.js. But in this case, do you know what pdf2htmlEX and PDFium do to fix this issue? They both extract the periods correctly and I really don't think they're using OCR. – JacopoStanchi Jul 19 '20 at 11:02

Actually the PDF is similar to the one inspected in this answer:

  • According to the Encoding entry of the font at hand, it uses regular WinAnsiEncoding for codes from 0x20 on upwards, so the code 0x2E would represent the period character.

  • As @David already pointed out in his answer, though, the code 0x2E (a period according to the Encoding, see above) in the ToUnicode map is mapped to U+0020, the regular space character.

  • In the page content streams yet another mechanism to map drawn text to Unicode is used: marked content with ActualText properties. E.g., in the case of the extracted text quoted by the OP:

    (, also known as APT15, is a threat group believed to be operating out of\
     China)Tj
    /Span<</ActualText<FEFF002E>>> BDC 
    (.)Tj
    EMC  
    

    i.e. the 0x2E code (= '.' in ASCII) in (.)Tj, which according to the Encoding represents a period but which the ToUnicode map "corrects" to a space character, is marked to actually represent 0xFEFF002E in UTF-16 Unicode, which is a BOM followed by a period character.

Thus,

  • text extractors only seeing the Encoding of the font see 0x2E as a period (most likely pdf2htmlEX is such a case, explicitly ignoring the ToUnicode map here; see the sketch after this list);
  • text extractors also seeing the ToUnicode map but not the ActualText marked-content property see 0x2E as a space (as pdfminer does);
  • text extractors also seeing the ActualText marked-content property see 0x2E as a period (e.g. Adobe Reader copy&paste).
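
If you want pdfminer to join the first category, one route is to make it prefer the Encoding-derived mapping over the ToUnicode map, just as pdf2htmlEX effectively does. The following is only a sketch, based on my reading of pdfminer.six internals (PDFSimpleFont.to_unichr consults the ToUnicode map first and falls back to the Encoding-derived cid2unicode table); these are implementation details and may differ between versions:

from pdfminer.pdffont import PDFSimpleFont

# Sketch: prefer the Encoding-derived cid2unicode table over the
# (here broken) ToUnicode map, similar to pdf2htmlEX dropping it.
_orig_to_unichr = PDFSimpleFont.to_unichr

def _encoding_first_to_unichr(self, cid):
    try:
        return self.cid2unicode[cid]       # mapping built from /Encoding
    except KeyError:
        return _orig_to_unichr(self, cid)  # fall back to the ToUnicode map

PDFSimpleFont.to_unichr = _encoding_first_to_unichr

Apply the patch before calling the get_pdf_text from the question. Note that this is exactly as 'safe' or 'unsafe' as the pdf2htmlEX strategy: fonts whose ToUnicode map is the only correct mapping (e.g. subset fonts with unusual encodings) will then extract incorrectly.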

Misleading some text extractors like this is usually done on purpose: most automatic text extractors use ToUnicode but not ActualText, so they extract incorrectly, while copy&paste from Adobe Reader still works.
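
And to join the third category, you can hook into pdfminer's marked-content callbacks: PDFPageInterpreter forwards BDC/EMC to the device methods begin_tag/end_tag. Below is a sketch of a minimal device doing this, again assuming recent pdfminer.six internals (the render_string signature and the begin_tag hook); it extracts text sequentially, with no layout analysis, and substitutes the ActualText whenever the current span carries one:

from io import StringIO

from pdfminer.pdfdevice import PDFDevice
from pdfminer.pdffont import PDFUnicodeNotDefined
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

class ActualTextDevice(PDFDevice):
    """Sequential extractor that honours /ActualText spans (sketch)."""

    def __init__(self, rsrcmgr):
        super().__init__(rsrcmgr)
        self.buf = StringIO()
        self._in_actual_text = False

    def begin_tag(self, tag, props=None):
        # Called for BDC; props is the property dict of the span.
        if isinstance(props, dict) and 'ActualText' in props:
            raw = props['ActualText']  # a PDF string; here BOM + UTF-16BE
            self.buf.write(raw.decode('utf-16'))
            self._in_actual_text = True

    def end_tag(self):  # called for EMC
        self._in_actual_text = False

    def render_string(self, textstate, seq, ncs, graphicstate):
        if self._in_actual_text:
            return  # these glyphs are replaced by the ActualText
        font = textstate.font
        if font is None:
            return
        for obj in seq:
            if isinstance(obj, bytes):
                for cid in font.decode(obj):
                    try:
                        self.buf.write(font.to_unichr(cid))
                    except PDFUnicodeNotDefined:
                        pass

rsrcmgr = PDFResourceManager()
device = ActualTextDevice(rsrcmgr)
interpreter = PDFPageInterpreter(rsrcmgr, device)
with open('./ESET_Okrum_and_Ketrican.pdf', 'rb') as fp:
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
print(device.buf.getvalue())

Because there is no layout analysis, spaces and line breaks only appear where the content stream actually draws space characters (as it does in this PDF); for NLP-style processing of flattened text that is usually acceptable.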

mkl
  • Thank you for your answer, so you're telling me that pdf2htmlEX and Adobe Reader both extract the period characters correctly, but not for the same reason? The former just ignores whole parts of the data while the latter is more advanced and considers the ActualText property? I would also like to know how 'safe' the pdf2htmlEX strategy is: does it happen often that the Encoding data maps to the wrong character? – JacopoStanchi Jul 20 '20 at 07:20
  • Besides, do you know of any software that would allow automatic text extraction like Adobe Acrobat's (i.e. with the ActualText method)? I have seen that Acrobat has an SDK, but all the examples I have seen only work on Windows. – JacopoStanchi Jul 20 '20 at 07:35
  • Lastly and sorry to bother you, but are you sure that pdf2htmlEX belongs to the first category? Maybe it drops the ToUnicode map because there is an ActualText property, no? – JacopoStanchi Jul 20 '20 at 07:42
  • According to your screen shot `pdf2htmlEX` drops the **ToUnicode** map because it considers it invalid, and it drops it completely, not only for the period characters. That it ignores the **ActualText** admittedly is an assumption, albeit one I'd consider quite likely to be correct. To be 100% sure I'd have to experiment or study the code. – mkl Jul 20 '20 at 11:17
  • *"do you know of any software that would allow to do automatic text extraction like Adobe Acrobat"* - not in the Python context (because I hardly know the libs and apps here). In the Java context I at least know that text extraction with some common general purpose PDF libraries can easily be tweaked to also take **ActualText** into account. – mkl Jul 20 '20 at 11:21
  • Python is quite handy for communicating with codebases in different programming languages; for example, I have tried Tika from Python, but it had the same problem. So the Java libraries you are talking about could be very useful to me. – JacopoStanchi Jul 20 '20 at 11:47
  • Which tool did you use to inspect the PDF content? I opened it with a text editor but the streams were not properly encoded so I couldn't see the `ActualText` properties. – JacopoStanchi Jul 23 '20 at 12:52
  • *"opened it with a text editor but the streams were not properly encoded so I couldn't see the ActualText properties"* - they are properly encoded, properly for PDF streams. Which allows them to be compressed, which they usually are. I use iText RUPS or PDFBox PDFDebugger, but there also are other such PDF inspection tools. – mkl Jul 23 '20 at 15:18