2

I have written code that extracts the text from a PDF file using Python and the PyPDF2 library. The code works well for most documents, but sometimes it returns strange characters. I think that's because the PDF has a watermark over the page, so the text is not recognized:

import requests
from io import BytesIO
import PyPDF2

def pdf_content_extraction(pdf_link):

    all_pdf_content = ''

    #sending requests
    response = requests.get(pdf_link)
    my_raw_data = response.content


    pdf_file_text = 'PDF File: ' + pdf_link + '\n\n'
    #extract text page by page
    with BytesIO(my_raw_data) as data:
        read_pdf = PyPDF2.PdfFileReader(data)

        #looping through each page
        for page in range(read_pdf.getNumPages()):
            page_content = read_pdf.getPage(page).extractText()
            page_content = page_content.replace("\n\n\n", "\n").strip()

            #store data into variable for each page
            pdf_file_text += page_content + '\n\nPAGE '+ str(page+1) + '/' + str(read_pdf.getNumPages()) +'\n\n\n'

    all_pdf_content += pdf_file_text + "\n\n"
        
    return all_pdf_content



pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'

print(pdf_content_extraction(pdf_link))

This is the result that I'm getting:

#$%˘˘
&'(˝˙˝˙)*+"*˜
˜*
,*˜*˜ˆ+-*˘!(
.˜($*%(#%*˜-/
"*
*˜˜0!0˘˘*˜˘˜ˆ
+˜(%
*
*(+%*˜+"*˜'
$*1˜ˆ
...
...

My question is, how can I fix this problem? Is there a way to remove the watermark from the page, or something like that? Maybe this problem can be fixed in some other way, and maybe the problem is not the watermark/logo at all?

taga
  • It looks like the text in your pdf is encoded using a non-standard, ad-hoc generated encoding, and it misses a **ToUnicode** map for text extraction. – mkl Mar 10 '21 at 10:29
  • do you know how to fix that? – taga Mar 10 '21 at 10:46
  • How about giving this a shot https://stackoverflow.com/questions/23821204/read-pdf-in-python-and-convert-to-text-in-pdf/45080979#45080979 ? – Tarun Lalwani Mar 10 '21 at 19:12
  • *"do you know how to fix that?"* - Well, one can try and repair each font in each document manually (like in [this PDFBox/Java proof-of-concept](https://stackoverflow.com/a/39644941/1729265)) or one can go for OCR. – mkl Mar 10 '21 at 20:55
  • @TarunLalwani The problem is that I cannot download the file. The key is to extract the text from the link – taga Mar 11 '21 at 08:52
  • @mkl I need to do it automatically, without manual work – taga Mar 11 '21 at 08:53
  • @taga when you do `response = requests.get(pdf_link)` you are basically downloading it only, just not to a file – Tarun Lalwani Mar 11 '21 at 08:54
  • *"I need to do it automaticly, without manual work"* - then first extract as you used do, then check whether it makes sense (e.g. a dictionary test), and if it doesn't, go for ocr. – mkl Mar 11 '21 at 13:30
  • As explained in my answer, my assumption that the cause was a missing ToUnicode map was wrong. – mkl Mar 15 '21 at 16:59
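
A rough sketch of the check mkl describes above: extract the text as before, test whether the result looks like readable text, and only fall back to OCR (or another library) if it does not. The `looks_like_text` helper, the character-ratio test, and the 0.8 threshold are illustrative assumptions, not a standard recipe:

# Hypothetical plausibility check: most characters of real text should be
# letters, digits, whitespace or common punctuation. Threshold is arbitrary.
def looks_like_text(extracted, threshold=0.8):
    if not extracted:
        return False
    plausible = sum(
        ch.isalnum() or ch.isspace() or ch in '.,;:-()/%\'"!?'
        for ch in extracted
    )
    return plausible / len(extracted) >= threshold

content = pdf_content_extraction(pdf_link)
if looks_like_text(content):
    print(content)
else:
    # Garbled result: fall back to OCR (e.g. pdf2image + pytesseract)
    # or to one of the other extraction libraries shown in the answers below.
    print('Extraction looks garbled; OCR fallback needed.')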

3 Answers

5

The garbled text issue you're having has nothing to do with the watermark in the document. Your issue seems to be related to the encoding used in the document. The German characters in your document should be extractable with PyPDF2, because it uses the latin-1 (iso-8859-1) encoding/decoding model, but that encoding model isn't working with your PDF.

When I look at the underlying info of your PDF I note that it was created using these apps:

  • 'Producer': 'GPL Ghostscript 9.10'
  • 'Creator': 'PDFCreator Version 1.7.3'

When I look at one of the PDFs in this question also written in German, I note that it was created using different apps:

  • '/Creator': 'Acrobat PDFMaker 11 für Excel'
  • '/Producer': 'Adobe PDF Library 11.0'

I can read the second file perfectly with PyPDF2.
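
For reference, here is a minimal sketch of how these Producer/Creator entries can be read with PyPDF2's getDocumentInfo(), using the URL from the question; the same pattern works for any PDF loaded into a reader:

import requests
from io import BytesIO
import PyPDF2

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)

with BytesIO(response.content) as data:
    reader = PyPDF2.PdfFileReader(data)
    info = reader.getDocumentInfo()
    # '/Producer' and '/Creator' identify the software that generated the PDF
    print(info.get('/Producer'))
    print(info.get('/Creator'))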

When I looked at this file from your other question, I noted that it also cannot be read correctly by PyPDF2. This file was created with the same apps as the file from this bounty question.

  • 'Producer': 'GPL Ghostscript 9.10'
  • 'Creator': 'PDFCreator Version 1.7.3'

This is the same file that threw an error when attempting to extract the text using pdfreader.SimplePDFViewer.

I looked at the bugs for Ghostscript and noted that there are some font-related issues for Ghostscript 9.10, which was released in 2015. I also noted that some people mentioned that PDFCreator Version 1.7.3, released in 2018, also had some font embedding issues.

I have been trying to find the correct decoding/encoding sequence, but so far I haven't been able to extract the text correctly.

Here are some of the sequences:

page_content.encode('raw_unicode_escape').decode('ascii', 'xmlcharrefreplace')
# output
\u02d8
\u02c7\u02c6\u02d9\u02dd\u02d9\u02db\u02da\u02d9\u02dc
\u02d8\u02c6!"""\u02c6\u02d8\u02c6!


page_content.encode('ascii', 'xmlcharrefreplace').decode('raw_unicode_escape')
# output
# ˘
ˇˆ˙˝˙˛˚˙˜ 
˘ˆ!"""ˆ˘ˆ!

I will keep looking for the correct encoding/decoding sequence to use with PyPDF2. It is worth noting that PyPDF2 hasn't been updated since May 18, 2016, and encoding issues are a common problem with the module. On top of that, the module's maintenance is effectively dead, hence the ports PyPDF3 and PyPDF4.

I attempted to extract the text from your PDF using PyPDF2, PyPDF3 and PyPDF4. All 3 modules failed to extract the content from the PDF that you provided.


You can definitely extract the content from your document using other Python modules.

Tika

This example uses Tika and BeautifulSoup to extract the content in German from your source document.

import requests
from tika import parser
from io import BytesIO
from bs4 import BeautifulSoup

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)
with BytesIO(response.content) as data:
    parse_pdf = parser.from_buffer(data, xmlContent=True)

    # Parse metadata from the PDF
    metadata = parse_pdf['metadata']

    # Parse the content from the PDF
    content = parse_pdf['content']

    # Convert double newlines into single newlines
    content = content.replace('\n\n', '\n')
    soup = BeautifulSoup(content, "lxml")
    body = soup.find('body')
    for p_tag in body.find_all('p'):
        print(p_tag.text.strip())

pdfminer

This example uses pdfminer to extract the content from your source document.

import requests
from io import BytesIO
from pdfminer.high_level import extract_text


pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)
with BytesIO(response.content) as data:
    text = extract_text(data, password='', page_numbers=None, maxpages=0, caching=True,
                        codec='utf-8', laparams=None)
    print(text.replace('\n\n', '\n').strip())
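
If the page-by-page output of the original script is needed, pdfminer also exposes extract_pages; a minimal sketch, assuming pdfminer.six:

import requests
from io import BytesIO
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)
with BytesIO(response.content) as data:
    # extract_pages yields one LTPage layout object per page
    for page_number, page_layout in enumerate(extract_pages(data), start=1):
        page_text = ''.join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        )
        print('PAGE ' + str(page_number) + '\n' + page_text.strip() + '\n')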
Life is complex
  • Two questions: how do I know how many pages my file has? And second, for some files I'm getting this kind of response '. . . ..cid:57)(cid:72)(cid:85)(cid:75)(cid:68)(cid:81....' – taga Mar 16 '21 at 10:38
  • I noted this issue across several of your PDF extractions. It's related to the extraction module not being able to handle special symbols and other characters. I saw that you opened an issue with the owner of *pdfreader* on this problem. He stated that he was working on a patch for this issue, which he has posted. – Life is complex Mar 17 '21 at 03:13
0
def remove_watermark(wm_text, inputFile, outputFile):
    from PyPDF4 import PdfFileReader, PdfFileWriter
    from PyPDF4.pdf import ContentStream
    from PyPDF4.generic import TextStringObject, NameObject
    from PyPDF4.utils import b_

    with open(inputFile, "rb") as f:
        source = PdfFileReader(f)
        output = PdfFileWriter()

        for page_number in range(source.getNumPages()):
            page = source.getPage(page_number)
            content_object = page["/Contents"].getObject()
            content = ContentStream(content_object, source)

            # blank out every Tj (show text) operand that starts with the watermark text
            for operands, operator in content.operations:
                if operator == b_("Tj"):
                    text = operands[0]
                    if isinstance(text, str) and text.startswith(wm_text):
                        operands[0] = TextStringObject('')

            page.__setitem__(NameObject('/Contents'), content)
            output.addPage(page)

        with open(outputFile, "wb") as outputStream:
            output.write(outputStream)


wm_text = 'wm_text'
inputFile = r'input.pdf'
outputFile = r"output.pdf"
remove_watermark(wm_text, inputFile, outputFile)
  • Considering the text extraction garbage the op gets, it is unlikely that the watermark text can be found as clear text in some **Tj** argument. – mkl Mar 12 '21 at 10:54
0

In contrast to my initial assumption in comments to the question, the issue is not some missing ToUnicode map. I didn't see the URL to the file immediately and, therefore, guessed. Instead, the issue is a very primitively implemented text extraction method.

The PageObject method extractText is documented as follows:

extractText()

Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Returns: a unicode string object.

(PyPDF2 1.26.0 documentation, visited 2021-03-15)

So this method extracts the string arguments of text drawing instructions in the content stream, ignoring the encoding information in the respective current font object. Thus, only text drawn using a font with some ASCII-ish encoding is properly extracted.

As the text in question uses a custom ad-hoc encoding (generated while creating the page, containing the used characters in the order of their first occurrence), that extractText method is unable to extract the text.
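
For illustration only, the following rough sketch (using PyPDF2 1.26 internals, mirroring the ContentStream usage in the previous answer) walks the first page's content stream and prints the raw Tj operands; this is essentially the data extractText collects, still in the font's private encoding:

import requests
from io import BytesIO
import PyPDF2
from PyPDF2.pdf import ContentStream
from PyPDF2.utils import b_

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
with BytesIO(requests.get(pdf_link).content) as data:
    reader = PyPDF2.PdfFileReader(data)
    page = reader.getPage(0)
    content = ContentStream(page['/Contents'].getObject(), reader)
    for operands, operator in content.operations:
        if operator == b_('Tj'):
            # raw show-text operand, still in the font's ad-hoc encoding
            print(repr(operands[0]))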

Proper text extraction methods, on the other hand, can extract the text without issue as tested by Life is complex and documented in his answer.

mkl