What I want to do: extract text information from a pdf file and redirect that to a txt file.

What I did:

pip install pdfminer

pdf2txt.py file.pdf > output.txt

What I got:

UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 0: illegal multibyte sequence

My observation:

\u2022 is the bullet point character, •.

pdf2txt.py works well without redirection: the bullet point character is written to stdout without any error.

My question:

Why does redirection cause a Python error? As far as I know, redirection is the OS's job, and it simply copies the output after the program has finished.

How can I fix this error? I cannot do any modification to pdf2txt.py as it's not my code.

  • Python needs to know what encoding to use for output. It can choose a different encoding depending on whether the output is going to a terminal or a file. – Mark Ransom Jan 17 '20 at 00:19
  • Ok, thank you Mark, any suggestion on how to fix it? – Wenqing Zong Jan 17 '20 at 00:23
  • I think there's an environment variable that affects it, but I don't have time now to look it up. – Mark Ransom Jan 17 '20 at 00:25
  • It's fine, I can wait for other people to help me. Thanks a lot for answering me. – Wenqing Zong Jan 17 '20 at 00:28
  • Normally Python uses the terminal's encoding to encode text before sending it to the terminal, but when you redirect, it can't get the encoding from the terminal. You would have to set the encoding manually in the Python script, probably in every `print()`. – furas Jan 17 '20 at 01:12
  • BTW: using Google `python redirect utf-8` I found [UnicodeDecodeError when redirecting to file](https://stackoverflow.com/questions/4545661/unicodedecodeerror-when-redirecting-to-file) on stackoverflow. Use Google to find more. – furas Jan 17 '20 at 01:15

2 Answers

2

Redirection causes an error because the default encoding used by Python does not support one of the characters you're trying to output. In your case you're trying to output the bullet character using the GBK codec. This probably means you're using a Chinese version of Windows.
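To see the failure in isolation, here is a minimal sketch that encodes the bullet character with both codecs. The helper name `try_encode` is made up for illustration; it relies only on the error message from the question, which says the `gbk` codec cannot encode `'\u2022'`.

```python
# U+2022 (bullet) is the character from the traceback in the question.
bullet = "\u2022"

def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

gbk_result = try_encode(bullet, "gbk")      # fails, as in the question
utf8_result = try_encode(bullet, "utf-8")   # succeeds with a 3-byte sequence
print("gbk:", gbk_result)
print("utf-8:", utf8_result)
```

The same encoding step happens inside Python whenever text written to stdout has to become bytes, which is why the error only appears once the output goes to a file.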

Python 3.6 or later can print to the terminal window on Windows without trouble, because it writes to the console through the Unicode API and skips the code-page encoding entirely. Only when the output is redirected to a file must the Unicode text be encoded to a byte stream.
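If you want to check what Python picked on your own machine, a small sketch like the following may help. The function name `describe_stdout` and the script name below are hypothetical; the report goes to stderr so that it stays visible even when stdout is redirected.

```python
import sys

def describe_stdout():
    """Collect how Python has configured sys.stdout for this run."""
    return {
        # True when stdout is attached to a terminal, False when redirected.
        "isatty": bool(sys.stdout.isatty()),
        # The codec Python will use to turn text into bytes on this stream.
        "encoding": getattr(sys.stdout, "encoding", None),
    }

# Write to stderr so the report is not swallowed by the redirection.
print(describe_stdout(), file=sys.stderr)
```

Saving this as, say, `check_stdout.py` and running it once normally and once with `> out.txt` should show whether the encoding changes between the two cases on your system.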

You can set the environment variable PYTHONIOENCODING to change the encoding used for stdio. If you use UTF-8 it will be guaranteed to work with any Unicode character.

set PYTHONIOENCODING=utf-8
pdf2txt.py file.pdf > output.txt
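If you would rather not change the environment of your whole shell session, the same idea can be sketched as a small Python wrapper that sets the variable only for the child process. The function name `run_with_utf8_stdout` is invented here, and the commented invocation assumes `pdf2txt.py` is reachable at that path; neither comes from pdfminer itself.

```python
import os
import subprocess
import sys

def run_with_utf8_stdout(command, output_path):
    """Run command with PYTHONIOENCODING=utf-8 and save its stdout to output_path."""
    # Copy the current environment and override the encoding for the child only.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    # The child writes already-encoded bytes, so open the file in binary mode.
    with open(output_path, "wb") as out:
        completed = subprocess.run(command, stdout=out, env=env, check=False)
    return completed.returncode

# Hypothetical usage (assumes pdf2txt.py is in the current directory):
# run_with_utf8_stdout([sys.executable, "pdf2txt.py", "file.pdf"], "output.txt")
```

This leaves `pdf2txt.py` untouched; only the environment it runs in is changed.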
Mark Ransom
0

You seem to have obtained Unicode characters from the raw bytes, but you still need to encode them when writing them out. I recommend using UTF-8 encoding for txt files.

Making the encoding parameter more explicit is probably what you want.

def gbk_to_utf8(source, target):
    # Read the source file as GBK and rewrite it as UTF-8, one line at a time.
    with open(source, "r", encoding="gbk") as src, \
         open(target, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line)
  • Thank you, but as I said, I can't modify that Python file... – Wenqing Zong Jan 19 '20 at 01:47
  • @NevilleZong according to the question you are running the Python source code directly. Not sure what prevents you from making a copy of `pdf2txt.py` and changing it. – Mark Ransom Jan 20 '20 at 20:12