I used pdf2text from PDFMiner to reduce a PDF to text. Unfortunately, the output contains special characters. Here is the output from my console:
>>>a=pdf_to_text("ap.pdf")
Here's a sample of it, slightly truncated:
>>>a[5000:5500]
'f one architect. Decades ...... but to re\xef\xac\x82ect\none set of design ideas, than to have one that contains many\ngood but independent and uncoordinated ideas.\n1 Joshua Bloch, \xe2\x80\x9cHow to Design a Good API and Why It Matters\xe2\x80\x9d, G......=-3733'
From what I understood, I need to encode it, so I tried:
>>>a[5000:5500].encode('utf-8')
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 237: ordinal not in range(128)
I searched around a bit and tried several suggestions, notably "Replace special characters in python". The input comes from PDFMiner, so it's tough (AFAIK) to control. What is the way to produce proper plain text from this output?
What am I doing wrong?
-- A quick fix: change PDFMiner's codec to ascii, but it's not a lasting solution --
-- Abandoned the quick fix in favor of the answer: changing the codec removes information --
-- A relevant topic, as mentioned by Maxim: http://en.wikipedia.org/wiki/Windows-1251 --
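-- A minimal sketch of what seems to be going on, assuming the text really is UTF-8 (the \xef\xac\x82 and \xe2\x80\x9c byte sequences in the sample suggest so). In Python 2, calling .encode() on a byte string first decodes it implicitly as ASCII, which is where the UnicodeDecodeError comes from; the bytes need .decode('utf-8') instead. The variable names below are illustrative only, the sample is shortened, and replacing curly quotes with straight ones is just one guess at what "proper plaintext" should mean here. --

```python
import unicodedata

# Shortened byte sample taken from the console output in the question.
raw = b're\xef\xac\x82ect \xe2\x80\x9cHow to Design a Good API and Why It Matters\xe2\x80\x9d'

# The data is already encoded; decode() turns the UTF-8 bytes into Unicode text.
text = raw.decode('utf-8')

# NFKC compatibility normalization expands the "fl" ligature (U+FB02) to "fl".
normalized = unicodedata.normalize('NFKC', text)

# Assumption: curly double quotes should become plain ASCII quotes.
plain = normalized.replace(u'\u201c', u'"').replace(u'\u201d', u'"')

print(plain)
```

If the goal is only to keep the text intact rather than force it into ASCII, stopping after the decode step and working with the Unicode string is probably enough.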