A quick question:

I opened a .PDF file with Notepad++ and saved it as a .txt file. It has the following value:

%PDF-1.7\n\n4 0 obj\n(Identity)\nendobj\n5 0 obj(Adobe)endobj8 0 obj> stream xœì½x\ÅÕ7>sïÝÞ«¶hµ»ZíJòªKV³,­Õ­b[’eK²eKVqaÝmlÜ0Íу˜NB Á$ÙÆ¢›¼¦…’˜4JpH€ " éæÎcxóþŸïý¾G#Ÿ=¿™;3wæÌ™3gæÞ]#Œ²Ã‡€:Ê›fWÕþ°ã ’ý~+Bž£¥åó_{óÒÕ¿™€õ®ŠÒº²‹U3¯ý!E¤ª¼¢rÁ«|ˆ{w!..................................

I am thinking of converting the text back to a PDF using Python.

Which modules are needed?

grc
  • Can't you just save it as .pdf again? This seems like the long way around. – jambrothers Mar 28 '17 at 10:37
  • Hi @jambrothers, I am trying to read a db column that holds attachments (I do not have access to the files themselves), hence the long way around. I will have to extract the text from each file and insert it into another db column. The files stored in the db (with image datatype) are displayed as '0x255044...'; however, when read into a Python df, they are displayed as '%PDF-1.7\n\n4 0 obj\n(Identity)\nendobj\n5 0 obj(Adobe)endobj8 0 obj> stream xœì½x\ÅÕ7>sïÝÞ«¶hµ»....', so I was thinking of reading and "converting" that back into the original doc and then extracting the text within the file. – grc Mar 28 '17 at 14:00
  • So if I understand correctly what you want to do is extract the human-readable text from a PDF file which you have saved with a .txt extension? If so I'd refer to the answers here: http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text?rq=1 – jambrothers Mar 28 '17 at 14:27
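
The '0x255044...' value described in the comments is the hex form of the file's bytes (0x25 0x50 0x44 is "%PD"), so writing those bytes to disk unchanged should give back the original PDF without any conversion module. A minimal sketch using only the standard library follows. The helper names `db_value_to_pdf_bytes` and `save_pdf` are hypothetical, and the `encoding` parameter is an assumption: if the value has already been decoded into a string, the bytes can only be recovered when the codec that was used is known and maps every byte value, which is true for Latin-1 but not for cp1252. Reading the column as raw bytes is the safer route.

```python
from pathlib import Path


def db_value_to_pdf_bytes(value, encoding="latin-1"):
    """Return raw PDF bytes from a value read out of the database column.

    Hypothetical helper: `encoding` is an assumption about how the bytes
    were decoded if the value arrives as an ordinary string.
    """
    if isinstance(value, (bytes, bytearray, memoryview)):
        # Best case: the driver already returned the column as binary data.
        data = bytes(value)
    elif isinstance(value, str) and value[:2].lower() == "0x":
        # Hex literal as shown by the database client, e.g. '0x255044...'.
        data = bytes.fromhex(value[2:])
    elif isinstance(value, str):
        # Already-decoded text: re-encode with the assumed codec.
        data = value.encode(encoding)
    else:
        raise TypeError(f"unsupported value type: {type(value).__name__}")

    # Every PDF starts with the '%PDF-' header, so use it as a sanity check.
    if not data.startswith(b"%PDF-"):
        raise ValueError("value does not look like a PDF")
    return data


def save_pdf(value, path):
    """Write the recovered bytes to `path` and return the number of bytes."""
    data = db_value_to_pdf_bytes(value)
    Path(path).write_bytes(data)
    return len(data)
```

Calling `save_pdf(row_value, "attachment.pdf")` (with `row_value` standing in for one cell of the column) would then produce a file that a PDF-to-text library, such as those in the linked question, can open. A file that was opened and re-saved in a text editor may already be damaged, so the database value is the better source.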

0 Answers