I am trying to parse the text of a PDF. I have successfully converted the PDF to text using Apache PDFBox's PDFTextStripper, which I access from Python through jpype.
The PDF text is stored in the "pdf_text" variable.
While processing the text, I noticed an issue. If I display the variable by typing its name in a Jupyter notebook cell, I get something like the following:
'Safety\xa0Data\xa0Sheet\xa0\xa0 Stock\xa0Number:\xa0Revision\xa0Date:\xa0Replaces:\xa0300900004‐12‐2018\xa009‐05‐2017\xa0\xa0TECTYL®\xa0506\xa0\xa01.\xa0Identification\xa0 \xa0P'
But when I use print(text), I get:
'safety data sheet stock number: revision date: replaces: 300900004‐12‐2018 09‐05‐2017 tectyl 506 1 identification product identifier used on the label: tectyl 506'
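To illustrate the difference, here is a minimal sketch using a shortened sample string (an assumption for illustration, not my full data). As far as I understand, typing a name in a notebook cell shows repr() of the string, which escapes the non-breaking space as \xa0, while print() writes the characters themselves:

```python
# Hypothetical sample: a short piece of the extracted text,
# with non-breaking spaces (U+00A0) between the words.
sample = "Safety\xa0Data\xa0Sheet"

# What the notebook shows when the variable name is typed on its own:
# the repr, with the non-breaking spaces escaped.
echoed = repr(sample)

# What print() writes: the characters themselves. The non-breaking
# space is still in the string; it just renders like a normal space.
printed = str(sample)

print(echoed)
print(printed)
print("\xa0" in printed)  # the character is present in both cases
```

So both displays seem to come from the same underlying string, which still contains the \xa0 characters either way.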
My processing is failing because it operates on the former representation, with the \xa0 characters, but I want it to operate on the latter, with ordinary spaces.
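One approach I can sketch, assuming the non-breaking spaces (\xa0) are the only thing getting in the way, is to replace them with ordinary spaces and collapse repeated whitespace. Here normalize_spaces is a hypothetical helper name and the sample is a shortened piece of the output above; the sketch does not lowercase the text or strip punctuation:

```python
import re


def normalize_spaces(text):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    # Turn each non-breaking space (U+00A0) into a regular space.
    text = text.replace("\xa0", " ")
    # Collapse any run of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample taken from the start of the extracted text.
sample = "Safety\xa0Data\xa0Sheet\xa0\xa0 Stock\xa0Number:"
cleaned = normalize_spaces(sample)
print(repr(cleaned))
```

I am not sure whether this is the right way to handle it, or whether other special characters in the text need similar treatment.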
Would anyone please shed light on this issue? How can I convert my string to the latter format?
Thank you!