
I am trying to parse the text of a PDF. I have successfully converted the PDF to text with Apache PDFBox's PDFTextStripper, which I access from Python through JPype.

The extracted text is stored in the variable `pdf_text`.

While processing the text, I noticed an issue. If I display the variable by typing its name in a Jupyter notebook cell, I get something like the following:

'Safety\xa0Data\xa0Sheet\xa0\xa0 Stock\xa0Number:\xa0Revision\xa0Date:\xa0Replaces:\xa0300900004‐12‐2018\xa009‐05‐2017\xa0\xa0TECTYL®\xa0506\xa0\xa01.\xa0Identification\xa0 \xa0P'

But when I use print(text), I get:

'safety data sheet stock number: revision date: replaces: 300900004‐12‐2018 09‐05‐2017 tectyl 506 1 identification product identifier used on the label: tectyl 506'

My processing fails because it operates on the former representation, but I want the latter.

Would anyone please shed light on this issue? How can I convert my string to the latter format?

Thank you!

  • `\xa0` is a non-breaking space character. Jupyter Notebook is showing its escaped character code, while `print()` outputs the character itself, which looks like an ordinary space. – Barmar Apr 29 '21 at 03:58
  • See: [How to remove \xa0 from string in Python?](https://stackoverflow.com/questions/10993612/how-to-remove-xa0-from-string-in-python). The question is old and about Python 2, but the answer using `unicodedata` still works in Python 3. – Mark Apr 29 '21 at 04:08
  • Thank you, Barmar. I had never encountered it before. The issue is that I am searching for specific strings (not patterns) in the PDF, and with the `\xa0` characters in place I cannot find them. I guess one simple solution is to run `re.sub('\xa0', ' ', pdf_text)` before searching for the strings of interest. Do you have any alternative recommendations? – Julia Penfield Apr 29 '21 at 04:10
  • I just saw your reply, Mark. I work in Python 3, by the way. I tried my idea, `re.sub('\xa0', ' ', pdf_text)`, and also `pdf_text.replace(u'\xa0', u' ')` from the link you shared. Both worked! Thank you. The issue is now resolved. – Julia Penfield Apr 29 '21 at 04:18
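
For anyone landing here later, the fixes mentioned in the comments can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the sample string is a shortened, hypothetical stand-in for the PDFBox output, and the names `replaced`, `normalized`, and `collapsed` are made up for the example. Note that NFKC normalization also rewrites other compatibility characters (ligatures, for instance), so the plain `replace` is the safer choice when only the non-breaking space should change.

```python
import re
import unicodedata

# Hypothetical sample standing in for the text extracted by PDFBox.
pdf_text = "Safety\xa0Data\xa0Sheet\xa0\xa0 Stock\xa0Number:"

# A bare variable name in Jupyter shows repr(), which escapes U+00A0 as \xa0.
# print() writes the character itself, which renders like an ordinary space.
print(repr(pdf_text))
print(pdf_text)

# Option 1: replace the non-breaking space with a regular space.
replaced = pdf_text.replace("\xa0", " ")

# Option 2: NFKC normalization maps U+00A0 to U+0020
# (and also folds other compatibility characters).
normalized = unicodedata.normalize("NFKC", pdf_text)

# Optional: collapse runs of whitespace left behind by the replacement.
collapsed = re.sub(r"\s+", " ", replaced).strip()

# A plain substring search fails on the raw text and succeeds after cleaning.
print("Safety Data Sheet" in pdf_text)
print("Safety Data Sheet" in replaced)
```

Running one of these once, right after extraction, means every later substring search sees ordinary spaces.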
