I am writing a script that extracts data points and order numbers from a large series of PDF documents. I am using PyPdf to convert each PDF to a txt file, then using re.search to pull out the data that matches the expected format. The problem is that the order number is not found in about half of the documents. I believe the cause is the hyphen in the middle of the order number.
The order number has the format A0A000-A00, and the Python script finds only about half of the cases using
re.search(r"([A-Z]\d{1}[A-Z]\d{3}-[A-Z]\d{2})",line)
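One way to make the pattern tolerate different hyphen characters is to match a character class instead of the literal ASCII hyphen. This is a minimal sketch, not a confirmed fix: the list in HYPHENS is my assumption about which hyphen-like characters a PDF extractor might emit, and find_order is a hypothetical helper name.

```python
import re

# Assumed set of hyphen-like characters (adjust to what the files really contain):
# hyphen-minus, soft hyphen, hyphen, non-breaking hyphen, figure dash, en dash, minus sign.
HYPHENS = "-\u00ad\u2010\u2011\u2012\u2013\u2212"

# Same shape as A0A000-A00, but any character from HYPHENS is accepted in the middle.
ORDER_RE = re.compile(
    r"[A-Z]\d[A-Z]\d{3}[" + re.escape(HYPHENS) + r"][A-Z]\d{2}"
)


def find_order(line):
    """Return the first order number found in line, or None if there is none."""
    match = ORDER_RE.search(line)
    return match.group(0) if match else None
```

The returned string keeps whatever hyphen character was in the text, so it may still need normalizing before it is compared or stored.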
While using regex101 I noticed that some of the files use a Unicode hyphen that shows up as U+00AD (the soft hyphen). That would be fine, but I have no idea how to sanitize it, because in the txt file it looks just like an ordinary hyphen.
Attempting to sanitize while converting from PDF to txt does not work either. Using
txt.write(fileText.replace('-',''))
only replaces the hyphen in about half of the files, and
txt.write(fileText.replace('U+00AD',''))
does nothing to any of the files.
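For reference, 'U+00AD' in Python source is just the six-character string U, +, 0, 0, A, D, so replace() looks for that literal text; the soft hyphen itself is written with the escape '\u00ad'. A minimal sketch of normalizing hyphen-like characters to the ASCII hyphen with str.translate follows. The contents of HYPHEN_VARIANTS are an assumption, and normalize_hyphens is a hypothetical helper name.

```python
# Assumed hyphen-like characters to normalize:
# soft hyphen, hyphen, non-breaking hyphen, figure dash, en dash, minus sign.
HYPHEN_VARIANTS = "\u00ad\u2010\u2011\u2012\u2013\u2212"

# Translation table mapping each variant to the ASCII hyphen-minus.
TO_ASCII_HYPHEN = str.maketrans({ch: "-" for ch in HYPHEN_VARIANTS})


def normalize_hyphens(text):
    """Replace every character in HYPHEN_VARIANTS with an ASCII hyphen."""
    return text.translate(TO_ASCII_HYPHEN)
```

With this, the write step could become txt.write(normalize_hyphens(fileText)), assuming fileText is a str.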
Edit: By changing the code to
txt.write(fileText.encode('utf-8').decode('utf-8').replace('\u00ad','-'))
I seem to have fixed the Unicode issue, at least when I paste the resulting txt file into regex101. However, the script still does not find every instance of the order number.
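Note that encode('utf-8').decode('utf-8') is a round trip that returns the same string, so only the replace('\u00ad','-') part changes anything. To check whether other unexpected characters are still in the extracted text, something like the following diagnostic sketch could help. report_non_ascii is a hypothetical helper name, and treating every character above code point 127 as suspicious is an assumption.

```python
import unicodedata
from collections import Counter


def report_non_ascii(text):
    """List each non-ASCII character in text as (code point, Unicode name, count)."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"), n)
        for ch, n in counts.most_common()
    ]
```

Running it on the text of a file where the search fails, and printing repr() of the line that should contain the order number, shows exactly which characters are present rather than how they are rendered.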