I am writing a script that extracts data points and order numbers from a large series of PDF documents. I use PyPdf to convert each PDF to a txt file, then use re.search to pull out the data that matches the expected format. The problem is that the script fails to find order numbers in about half of the documents. I believe the cause is the hyphen in the middle of the order number.

The order number has the format A0A000-A00, and the Python script finds about half of the cases using

re.search(r"([A-Z]\d{1}[A-Z]\d{3}-[A-Z]\d{2})",line).

While using regex101 I noticed that some of the files use a Unicode hyphen, which shows up as U+00AD (soft hyphen). That is fine, but I have no idea how to sanitize it, because in the txt file it just looks like an ordinary hyphen.

Attempting to sanitize while converting from PDF to txt does not work either: txt.write(fileText.replace('-','')) only affects about half of the files, and txt.write(fileText.replace('U+00AD','')) does nothing to any of them.
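For reference, a minimal sketch of why the second call has no effect: 'U+00AD' is a plain six-character string, while '\u00ad' is the escape for the code point itself. The sample text, the `normalize_dashes` helper and the `DASH_TABLE` name are made up for illustration, and the list of look-alike characters is an assumption, not an exhaustive set.

```python
# Hypothetical sample: an order number whose separator is U+00AD (soft hyphen).
file_text = "Order: B1C234\u00adD56"

# 'U+00AD' is the literal text U, +, 0, 0, A, D, so nothing in the file matches it.
unchanged = file_text.replace("U+00AD", "-")

# '\u00ad' is the escape sequence for the actual soft-hyphen character.
fixed = file_text.replace("\u00ad", "-")

# A translation table maps several look-alikes to '-' in one pass.
# Assumption: these are the variants of interest; the real files may contain others.
DASH_TABLE = str.maketrans({
    "\u00ad": "-",  # soft hyphen
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2012": "-",  # figure dash
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2212": "-",  # minus sign
})


def normalize_dashes(text):
    """Return text with the dash variants listed above replaced by '-'."""
    return text.translate(DASH_TABLE)


print(repr(unchanged))
print(repr(fixed))
print(normalize_dashes("A0A000\u2013A00"))
```

Replacing with '-' rather than '' keeps the order-number pattern intact, so the original regex can still match after sanitizing.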

Edit: By changing the code to

txt.write(fileText.encode('utf-8').decode('utf-8').replace('\u00ad','-'))

I seem to have fixed the Unicode issue, at least when I paste the txt file into regex101. The problem of not finding all instances of the order number persists.

  • Just replace `-` with `[\u00AD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]` – Wiktor Stribiżew Feb 09 '23 at 15:09
  • It is your task to sanitize the inputs. Just search for the possible variants of the hyphen (minus, etc.). I would check the text first, e.g. adapt your search so that it prints all the different hyphens it finds. Or go to Wikipedia (you find everything there): https://en.wikipedia.org/wiki/Hyphen#Unicode, where you will find many variations of hyphens. – Giacomo Catenazzi Feb 09 '23 at 15:11
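
The two suggestions above could be combined roughly as follows. This is only a sketch: the `PROBE` and `ORDER` patterns, the helper names and the sample text are hypothetical, and the character class is a shortened assumption rather than the full list from the first comment. The probe keeps the shape of the pattern in the question but lets any single character sit where the hyphen is expected, so the separators actually present in the extracted text can be listed.

```python
import re
import unicodedata
from collections import Counter

# Same shape as the question's pattern, but the separator slot accepts any
# single character and captures it.
PROBE = re.compile(r"[A-Z]\d[A-Z]\d{3}(.)[A-Z]\d{2}")

# Widened pattern; the set of separators here is an assumption and should be
# adjusted to whatever the probe reports for the real files.
ORDER = re.compile(r"[A-Z]\d[A-Z]\d{3}[-\u00ad\u2010-\u2015\u2212][A-Z]\d{2}")


def separator_counts(text):
    """Count which characters occupy the separator position."""
    return Counter(m.group(1) for m in PROBE.finditer(text))


def describe(counts):
    """Format each separator as 'U+XXXX NAME: count'."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}: {n}"
        for ch, n in counts.most_common()
    ]


# Hypothetical extracted text with three different separators.
sample = "A1B234-C56 then D7E890\u00adF12 and G3H456\u2013J78"

for line in describe(separator_counts(sample)):
    print(line)

print(len(ORDER.findall(sample)))
```

If the probe reports a separator that is not in the `ORDER` class, adding it there (or normalizing it before matching) would be the next step.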
