I have been trying to extract certain text from PDF converted into text files. The PDF came from various sources and I don't know how they were generated.
The pattern I was trying to extract was a simply two digits, follows by a hyphen, and then another two digits, e.g. 12-34. So I wrote a simple regex \d\d-\d\d
and expected that to work.
However when I test it I found that it missed some hits. Later I noted that there are at least two hyphens represented as \u2212
and \xad
. So I changed my regex to \d\d[-\u2212\xad]\d\d
and it worked.
My question is, since I am going to extract so many PDF that I don't know what other variations of hyphen are out there, is there any regex expression covering all "hyphens", and hopefully looks better than the [-\u2212\xad]
expression?