11

I have been trying to extract certain text from PDFs converted into text files. The PDFs came from various sources, and I don't know how they were generated.

The pattern I was trying to extract is simply two digits, followed by a hyphen, and then another two digits, e.g. `12-34`. So I wrote the simple regex `\d\d-\d\d` and expected that to work.

However, when I tested it, I found that it missed some hits. Later I noticed that at least two other characters show up in place of the hyphen: `\u2212` and `\xad`. So I changed my regex to `\d\d[-\u2212\xad]\d\d` and it worked.

My question is: since I am going to process so many PDFs that I cannot know what other hyphen variations are out there, is there a regex that covers all "hyphens", and hopefully looks better than the `[-\u2212\xad]` expression?

João Pimentel Ferreira
Kenneth L
  • No, you must indeed decide what characters count as "hyphen" and include them manually. Also, `U+2212` is not a hyphen (it's a mathematical minus) and neither is `U+00AD` (this is a soft "breaking" hyphen). – Jongware Feb 22 '18 at 09:24
  • @usr2564301 Thanks for your comment, but I don't want to distinguish them as long as they look like a hyphen. I cannot control the input, as it was converted from various PDF files. So is there any regex representation for "anything that looks like a hyphen, a minus, an em dash, an en dash or similar character"? – Kenneth L Feb 22 '18 at 09:28
  • `\p{Pd}` from [matching-unicode-dashes-in-java-regular-expressions](https://stackoverflow.com/questions/3045511/matching-unicode-dashes-in-java-regular-expressions) – Nahuel Fouilleul Feb 22 '18 at 09:29
  • @KennethL, if you don't care to distinguish a hyphen from a mathematical minus sign and only want to match _anything that remotely resembles a hyphen_, why not use `\d\d.\d\d` as your regexp? It will match every hyphen available in Unicode (and things that are not hyphens too, but they can resemble one, depending on how open your mind is :) ) – Luis Colorado Feb 24 '18 at 08:02
  • @LuisColorado Thanks for suggesting `\d\d.\d\d`, but I need to exclude patterns like `12345`. Thanks for reminding me that I can change my requirement as well. – Kenneth L Mar 02 '18 at 09:22
  • @KennethL, then use `\d\d\D\d\d`, and next ask how to match `12334.4566`. If I had an answer to this question I would have posted one... don't take me seriously in these comments. By the way, how does your proposed regex behave with things like `1234456-23453445`? Does it get `56-23` only? Is that valid? Where do you state the actual requirements? – Luis Colorado Mar 02 '18 at 13:44
  • @LuisColorado I truly appreciate your input. Yes, getting `56-23` from `1234456-23453445` matches my expectation as well. The accepted answer below gave me a reasonably good solution indeed. – Kenneth L Mar 03 '18 at 01:12

2 Answers

9

The solution you ask for in the question title implies a whitelisting approach and means that you need to find the chars that you think are similar to hyphens.

You may refer to the Unicode Punctuation, Dash category (Pd), which lists the characters Unicode classifies as dashes and hyphens.

You may use the PyPI `regex` module and its `\p{Pd}` pattern to match any character in that category.

Or, if you can only work with `re`, use

[\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]

You may expand this list with other Unicode chars that contain "minus" in their Unicode names, such as U+2212 MINUS SIGN from the question, which is not in the Dash category; see this list.
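As a rough sketch of how the hardcoded list could be generated rather than typed, the snippet below builds the character class from the standard `unicodedata` module, so it follows whatever Unicode version your Python ships with. The helper name `dash_class` and its `extra` parameter are hypothetical, and adding U+2212 is an assumption about what the question needs, not part of the Dash category.

```python
import re
import sys
import unicodedata


def dash_class(extra=""):
    """Build a character class from every category-Pd code point plus `extra`."""
    pd_chars = [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Pd"
    ]
    # re.escape keeps "-" and any other special character literal in the class.
    return "[" + "".join(re.escape(ch) for ch in pd_chars + list(extra)) + "]"


# U+2212 MINUS SIGN is category Sm, so it has to be added by hand (assumption:
# the question wants it treated as a hyphen).
pair = re.compile(r"\d\d" + dash_class(extra="\u2212") + r"\d\d")

text = "12-34, 56\u221278, 90\u201312, 11\u00ad22"
matches = pair.findall(text)
print(matches)  # hyphen-minus, minus sign and en dash match; soft hyphen does not
```

Scanning every code point takes a moment, so in practice you would probably build the class once and reuse the compiled pattern.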

A blacklisting approach means you define what must not appear between the two pairs of digits and allow everything else. To match any non-whitespace char, use `\S`. To match any punctuation or symbol, use `(?:[^\w\s]|_)`.
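To see what the blacklisting pattern accepts and rejects, here is a small sketch using only the standard `re` module. The sample string and the name `loose_pair` are made up for illustration.

```python
import re

# Two digits, one character that is neither a word character nor whitespace
# (underscore is re-admitted explicitly), then two more digits.
loose_pair = re.compile(r"\d\d(?:[^\w\s]|_)\d\d")

text = "12-34 12345 12 34 12_34 12\u221234 12\xad34 12.34"
loose_matches = loose_pair.findall(text)
print(loose_matches)
```

`12345` and `12 34` are skipped because a digit and a space are excluded, while `12.34` is picked up along with the hyphen-like variants, which is the price of not listing the allowed characters.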

Note that the "soft hyphen", U+00AD, is not included in the `\p{Pd}` category and won't be matched by that construct. To include it, create a character class and add it:

[\xAD\p{Pd}]
[\xAD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]
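A quick check of the `re`-compatible class against the three variants from the question might look like the sketch below. The class is copied from the line above; the names `PD_WITH_SOFT_HYPHEN`, `samples`, and `extended` are hypothetical, and appending `\u2212` is an assumption based on the question's own regex.

```python
import re

# Character class copied from the answer: soft hyphen plus the Pd characters.
PD_WITH_SOFT_HYPHEN = (
    r"[\xAD\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A"
    r"\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D]"
)
pair = re.compile(r"\d\d" + PD_WITH_SOFT_HYPHEN + r"\d\d")

samples = {
    "hyphen-minus": "12-34",
    "soft hyphen": "12\xad34",
    "minus sign": "12\u221234",
}
results = {name: bool(pair.search(s)) for name, s in samples.items()}
print(results)  # the minus sign is the only one of the three not matched

# U+2212 is not in Pd, so append it to the class to cover the question's case.
extended = re.compile(r"\d\d" + PD_WITH_SOFT_HYPHEN[:-1] + r"\u2212]" + r"\d\d")
extended_results = {name: bool(extended.search(s)) for name, s in samples.items()}
```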
Wiktor Stribiżew
  • ... and even Chinese symbols that resemble a hyphen, or anything that remotely resembles a hyphen, such as the `_` underscore, the `=` equals sign, the `¬` logical negation, or the `~` tilde (well, it vaguely resembles a hyphen), right? – Luis Colorado Feb 24 '18 at 08:05
  • @LuisColorado Whatever OP wants to include into the list of "chars resembling a hyphen". – Wiktor Stribiżew Feb 24 '18 at 08:09
  • I'm afraid none of the samples I gave is in your list (and the Punctuation, Dash category contains double hyphens, though not the mathematical equals sign, and the `WAVE DASH`, which resembles a tilde but has its own code point). I think using `\d\d\D\d\d` should give better results at this point. – Luis Colorado Feb 24 '18 at 08:16
  • @LuisColorado I answered the problem stated in the question title. I see your point, but in that case I would not use `.` (as in your comment to the question), as it may match a digit, nor `\D`, as it may match whitespace. In this case, I would use `\d\d(?:[^\w\s]|_)\d\d` to match a char other than whitespace, letters and digits between two pairs of digits. Basically, `(?:[^\w\s]|_)` matches any punctuation and symbols. – Wiktor Stribiżew Feb 24 '18 at 08:26
  • Well, the comment is not meant for you, but for the OP. The problem here is that the question is so ambiguous that it's very difficult to state what should be considered a hyphen, especially in Unicode terms. **You have done very good work** searching for them, but my comment was not for you. If I had a good answer, I would have written one instead of only commenting on other answers. Even in the case you present, a sequence like `1234-5678` would be matched as `34-56` (which I assume is bad behaviour), so consider it a difficult question to answer. – Luis Colorado Feb 24 '18 at 09:06
2

This is also a possible solution, if your regex engine supports it:

/\p{Dash}/u

This will include all these characters.

João Pimentel Ferreira