
I'm trying to find a way to determine whether a string contains at least n characters in a specific order.

I am processing an enormous amount of data written by hand, and the number of typos is pretty crazy.

I need to find text parts in a large string looking something like:

irrelevant text MONKEY, CHIMP: more irrelevant text

I need to find MONKEY, CHIMP:

The ways this gets mistyped are pretty crazy. Here is an extra weird example:

MonKEY , CHIMp :

I've got to a point in my regex where I'm able to find all of these occurrences. Probably not the nicest solution, but here it is:

 (m|M)(o|O)(n|N)(k|K)(e|E)(y|Y)\s*,?\s+(c|C)(h|H)(i|I)(m|M)(p|P)(\s+)?:

Looks a bit weird but it works.

Unfortunately the weirdness does not stop here. I need to amend this regex so that it also allows for 1 missing letter in each word.

So I would need to amend this regex so it would also work for something like:

MonKEY , CIMp :

onKEY , ChIMp :

onKEY , CIMp :

I would think that there should be a way to tell the regex that each word only needs wordlength - 1 of its characters to match, in order.

Is there a simple way to do this?

I've been looking into {4,} but I'm not sure this is the right direction or whether it could be applied here.

Thanks in advance, Peter

Peter Jaloveczki
  • You can make the regex a lot easier if you normalize the text, for example by converting it to lower case. – Markus Jun 27 '17 at 15:02
  • Or by doing a case-insensitive match. See https://stackoverflow.com/questions/3436118/is-java-regex-case-insensitive – GhostCat Jun 27 '17 at 15:03
  • Regex alone might not be enough for a scalable solution. You might end up needing your own parser evaluating similarities with a dictionary word, e.g. with a Levenshtein distance metric. – Mena Jun 27 '17 at 15:07
  • Have you considered using a more advanced algorithm than regex? E.g. https://commons.apache.org/proper/commons-text/jacoco/org.apache.commons.text.similarity/LevenshteinDistance.java.html – Tom Lord Jun 27 '17 at 15:08
  • This is horrible for Regex (it just gets way too long). You should probably use [Fuzzy text search](https://stackoverflow.com/questions/327513/fuzzy-string-search-library-in-java) instead. – Tezra Jun 27 '17 at 15:19
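As a rough illustration of the case-insensitive suggestion in the comments above, the pattern from the question can be collapsed as in the sketch below. The class name `CaseInsensitiveLabel` and the method `containsLabel` are made up for this example, and allowing optional whitespace on both sides of the comma is an assumption made so that inputs like "MonKEY , CHIMp :" are covered. It does not yet handle missing letters.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this sketch.
final class CaseInsensitiveLabel {

    // The CASE_INSENSITIVE flag replaces every (x|X) pair from the question.
    // Assumption: whitespace is optional on both sides of the comma.
    static final Pattern LABEL = Pattern.compile(
            "monkey\\s*,?\\s*chimp\\s*:",
            Pattern.CASE_INSENSITIVE);

    // Returns true if the label occurs anywhere in the text.
    static boolean containsLabel(String text) {
        return LABEL.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLabel("irrelevant text MONKEY, CHIMP: more irrelevant text")); // true
        System.out.println(containsLabel("MonKEY , CHIMp :"));                                    // true
        System.out.println(containsLabel("MonKEY , CIMp :"));                                     // false
    }
}
```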

3 Answers


With pure regex, the best you could do is something like this (whitespace added for readability):

/
  ^
  (
    monkey\s*,?\s*chimp\s*:
  |
    onkey\s*,?\s*chimp\s*:
  |
    mnkey\s*,?\s*chimp\s*:
  |
    ...
  )
  $
/ix

However, this is a very long-winded approach and still won't account for all sorts of other fuzzy-matches like "Monkey, Chinp:" or "Monkey; Chimp:".
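The alternation above does not have to be typed by hand. The following is a minimal Java sketch that generates it, assuming the target words contain only plain letters (so no regex escaping is needed) and that "one missing letter" means exactly one deleted character. The names `OneMissingLetter`, `variants`, `alternation` and `containsLabel` are hypothetical, and `find()` is used instead of the `^...$` anchors because the question searches inside a larger string.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this sketch.
final class OneMissingLetter {

    // Returns the word itself plus every variant with exactly one letter removed.
    static Set<String> variants(String word) {
        Set<String> result = new LinkedHashSet<>();
        result.add(word);
        for (int i = 0; i < word.length(); i++) {
            result.add(word.substring(0, i) + word.substring(i + 1));
        }
        return result;
    }

    // Builds a non-capturing alternation such as (?:monkey|onkey|mnkey|...).
    // Assumption: the word has no regex metacharacters.
    static String alternation(String word) {
        return "(?:" + String.join("|", variants(word)) + ")";
    }

    static final Pattern MONKEY_CHIMP = Pattern.compile(
            alternation("monkey") + "\\s*,?\\s*" + alternation("chimp") + "\\s*:",
            Pattern.CASE_INSENSITIVE);

    // Returns true if the (possibly mistyped) label occurs anywhere in the text.
    static boolean containsLabel(String text) {
        return MONKEY_CHIMP.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsLabel("irrelevant text onKEY , CIMp : more text")); // true
        System.out.println(containsLabel("irrelevant text mon , p : more text"));      // false
    }
}
```

This keeps the pattern strict (at most one deletion per word) but, as noted above, it still ignores substitutions such as "Chinp".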


An alternative approach you could take is to first check the length of the string:

/^.{10,15}$/

and then perform some very-fuzzy match on it:

/m?o?n?k?e?y?\s*,?\s*c?h?i?m?p?\s*:/i

However, you'd need to be careful here since there may be some bizarre results included in the match list, such as:

"mon      c:"

I would recommend taking a different, non-regex approach and using a Levenshtein distance library. This will allow you to set generic boundaries on how closely the string needs to match "Monkey, Chimp".
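To make the idea concrete, here is a small self-contained sketch of the Levenshtein distance in plain Java. It is a hand-written stand-in rather than any particular library's API; the names `EditDistance`, `levenshtein` and `isClose` are hypothetical, and the threshold of one edit per word is an assumption taken from the question.

```java
import java.util.Locale;

// Hypothetical helper class, named only for this sketch.
final class EditDistance {

    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    // Counts the insertions, deletions and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Case-insensitive check that candidate is within maxEdits of target.
    static boolean isClose(String candidate, String target, int maxEdits) {
        return levenshtein(candidate.toLowerCase(Locale.ROOT),
                           target.toLowerCase(Locale.ROOT)) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("onkey", "monkey")); // 1
        System.out.println(levenshtein("chinp", "chimp"));  // 1
        System.out.println(isClose("mon", "monkey", 1));    // false
    }
}
```

Unlike the regex variants, this also tolerates substituted letters such as "Chinp", at the cost of having to split the text into candidate words first.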

Tom Lord

^\w{10,10}$ # allows words of exactly 10 characters. Set it to length - 1. Then make each of the characters optional.

I think just {10} works as well.

Gilrich
  • You can just write `\w{10}`; there's no need to suggest `\w{10,10}`. However, this does not answer OP's problem: They wanted a pattern that would also match, for example, `"MonKEY , CIMp"` - which is **13** characters. – Tom Lord Jun 27 '17 at 15:24
  • That's why I wrote that each character should then be made optional, so that it can be left out. – Gilrich Jun 27 '17 at 15:27

You can use a regex like this. It's not very pretty, but your example is strange too.

First, use case-insensitive matching: https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#CASE_INSENSITIVE

I don't know of a solution in a single pass, but you can first check for "m?o?n?k?e?y?\s+,?\s+c?h?i?m?p?(\s+)?:" and then check the length in a separate test, which is easy.
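One possible way to sketch this two-step check in Java is shown below. It departs from the description in two ways, both of which are assumptions: each word gets its own capture group so the length test applies per word rather than to the whole string, and the whitespace around the comma uses `\s*` so that "MONKEY, CHIMP:" is also accepted. The names `OptionalLetters` and `containsLabel` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named only for this sketch.
final class OptionalLetters {

    // Step 1: every letter is optional, but the letters must appear in order.
    static final Pattern LOOSE = Pattern.compile(
            "(m?o?n?k?e?y?)\\s*,?\\s*(c?h?i?m?p?)\\s*:",
            Pattern.CASE_INSENSITIVE);

    // Step 2: accept a match only if each word lost at most one letter.
    static boolean containsLabel(String text) {
        Matcher m = LOOSE.matcher(text);
        while (m.find()) {
            boolean firstOk = m.group(1).length() >= "monkey".length() - 1;
            boolean secondOk = m.group(2).length() >= "chimp".length() - 1;
            if (firstOk && secondOk) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsLabel("irrelevant text onKEY , CIMp : more text")); // true
        System.out.println(containsLabel("mon      c:"));                              // false
    }
}
```

Checking the captured word lengths instead of the total length means that extra spaces around the comma cannot make up for missing letters.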

azro
  • So things as fuzzy as `mon , p` will match? Doesn't sound very reliable to me... – Tom Lord Jun 27 '17 at 15:12
  • @TomLord First, thanks for this very constructive and useful comment. As I wrote, you might also add a test for length, and other tests if needed. I only provided the structure, which handles the order and the possibility of missing letters. If you have a better idea, just click the blue button below labelled "Post your answer". – azro Jun 27 '17 at 15:15
  • StackOverflow has truncated my comment, so perhaps you misunderstood due to the formatting. There are meant to be **several** spaces on either side of the comma - so the length check will still pass. You can also see [my answer](https://stackoverflow.com/a/44783986/1954610) below. – Tom Lord Jun 27 '17 at 15:18