I'm using an automatic reconciliation tool that is based on regex. I want to match two names, for example "John Francis Edward Smith" against "John Smith". Since the names can contain errors, I only compare the first three letters of the first word and the first three letters of the last word. Hence, a positive match here would be "Joh" and "Smi" on both sides. I can build the expression (^\D{3}).*\s+(?=\S*$)(\D{3}).*$, but the problem is that the engine makes two groups and applies the OR operator to them, whereas I need AND, so that both have to match. I've tried everything. Any suggestions?

Linger
  • Hi. I'm trying to match two names, as above. On one side I import bank statements and on the other outgoing payments. I need to compare them and find the differences, so I reconcile them with an automated tool that uses strictly regex. When it matches pairs, it removes them from the list, and the rest is done by hand. So you can have 'John Smith' from the bank and 'John Smoth' from accounting. – user2011163 Jan 25 '13 at 14:30
  • would the name always follow name-surname format – Anirudha Jan 25 '13 at 15:44
  • From the comments it sounds like your tool extracts groups and then compares some input and if any of the groups match it considers the entire string a match. It doesn't sound like there is a regex trick to get around that. – JDB Jan 25 '13 at 15:56
  • What is the name of the reconciliation tool? – robinCTS Jan 26 '13 at 20:57
  • @Some1.Kill.The.DJ Yes, it's always Name-Surname (might happen the other way round, but those we fix manually) – user2011163 Jan 30 '13 at 13:50
  • @Cyborgx37 The tool reads lines of text (name-surname) from 2 different sources, applies the given regex to each of them, and then compares the results. If they match, they are removed from the list. The problem is that when the regex uses grouping, the tool extracts these groups and applies the OR operator between them. So let's say you have 'John Smith' from the 1st source and 'John Jones' from the 2nd. The regex extracts 'Joh' OR 'Smi' from the 1st and 'Joh' OR 'Jon' from the 2nd. This would produce a match due to 'Joh', but of course it's not correct. If it were 'JohSmi' and 'JohJon' it would work. – user2011163 Jan 30 '13 at 14:00
  • possible duplicate of [Regular expression to skip character in capture group](http://stackoverflow.com/questions/277547/regular-expression-to-skip-character-in-capture-group) – JDB Jan 31 '13 at 14:13

2 Answers

Assuming I understand your question correctly, this works for me

/^(\D{3}).*(\b[^\s]{3})/ 

  • ^ anchors to the start of the line
  • (\D{3}) captures the first group
  • .* greedily takes as much as possible
  • \b finds a "word boundary"
  • [^\s]{3} is three characters that are not whitespace (I guess \S{3} would work too)

The trick is that .*\b will find the last word boundary in the string
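To see what the two groups capture, here is a minimal sketch in Python. Python is only an assumption for illustration, since the reconciliation tool's regex flavour is not named in the question, and the helper name `first_last_prefixes` is made up for this example.

```python
import re

# Pattern from the answer above: three non-digit characters at the start,
# then the first three non-whitespace characters of the last word.
PATTERN = re.compile(r"^(\D{3}).*(\b[^\s]{3})")


def first_last_prefixes(name):
    """Return the two captured groups as a tuple, or None if there is no match."""
    match = PATTERN.search(name)
    return match.groups() if match else None


if __name__ == "__main__":
    for name in ("John Francis Edward Smith", "John Smith", "John Jones"):
        print(name, "->", first_last_prefixes(name))
```

Both "John Francis Edward Smith" and "John Smith" yield the same pair of groups, which is the point of the greedy .* followed by \b.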

Vorsprung
  • The expression does the trick, but in the same way as the one I provided in the initial question. It builds two sub-groups. In my example one group is 'joh' and the other 'smi'. Then the engine makes a comparison: 'joh' OR 'smi' against 'joh' OR 'smi'. In this case it works, but if the text is 'John Jones' against 'John Smith' it would still give a positive match on 'Joh'. – user2011163 Jan 25 '13 at 14:44
  • ^(joh).*(\bsmi)\S*$ matches ONLY "joh" and "smi" as pairs. If you need to look for multiple names with one regexp you'd have to use alternates with both names in like this ^(joh).*(\bsmi)\S*$|^(lar).*(\bwal)\S*$ – Vorsprung Jan 25 '13 at 20:19
  • The engine grabs 1 line of text from each of 2 sources. Then it runs both through the applied regex. The results are compared and removed if they match. Each line of text is a name. I want to take the 1st 3 letters of the 1st word and the 1st 3 letters of the last word as a shortcut to avoid middle names and mistakes in names. If I make two groups within my regex as above, then the engine uses the OR operator (John Smith is successfully matched to John Jones due to 'Joh'). An option would be to build one word from the parts I get (example: JohSmi compared to JohJon), a sort of concatenation of the groups. – user2011163 Jan 28 '13 at 14:26

If you need to avoid grouping, you could try something simple like

\bJoh.*\bSmi

This will match a string that contains "Joh" and "Smi" with the caveat that each three letter sequence starts a word (so it would not match "John ClineSmith")
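A quick way to check that behaviour, sketched in Python (an assumption, since the tool's regex flavour is unknown; the function name `contains_joh_then_smi` is hypothetical):

```python
import re

# Group-free pattern from this answer: "Joh" starting a word, followed
# later in the string by "Smi" starting a word.
PATTERN = re.compile(r"\bJoh.*\bSmi")


def contains_joh_then_smi(text):
    """True if the text has a word starting with Joh and a later word starting with Smi."""
    return PATTERN.search(text) is not None


if __name__ == "__main__":
    for text in ("John Smith", "John Francis Edward Smith",
                 "John ClineSmith", "John Jones"):
        print(text, "->", contains_joh_then_smi(text))
```

Note that this only tests for one specific pair of prefixes; it does not extract anything from an arbitrary name.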

EDIT

I'm not looking for John Smith specifically. I'm trying to extract the 1st 3 letters of the name and the 1st 3 letters of the last name, where the name-lastname input might have 1 or more middle names (example: John Robert James Smith). But it can't be two groups, the result has to be in one word (i.e. 'JohSmi' in the example above).

Sorry to be the bearer of bad news, but what you are asking cannot be done purely in regular expressions. Regular expressions are meant to match a sequence of characters, one after the other, without breaking. You can use grouping to extract a sub-sequence from the final match or you can perform multiple matches, but a regex match will always return an unbroken sequence from the first matched character to the last (no skipping).

What you are asking for is a regex that returns 3 characters from the beginning of a match and 3 from the end without any of the characters in between. This is a broken sequence, and no regex engine I am aware of is capable of returning it as a single match. You will either have to use additional code (PHP or whatever your tool is written in) or abandon this method and try to find an alternative.
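If additional code is an option, the concatenation step is small. The following is only a sketch in Python, assuming the names can be pre-processed outside the tool; `name_key` and `same_person` are hypothetical helpers, not part of any reconciliation product.

```python
import re

# Two groups: first three non-digit characters, and the first three
# non-whitespace characters of the last word.
KEY_PATTERN = re.compile(r"^(\D{3}).*\b(\S{3})")


def name_key(name):
    """Join both captured groups into one word, e.g. 'JohSmi'; None if no match."""
    match = KEY_PATTERN.search(name.strip())
    if match is None:
        return None
    return match.group(1) + match.group(2)


def same_person(name_a, name_b):
    """Compare the joined keys, so BOTH prefixes must agree (AND, not OR)."""
    key_a = name_key(name_a)
    key_b = name_key(name_b)
    return key_a is not None and key_a == key_b


if __name__ == "__main__":
    print(name_key("John Robert James Smith"))
    print(same_person("John Francis Edward Smith", "John Smith"))
    print(same_person("John Smith", "John Jones"))
```

Because the two groups are joined before the comparison, 'John Smith' and 'John Jones' no longer match on 'Joh' alone.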

This question covers the same ground: Regular expression to skip character in capture group

JDB
  • I'm not looking for John Smith specifically. I'm trying to extract the 1st 3 letters of the name and the 1st 3 letters of the last name, where the name-lastname input might have 1 or more middle names (example: John Robert James Smith). But it can't be two groups, the result has to be in one word (i.e. 'JohSmi' in the example above). – user2011163 Jan 31 '13 at 13:59