I am reading in a very messy file with very little (if any) format. I am looking for the following three items, two of which I have working properly.

  • Name (first and last): working
  • Email addresses (various domains, e.g. .edu, .net, .com; there could be others as well): working
  • Employee number (two capital letters, followed by five digits, then the same two letters in reverse order): NOT working

The code I have currently for the Employee regex:

string employeeNumber = @"(?<grp1>[A-Z]{2})[0-9]{5}[A-Z]{2}";

This finds the required values, but would also find invalid employee numbers since it is not actually looking for the first two capital chars in the opposite order.

What I would like in the end is to somehow use <grp1> again, but in reversed order.

Example of a valid employee number XY12345YX.

I could not find any good documentation on reversing a regular expression group. Any ideas would be great!

EDIT:

This is an example of a line from a text document that I am reading in.

'Name list from PQP-97 system &%$ Bill Williams  MK12345KM bwilliams01@msn.com ^ %20% 
Fredericka Hanover GW22887WG freddie@verizon.net'
Alan Moore
Mason Toy
  • For the employee number you can do: `([A-Z])([A-Z])[0-9]{5}\2\1`. Do you have spaces / delimiters between the data / could you provide some example data? – JohnLBevan Feb 19 '15 at 01:21
  • So I wouldn't try to do this entirely in regex. Maybe read everything that matches 2 capital letters followed by 5 digits into an array along with its index, then run through the array and use substrings to check whether the following two letters are the first two reversed. – ghostbust555 Feb 19 '15 at 01:24
  • @johnLBevan Yes, I'll add that here in a second. It is very messy though. I will include a line or two of what is included. – Mason Toy Feb 19 '15 at 01:32

1 Answer

Try this:

/.*?([A-Z][a-z]*)\s+([A-Z][a-z]*)\s+(([A-Z])([A-Z])[0-9]{5}\5\4)\s+(\S+@\S+).*/g

Regex101 Demo: https://regex101.com/r/iB9vF2/2

  • Match1 = First Name
  • Match2 = Last Name
  • Match3 = Employee ID
  • Match4 = (ignore this; just used for finding employee id)
  • Match5 = (ignore this; just used for finding employee id)
  • Match6 = Email
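Since the question is in C#, here is a minimal sketch of how these groups might be read with .NET's `Regex` class. It assumes the pattern is written as a verbatim string without the `/.../g` delimiters, with the email group as `(\S+@\S+)` as in the explanation. The class name `EmployeeLineParser`, the method `Parse`, and the `|`-joined output format are illustrative choices, not part of the original code.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are for illustration only.
public static class EmployeeLineParser
{
    // Same pattern as above. Regex.Matches already finds every match,
    // so the /g flag has no equivalent here.
    private static readonly Regex Record = new Regex(
        @".*?([A-Z][a-z]*)\s+([A-Z][a-z]*)\s+(([A-Z])([A-Z])[0-9]{5}\5\4)\s+(\S+@\S+).*");

    // Returns one "First|Last|EmployeeId|Email" string per record found.
    public static List<string> Parse(string text)
    {
        var results = new List<string>();
        foreach (Match m in Record.Matches(text))
        {
            // Groups 4 and 5 exist only to feed the \5\4 backreferences, so skip them.
            results.Add(string.Join("|",
                m.Groups[1].Value,   // first name
                m.Groups[2].Value,   // last name
                m.Groups[3].Value,   // employee id
                m.Groups[6].Value)); // email
        }
        return results;
    }
}
```

Calling `EmployeeLineParser.Parse` on the two sample lines from the question (without the surrounding quote characters) should give one entry for Bill Williams and one for Fredericka Hanover. If the trailing `'` is left in the input, it ends up inside the email group, because `\S+` accepts any non-space character.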

Explanation:

.*? - ignore any rubbish before the first name

([A-Z][a-z]*) - first name begins with a capital followed by any number of lower case letters

\s+ - 1 or more spaces marks the end of the first name

([A-Z][a-z]*) - last name follows first name, and follows the same pattern

\s+ - last name terminated by space(s)

(([A-Z])([A-Z])[0-9]{5}\5\4) - employee id follows last name, in the format Capital1, Capital2 then 5 digits, then a repeat of Capital2 (match5) and Capital1 (match4)

\s+ - space(s) shows the end of the employee id

(\S+@\S+) - non space characters either side of an @ symbol make up the email*

.* - this just allows for junk on the end of the string. It won't swallow any of the email, since the greedy \S+ has already consumed every non-space character; the first whitespace (or the end of the string) therefore marks the end of the email.

* NB: the email regex is overly simple; should be enough for your needs, but this couldn't check for valid emails, since the rules around those are complex. Further reading: Using a regular expression to validate an email address
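On the original question of reusing <grp1> in reverse: a backreference can only repeat a captured group exactly as it was captured, so the two letters have to be captured as two separate one-letter groups and referenced in the opposite order. Below is a minimal sketch for finding employee numbers on their own, using .NET named groups and `\k<name>` backreferences. The group names `a` and `b`, the class `EmployeeIdFinder`, and the `\b` word boundaries are assumptions added for illustration.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are for illustration only.
public static class EmployeeIdFinder
{
    // (?<a>[A-Z]) and (?<b>[A-Z]) capture one capital letter each.
    // \k<b>\k<a> then requires those same two letters in reverse order.
    // The \b anchors (an added assumption) stop matches inside longer tokens.
    private static readonly Regex EmployeeId = new Regex(
        @"\b(?<a>[A-Z])(?<b>[A-Z])[0-9]{5}\k<b>\k<a>\b");

    // Returns every valid employee number found in the text, in order.
    public static List<string> FindAll(string text)
    {
        var ids = new List<string>();
        foreach (Match m in EmployeeId.Matches(text))
        {
            ids.Add(m.Value);
        }
        return ids;
    }
}
```

With this pattern, `XY12345YX` and `MK12345KM` are accepted, while `AB12345AB` is rejected because the trailing letters are not reversed.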

JohnLBevan
  • Thanks, I have two additional questions. (really more advice I guess) So, I am currently reading each of the cases separately (name, email, employeeNum), would it be better / more efficient to return these all in the one match that you have done or should I keep them separate? The other question I had is for the employeeNumber, when you are capturing the first two capitals how are you referencing them at the end of that case, I think I am missing/not seeing that part in your example. At any rate thanks a ton for the explanation very helpful! – Mason Toy Feb 19 '15 at 02:13
  • I think I may see it now. Is it the `\5\4`? – Mason Toy Feb 19 '15 at 02:15
  • Yes, that's it - count the open brackets to get the order of the matches (ignoring any non-capturing groups, though in this case there are none). I learnt most of what I know about regex in a day by playing this game - http://regexcrossword.com/ – JohnLBevan Feb 19 '15 at 02:17
  • As for whether to do things separately or at once it's not easy to say without knowing the full context... the advantage of doing it my way is you only look through the string once, rather than once per item, so it should be faster. Also if you know the order of the fields will be consistent (first, last, employee, mail) you get more hints to work off & can easily tell first name from last name despite both sharing a pattern. If the data's more chaotic than that though, there may be an advantage to doing each field separately. – JohnLBevan Feb 19 '15 at 02:20