1

I'm monitoring incoming e-mail subjects, and each subject may contain a specially formatted code, which I use to reference something else further down the line.

These codes can appear anywhere within the string, and sometimes not at all. The problem I'm having is my lack of RegEx skills (I assume RegEx is the best option for this?).

An example of a subject would be:

"Please refer to reference MZ5051CLA"
or
"Attention for Mr Danshi, RE. 11123MTX"

The codes I'm looking to extract in these scenarios are "MZ5051CLA" and "11123MTX".

The format of MZ5051CLA will be:
  - Always starts with "MZ"
  - Followed by a number
  - Always ends with "CLA"

Is there a simple way to evaluate the subject as a whole and extract any words that match the codes only?

I've looked at various solutions to my problem here on SO, but they're either overly complicated or I can't quite relate.

Edit:

As ShashishChandra pointed out, the idea is to monitor multiple mailboxes, each with their own code formats. So my idea was to implement a regex setting for each mailbox.

Perhaps this was important to mention initially, since a solution to catch all formats in one regex won't work. Apologies for that.
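
To illustrate the per-mailbox idea, here is a minimal sketch in Python (the question does not name a language, so Python is an assumption). The mailbox names `claims` and `matters` and the helper `extract_codes` are hypothetical, and the MTX pattern is only inferred from the single example above.

```python
import re

# Hypothetical per-mailbox settings. Only the MZ<digits>CLA shape is spelled
# out in the question; the <digits>MTX shape is inferred from one example.
MAILBOX_PATTERNS = {
    "claims": re.compile(r"\bMZ\d+CLA\b"),
    "matters": re.compile(r"\b\d+MTX\b"),
}


def extract_codes(mailbox, subject):
    """Return every code in `subject` that fits the mailbox's own pattern."""
    pattern = MAILBOX_PATTERNS[mailbox]
    # findall returns an empty list when the subject contains no code.
    return pattern.findall(subject)
```

Because each mailbox looks up its own pattern, a subject carrying another mailbox's code simply yields an empty list.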

Daniel Minnaar
  • You said it always starts with "MZ", so why do you want the second one? – Avinash Raj Jul 07 '14 at 08:42
  • From what I have understood from your question, you want to extract a word that contains only uppercase letters (more than one, I presume) and digits (more than one, I presume). A regex can easily be formed with these as criteria. See [http://stackoverflow.com/questions/7684815/regex-for-alpanumeric-with-at-least-1-number-and-1-character]. – Shashish Chandra Jul 07 '14 at 08:55
  • @drminnaar - Correct me if I am wrong, but there could be more cases than just these two, e.g. `RR12RA2` or `001A3`, right? In that case you would be better off using NLP algorithms rather than just finding a regex and matching against it. Regexes for these kinds of expressions often give false positives. – Shashish Chandra Jul 07 '14 at 09:26
  • @ShashishChandra Correct - the idea behind my question (perhaps I should have mentioned this) is to have many monitors that run their own evaluations on different mailboxes. Having some sort of setting with its own regex string per mailbox will allow me to look only for the codes relevant to the context of that mailbox. I'm not too familiar with NLP, but based on a quick glance, it looks a little too complex to justify if I don't mind producing false positives (since the codes are verified later in the workflow anyway). – Daniel Minnaar Jul 07 '14 at 09:36

4 Answers

2

Try this regex:

^.*(?:(MZ\d+CLA)|RE\.\s+(\d+MTX))$
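
As a rough sketch of how the two capture groups could be used to tell which alternative matched, assuming Python (the question does not name a language) and a hypothetical helper `classify` with made-up labels:

```python
import re

# The regex from this answer, used verbatim. Note that the trailing $ means
# the code has to sit at the very end of the subject.
SUBJECT_RE = re.compile(r"^.*(?:(MZ\d+CLA)|RE\.\s+(\d+MTX))$")


def classify(subject):
    """Return (label, code) for the alternative that matched, else None."""
    m = SUBJECT_RE.match(subject)
    if m is None:
        return None
    if m.group(1) is not None:
        # First alternative: MZ, digits, CLA.
        return ("MZ-CLA", m.group(1))
    # Second alternative: digits followed by MTX, after "RE."
    return ("MTX", m.group(2))
```

A subject where the code is not the last thing in the string returns `None` with this pattern, so it only suits subjects that end with the code.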


Stephan
1

The regex below matches only the first string, MZ5051CLA:

\bMZ\d+CLA\b 
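
A minimal sketch of applying this pattern to a subject, assuming Python (the question does not say which language is in use); `first_code` is just an illustrative name:

```python
import re

# MZ, one or more digits, CLA, delimited by word boundaries on both sides.
CODE_RE = re.compile(r"\bMZ\d+CLA\b")


def first_code(subject):
    """Return the first MZ<digits>CLA code in `subject`, or None if absent."""
    match = CODE_RE.search(subject)  # search scans the whole string
    return match.group(0) if match else None
```

Since `search` scans the whole string, the code can sit anywhere in the subject, and a subject without a code gives `None` rather than an error.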


But this would match both strings, MZ5051CLA and 11123MTX:

\b[A-Z0-9]+$

It matches a run of uppercase letters and digits at the end of a line.


This one matches either any run of uppercase letters and digits at the end of the line (which covers MZ5051CLA) or a number followed by MTX anywhere in the line:

(?:\b[A-Z0-9]+$|\b\d+MTX\b)


Avinash Raj
  • I think the question is not just about these two specific cases; it is more generic. Also, your second regex will match e.g. `11111` and `AAAA` as well. – Shashish Chandra Jul 07 '14 at 08:58
  • @ShashishChandra Added. Now it seems generic. – Avinash Raj Jul 07 '14 at 09:03
  • Yes, after your editing, it not only works fine for these two specific cases but it also works fine for numbers and uppercase letters. E.g. `A` or my age is `23`, you see. – Shashish Chandra Jul 07 '14 at 09:09
  • The expression (?:\b[A-Z0-9]+$|\b\d+MTX\b) will allow any alphanumeric prefix to be produced as a positive result, not specifically those starting with "MZ". This is why I need to evaluate them separately. – Daniel Minnaar Jul 07 '14 at 09:45
  • See http://regex101.com/r/eJ9fY2/4 . Now it looks only for the alphanumeric characters after `RE.` or for the one that starts with MZ. – Avinash Raj Jul 07 '14 at 09:48
  • @drminnaar - Then why not just use `[A-Z0-9]+` for your convenience? Even just `\w` can produce the alphanumeric characters. Now guess what: every capitalized word, e.g. MBA or CA, and even plain numbers will match, and you are willing to sort your desired results out of that? My friend, you are stepping in the wrong direction. This will complicate things and not help you the way you want. Just try it yourself. – Shashish Chandra Jul 07 '14 at 09:55
1

Both Codes in One Pattern

It seems that the codes must include at least one uppercase letter and at least one digit. For that kind of pattern, a password-validation technique is commonly used, and I would suggest:

\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]*[0-9][A-Z0-9]*

In the demo, see how only the correct groups are matched. Of course false positives are possible.
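
For illustration, a small sketch of running this pattern over a whole subject, assuming Python (the question does not name a language); `find_candidates` is a hypothetical helper name:

```python
import re

# The pattern from this answer: the lookahead requires at least one uppercase
# letter in the run, and the main body requires at least one digit.
CANDIDATE_RE = re.compile(r"\b(?=[A-Z0-9]*[A-Z])[A-Z0-9]*[0-9][A-Z0-9]*")


def find_candidates(subject):
    """Return every run of uppercase letters and digits containing both kinds."""
    return CANDIDATE_RE.findall(subject)
```

Ordinary words such as "Please" or "RE" are skipped because they contain no digit, but a token like `A4` in "Invoice A4 attached" would still be returned, which is the kind of false positive mentioned above.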


zx81
  • Appreciate the response, but after some consideration I realized that I need to evaluate codes individually. I need to know which regex produced the positive match. – Daniel Minnaar Jul 07 '14 at 09:42
0

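So, in that case, if you don't mind false positives, use: /^(?=.*[0-9])(?=.*[A-Z])([A-Z0-9]+)$/. Note that the ^ and $ anchors make it match only when the whole input is a single candidate code, so apply it to each word of the subject rather than to the full subject. Used that way, it will work well in general.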

Shashish Chandra