57

I have technical strings like the following:

"The thing P1 must connect to the J236 thing in the Foo position."

I would like to use a regular expression to match the words that are entirely uppercase (here, P1 and J236). The problem is that I don't want to match the first letter of the sentence when it is a one-letter word.

Example, in:

"A thing P1 must connect ..." 

I want P1 only, not A and P1. By doing that, I know that I can miss a real "word" (like in "X must connect to Y") but I can live with it.

Additionally, I don't want to match uppercase words if the sentence is all uppercase.

Example:

"THING P1 MUST CONNECT TO X2."

Of course, ideally, I would like to match the technical words P1 and X2 here but since they are "hidden" in the all-uppercase sentence and since these technical words have no specific pattern, it's impossible. Again I can live with it because all-uppercase sentences are not so frequent in my files.

Thanks!

Pavlo Zhukov
Patrick
  • 1
    Do all of the technical terms contain numbers? – Jay Jan 04 '11 at 20:56
  • 6
    Whatever you do, don’t use 7-bit literals like `[A-Z]`. That’s very RADIX-50, and has no place in code written over the last few decades. Use something that works on any text. Minimally that means using something related to `\w` or `[[:alpha:]]` or `\pL` or `\p{Alphabetic}`, depending on your regex language and environment. In fact, implementations vary so much that some of those may be legal and right on some platforms but legal and wrong on others. – tchrist Jan 04 '11 at 22:01

6 Answers

91

To some extent, this is going to vary by the "flavour" of RegEx you're using. The following is based on .NET RegEx, which uses \b for word boundaries. The last example also uses negative lookarounds, (?<!...) and (?!...), as well as non-capturing parentheses, (?:...).

Basically, though, if the terms always contain at least one uppercase letter followed by at least one number, you can use

\b[A-Z]+[0-9]+\b

For all-uppercase and numbers (total must be 2 or more):

\b[A-Z0-9]{2,}\b

For all-uppercase and numbers, but starting with at least one letter:

\b[A-Z][A-Z0-9]+\b
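As a quick check, the three patterns above can be compared side by side. This is only a sketch using Python's `re` module (an assumption, since the answer targets .NET), which treats `\b`, `[A-Z]` and `{2,}` the same way for these simple patterns. The names `PATTERNS`, `SAMPLES` and `find_all` are illustrative.

```python
import re

# The three simple patterns from the answer above.
PATTERNS = {
    "letters then digits": r"\b[A-Z]+[0-9]+\b",
    "two or more caps/digits": r"\b[A-Z0-9]{2,}\b",
    "letter first": r"\b[A-Z][A-Z0-9]+\b",
}

# Sample sentences taken from the question and its comments.
SAMPLES = [
    "The thing P1 must connect to the J236 thing in the Foo position.",
    "A thing P1 must connect to XYZ",
    "THING P1 MUST CONNECT TO X2.",
]


def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample)
        for label, pattern in PATTERNS.items():
            print(f"  {label}: {find_all(pattern, sample)}")
```

None of these three patterns skips an all-uppercase sentence: on the third sample the last two match every word. That case is what the longer regex tries to handle.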

The granddaddy, to return items that have any combination of uppercase letters and numbers, but which are not single letters at the beginning of a line and which are not part of a line that is all uppercase:

(?:(?<!^)[A-Z]\b|(?<!^[A-Z0-9 ]*)\b[A-Z0-9]+\b(?![A-Z0-9 ]$))

breakdown:

The regex starts with (?:. The ?: signifies that, although what follows is in parentheses, I'm not interested in capturing the result. This is called "non-capturing parentheses." Here, I'm using the parentheses because I'm using alternation (see below).

Inside the non-capturing parens, I have two separate clauses separated by the pipe symbol |. This is alternation -- like an "or". The regex can match the first expression or the second. The two cases here are "is this the first word of the line" or "everything else," because we have the special requirement of excluding one-letter words at the beginning of the line.

Now, let's look at each expression in the alternation.

The first expression is: (?<!^)[A-Z]\b. The main clause here is [A-Z]\b, which is any one capital letter followed by a word boundary, which could be punctuation, whitespace, linebreak, etc. The part before that is (?<!^), which is a "negative lookbehind." This is a zero-width assertion, which means it doesn't "consume" characters as part of a match -- not really important to understand that here. The syntax for negative lookbehind in .NET is (?<!x), where x is the expression that must not exist before our main clause. Here that expression is simply ^, or start-of-line, so this side of the alternation translates as "any word consisting of a single, uppercase letter that is not at the beginning of the line."

Okay, so we're matching one-letter, uppercase words that are not at the beginning of the line. We still need to match words consisting of all numbers and uppercase letters.

That is handled by a relatively small portion of the second expression in the alternation: \b[A-Z0-9]+\b. The \bs represent word boundaries, and the [A-Z0-9]+ matches one or more numbers and capital letters together.

The rest of the expression consists of other lookarounds. (?<!^[A-Z0-9 ]*) is another negative lookbehind, where the expression is ^[A-Z0-9 ]*. This means that the text from the start of the line up to this point must not consist solely of capital letters, numbers, and spaces.

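The second lookaround is (?![A-Z0-9 ]$), which is a negative lookahead. It is meant to ensure that what follows is not all capital letters and numbers. Note that, as written, the character class has no quantifier, so it only rejects a match that is followed by exactly one such character and then the end of the line.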

So, altogether, we are capturing words of all capital letters and numbers, and excluding one-letter, uppercase characters from the start of the line and everything from lines that are all uppercase.

There is at least one weakness here in that the lookarounds in the second alternation expression act independently, so a sentence like "A P1 should connect to the J9" will match J9, but not P1, because everything before P1 is capitalized.

It is possible to get around this issue, but it would almost triple the length of the regex. Trying to do so much in a single regex is seldom, if ever, justified. You'll be better off breaking up the work either into multiple regexes or a combination of regex and standard string processing commands in your programming language of choice.
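The split suggested above might look like the following. It is a minimal sketch in Python (an assumption, since the answer is written against .NET), and the function name `find_terms` is hypothetical. It rejects all-uppercase lines with plain string checks, collects the uppercase/digit words with one small regex, and drops a single-letter word at the start of the line.

```python
import re

# A word made only of uppercase ASCII letters and digits.
TERM = re.compile(r"\b[A-Z0-9]+\b")


def find_terms(line):
    """Return uppercase/digit words in line, applying the question's two exclusions."""
    # Exclusion 1: a line that has letters but no lowercase letter is treated
    # as an all-uppercase sentence, so nothing in it is reported.
    letters = [ch for ch in line if ch.isalpha()]
    if letters and not any(ch.islower() for ch in letters):
        return []

    terms = []
    for match in TERM.finditer(line):
        word = match.group()
        # Exclusion 2: skip a one-letter word that starts the line
        # (only whitespace, if anything, precedes it).
        starts_line = not line[:match.start()].strip()
        if len(word) == 1 and word.isalpha() and starts_line:
            continue
        terms.append(word)
    return terms
```

With this approach, "A P1 should connect to the J9" yields both P1 and J9, because the all-uppercase test looks at the whole line rather than at the text on either side of each word.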

Jay
  • Thanks! My case would be "all-uppercase and number". The problem with the solution you propose is that it will match the A in "A thing P1 connect to XYZ". – Patrick Jan 04 '11 at 21:15
  • @Patrick These don't match A. The first three require two or more characters, and the last one requires only one or more, but it can't be at the beginning. – Jay Jan 04 '11 at 21:17
  • Sorry Jay, I didn't see the granddaddy part in your first post. Unfortunately, when I try it with preg_replace, it returns message: "Warning: preg_replace(): Compilation failed: lookbehind assertion is not fixed length at offset 32 in Command line code on line 1" – Patrick Jan 05 '11 at 13:47
  • @Patrick Like I said, it will vary by RegEx flavour, and I didn't know what you were using. Not every type of RegEx uses the same symbols, and not every type supports the same features. The examples given are based on .NET Regex; sorry that it isn't working for you. – Jay Jan 05 '11 at 16:53
  • Jay, could I ask you to explain the different parts of your "granddaddy", I'm trying to understand it to adapt it to my PCRE flavor. Thanks again! – Patrick Jan 06 '11 at 16:02
  • My God, if I could vote, you would be number 1! Thank you very much Jay, it's very appreciated. – Patrick Jan 06 '11 at 22:32
  • But Jay, if I understand well (I cannot test it at this point as I said before), why is A in "A thing that connects P to X." not matched? 'A' will not match the first clause but will match the second and since it's an OR and not an AND, it will be sufficient. Am I missing something? – Patrick Jan 07 '11 at 16:26
  • @Patrick The second clause starts with the negative lookbehind `(?<!^[A-Z0-9]*)` The `*` means zero-or-more, as opposed to `+`, which is used elsewhere and means one-or-more. The start-of-line is denoted by `^`, so if what precedes is only the start-of-line and ZERO-or-more capital letters or numbers, the match fails. – Jay Jan 07 '11 at 16:38
  • @Jay: First, thanks again for the explanations! But I'm not sure I understand it. Let's simplify the problem, forget the exclusion of the all-uppercase sentences and the non-capturing parentheses. Your regex would be: (?<!^)[A-Z]\b|\b[A-Z0-9]+\b. The PHP command preg_match_all on "A thing that connects P to X." will return A,P and X as matches, won't it? In my understanding, 'A' will be a match because it matches the second clause of the OR. – Patrick Jan 07 '11 at 21:53
  • @Patrick That is correct. It is that negative lookbehind in my second clause that causes `A` not to match. If you weren't concerned about all-caps sentences, the second clause could be reduced to `(?:^[A-Z0-9]{2,}|(?<!^)\b[A-Z0-9]+)\b`. Here, the first part of the alternation matches at the beginning of the line, and the name must be 2 or more caps or nums. The second part of the alternation matches everywhere NOT at the beginning of the line. – Jay Jan 07 '11 at 22:18
  • @Jay: Let's simplify further. Suppose I want to match only single-letter words, again excluding those that start a string. At first, I would put: (?<!^)[A-Z]\b. Of course the problem with this is that 'C' and 'X' will be matches in "A thing that connects PC and X". I would want to express something like "a letter that is not preceded by the beginning of the string *nor* by a letter", but it seems to be impossible to write a regular expression in a lookbehind; something like (?<![A-Z^])[A-Z]\b does not work. Is there a way to do that? – Patrick Jan 07 '11 at 22:22
  • Ooooh, ok I see now. Thank you so much Jay, your help was very appreciated! – Patrick Jan 07 '11 at 22:31
  • +1 just for the work you have put into answering this question, regex is already hard to read but your 4th one is just mind blowing – Rand Random May 12 '15 at 12:07
  • If this does not work, you might be using a regex dialect in which you should _1._ use `\(` and `\)` instead of `(?:` and `)`, and _2._ use `\<` for the beginning of a word and `\>` for the end of a word instead of `\b` for a word boundary. – Dirk Horsten Sep 25 '17 at 07:46
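
For engines that reject variable-length lookbehind (the PCRE error quoted in the comments above), the simplified pattern from this thread should still work, because `(?<!^)` has a fixed width of zero. The sketch below uses Python's `re` module, which has a similar fixed-width restriction. The choice of Python is an assumption. So is the extra `\b` used for the "PC and X" case, which is one possible fix and not necessarily the one the commenters had in mind.

```python
import re

# Simplified pattern from the comments: two or more caps/digits at the start
# of the line, or one or more anywhere else. (?<!^) is zero-width, so the
# lookbehind is fixed-length.
SIMPLIFIED = re.compile(r"(?:^[A-Z0-9]{2,}|(?<!^)\b[A-Z0-9]+)\b")

# Single uppercase letters that are whole words and do not start the string.
# The \b before [A-Z] keeps the C of "PC" from matching.
SINGLE_LETTER = re.compile(r"(?<!^)\b[A-Z]\b")


def variable_lookbehind_supported():
    """Report whether this engine accepts the answer's variable-length lookbehind."""
    try:
        re.compile(r"(?<!^[A-Z0-9 ]*)\b[A-Z0-9]+\b")
    except re.error:
        return False
    return True


if __name__ == "__main__":
    print(SIMPLIFIED.findall("A thing that connects P to X."))
    print(SINGLE_LETTER.findall("A thing that connects PC and X"))
    print(variable_lookbehind_supported())
```

Neither pattern handles the all-uppercase sentence; in a fixed-width engine that check is easier to do as a separate step.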
7

Maybe you can run this regex first to see if the line is all caps:

^[A-Z \d\W]+$

That will match only if it's a line like THING P1 MUST CONNECT TO X2.

Otherwise, you should be able to pull out the individual uppercase phrases with this:

[A-Z][A-Z\d]+

That should match "P1" and "J236" in The thing P1 must connect to the J236 thing in the Foo position.

Upgradingdave
  • on the all caps check, I think space is in \W, then adding _ and assuming no further check is necessary on an empty string, it could be generalized to `/^[A-Z\d\W_]*$/` –  Jan 04 '11 at 23:51
6

Don't use things like [A-Z] or [0-9]; use \p{Lu} and \d instead. Of course, this only applies to Perl-based regex flavours, which include Java.

I would suggest that you don't make some huge regex. First split the text into sentences, then tokenize it (split it into words). Use a regex to check each token/word. Skip the first token of each sentence. Check beforehand whether all tokens are uppercase and, if so, skip the whole sentence or alter the regex for that case.
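
The procedure described above (split into sentences, tokenize, skip the first token, skip all-uppercase sentences) could be sketched as follows. Python is an assumption here, and since its standard `re` module has no `\p{Lu}`, the sketch relies on `str.isupper()` and `str.isalnum()`, which are Unicode-aware. The sentence splitter is deliberately naive and the function name `technical_tokens` is hypothetical.

```python
import re


def technical_tokens(text):
    """Collect all-uppercase tokens, sentence by sentence."""
    found = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Skip a sentence that contains no lowercase letter at all.
        if not any(ch.islower() for ch in sentence):
            continue
        # Tokenize on whitespace and trim surrounding punctuation.
        tokens = [t.strip(".,;:!?\"'()") for t in sentence.split()]
        tokens = [t for t in tokens if t]
        # Skip the first token, then keep alphanumeric tokens whose
        # letters are all uppercase.
        for token in tokens[1:]:
            if token.isalnum() and token.isupper():
                found.append(token)
    return found
```

Skipping the first token wholesale also loses a real term that opens a sentence (for example "P1 connects to X2."), which is the same kind of trade-off the question already accepts.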

Radu Simionescu
5

Why do you need to do this in one monster-regex? You can use actual code to implement some of these rules, and doing so would be much easier to modify if those requirements change later.

For example:

if (/^[A-Z0-9\s]*$/) {
    # Sentence is all uppercase, so just fail out
    return 0;
}

# Carry on with matching uppercase terms
Anon.
  • Actually, I have a set of regexes that are contained in a mySQL table and my php code executes all these preg_replace() in sequence. That's why I didn't want to add complexity by adding if's. Of course, if it's impossible to do otherwise, I will maybe change my mind... – Patrick Jan 04 '11 at 21:13
  • 1
    Good question. The logic is stored in a database because ultimately, it is the user's responsibility to enter (via a webform) the regexes that will be applied to a specific text. My program loops over these regexes and returns the matches. – Patrick Jan 05 '11 at 12:58
3

I'm not a regex guru by any means. But try:

<[A-Z0-9][A-Z0-9]+>

<           start of word
[A-Z0-9]    one character
[A-Z0-9]+   and one or more of them
>           end of word

I won't try for the bonus points of the whole upper case sentence. hehe

Craig Celeste
2

For the first case you propose you can use: '[[:blank:]]+[A-Z0-9]+[[:blank:]]+', for example:

echo "The thing P1 must connect to the J236 thing in the Foo position" | grep -oE '[[:blank:]]+[A-Z0-9]+[[:blank:]]+'

In the second case maybe you need to use something else and not a regex, maybe a script with a dictionary of technical words...

Cheers, Fernando

Fernando
  • I'm upvoting this because of the idea to use a dictionary of technical terms. Since the OP already identified in other comments that a database is available, it seems to make much more sense to find the interesting terms using that sort of information, rather than an attempt to recognize them based on an imperfect convention. – Zac Thompson Jan 04 '11 at 22:20
  • Well, it's true that a database is available but I also mentioned that there is no specific pattern for the technical words. – Patrick Jan 05 '11 at 12:46