150

What is the best way to split a string like "HELLO there HOW are YOU" by upper-case words?

I'd like to end up with a list like this: results = ['HELLO there', 'HOW are', 'YOU']

I have tried:

p = re.compile("\b[A-Z]{2,}\b")
print p.split(page_text)

It doesn't seem to work, though.

mkrieger1
  • 6
    When you say something doesn't work, you should explain why. Do you get an exception? (If so, post the whole exception) Do you get the wrong output? – Gareth Latty Nov 03 '12 at 12:44

3 Answers

170

I suggest

l = re.compile(r"(?<!^)\s+(?=[A-Z])(?!.\s)").split(s)
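As a rough illustration of how this pattern behaves, here is a minimal sketch. The variable names and the second sample sentence are made up for the example; only the pattern itself comes from the answer. The final `(?!.\s)` is what keeps a one-letter word such as "I" from triggering a split.

```python
import re

# Pattern from the answer, written as a raw string so \s reaches the regex engine intact.
omega_pattern = re.compile(r"(?<!^)\s+(?=[A-Z])(?!.\s)")

# Sample string from the question.
omega_parts = omega_pattern.split("HELLO there HOW are YOU")
print(omega_parts)  # ['HELLO there', 'HOW are', 'YOU']

# Hypothetical second sample: the lone "I" is followed by whitespace,
# so the negative lookahead (?!.\s) rejects a split in front of it.
omega_single = omega_pattern.split("Then I went HOME")
print(omega_single)  # ['Then I went', 'HOME']
```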


Ωmega
  • 5
    What happens when you don't use compile? – Feelsbadman Jan 08 '19 at 09:35
  • 4
    Per the [re docs](https://docs.python.org/2/library/re.html), "*most regular expression operations are available as module-level functions and RegexObject methods. The functions are shortcuts that don’t require you to compile a regex object first, but miss some fine-tuning parameters.*" You can use `re.split(pattern, string, maxsplit=0, flags=0)` as described in those docs. – ZaydH Apr 23 '19 at 08:59
70

You could use a lookahead:

re.split(r'[ ](?=[A-Z]+\b)', input)

This splits at every space that is followed by a run of upper-case letters ending at a word boundary.

Note that the square brackets are only there for readability and could just as well be omitted.
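A short sketch of the lookahead split, assuming the sample string from the question (the variable names and the mixed-case sample are illustrative only). It also checks that dropping the square brackets gives the same result.

```python
import re

text = "HELLO there HOW are YOU"

# Split at a space only when an all-caps word follows; the lookahead
# consumes nothing, so the upper-case word stays in the next piece.
caps_parts = re.split(r"[ ](?=[A-Z]+\b)", text)
print(caps_parts)  # ['HELLO there', 'HOW are', 'YOU']

# Same pattern without the brackets around the space.
caps_parts_plain = re.split(r" (?=[A-Z]+\b)", text)

# A capitalised word such as "Hello" does not trigger a split, because
# [A-Z]+ stops after "H" and there is no word boundary at that point.
mixed_parts = re.split(r"[ ](?=[A-Z]+\b)", "HELLO there Hello HOW")
print(mixed_parts)  # ['HELLO there Hello', 'HOW']
```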

If it is enough for the first letter of a word to be upper case (that is, if you also want to split in front of Hello), it gets even simpler:

re.split(r'[ ](?=[A-Z])', input)

Now this splits at every space followed by any upper-case letter.
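For comparison, a minimal sketch of the simpler pattern on a made-up sample that contains a capitalised word. The sample string and variable name are assumptions for illustration.

```python
import re

# Hypothetical sample containing both all-caps and capitalised words.
sample = "HELLO there Hello world HOW are YOU"

# Any upper-case first letter is enough to trigger a split here.
first_letter_parts = re.split(r"[ ](?=[A-Z])", sample)
print(first_letter_parts)  # ['HELLO there', 'Hello world', 'HOW are', 'YOU']
```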

Martin Ender
  • 1
    How would I change `re.split(r'[ ](?=[A-Z]+\b)', input)` so that it doesn't split at single upper-case letters? E.g. so that it wouldn't match "A"? I tried `re.split(r'[ ](?=[A-Z]{2,}+\b)', input)`. Thanks! –  Nov 03 '12 at 12:51
  • @JamesEggers You mean that you want to require at least two upper-case letters, so that you do not split at words like `I`? `re.split(r'[ ](?=[A-Z]{2,}\b)', input)` should do it. – Martin Ender Nov 03 '12 at 12:55
  • 2
    I'd suggest at least `[ ]+` or maybe even `\W+` to catch slightly more cases. Still, a good answer. – georg Nov 03 '12 at 13:03
  • I tried the same approach. However, having a `[ ]` did not work for me. Instead, I used `\s`. The complete regexp that worked for me was `re.split("\s(?=[A-Z]+\s)", string)` – Jitendra May 25 '20 at 00:55
1

Your question contains the string literal "\b[A-Z]{2,}\b". Because it has no r prefix, Python interprets each \b as a backspace character rather than as a regex word boundary.

Try: r"\b[A-Z]{2,}\b".
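The difference can be checked directly. The sketch below uses the pattern and sample string from the question; the variable names are illustrative. Note that even with the raw string, splitting on the upper-case words themselves removes them from the result, so this fixes the matching but probably does not give the output the question asks for on its own.

```python
import re

text = "HELLO there HOW are YOU"

non_raw = "\b[A-Z]{2,}\b"   # \b becomes the backspace character, \x08
raw = r"\b[A-Z]{2,}\b"      # \b stays a backslash + "b": a word boundary

print(non_raw[0] == "\x08")    # True
print(len(non_raw), len(raw))  # 11 13

# The non-raw pattern looks for literal backspaces, so nothing is split.
non_raw_parts = re.split(non_raw, text)
print(non_raw_parts)  # ['HELLO there HOW are YOU']

# The raw pattern matches, but the matched words are consumed by the split.
raw_parts = re.split(raw, text)
print(raw_parts)  # ['', ' there ', ' are ', '']
```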

chb
druid62