
I need to find a combination of 2 consecutive title case words.

This is my code so far:

import re

text='Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'

rex=r'[A-Z][a-z]+\s+[A-Z][a-z]+'

re.findall(rex,text)

This gives me:

['Moh Shai', 'This Is', 'Python Code', 'Needs Some']

However, I need all the combinations, including the overlapping ones. Something like:

['Moh Shai', 'This Is', 'Python Code', 'Needs Some','Some Expertise']

Can someone please help?

Md. Mohsin
  • Does [this](http://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches) help? – TigerhawkT3 Apr 19 '16 at 23:34
  • If you can install a third-party module, the easiest way is with the [regex module](https://pypi.python.org/pypi/regex), which supports an `overlapped=True` flag on `findall()`. – kindall Apr 19 '16 at 23:39
  • @kindall you are awesome. That works great! Can you please post an answer so I may accept? – Md. Mohsin Apr 19 '16 at 23:41
  • Please see: http://stackoverflow.com/questions/5616822/python-regex-find-all-overlapping-matches – user3516726 Apr 19 '16 at 23:49

3 Answers

You can use a regex lookahead in combination with the re.finditer function in order to get the desired outcome:

import re

text='Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'
rex=r'(?=([A-Z][a-z]+\s+[A-Z][a-z]+))'

matches = re.finditer(rex,text)
results = [match.group(1) for match in matches]

Now results will contain the information you need:

>>> results
['Moh Shai', 'This Is', 'Python Code', 'Needs Some', 'Some Expertise']

edit: For what it's worth, you don't even really need the finditer function. You can replace those bottom two lines with your previous line re.findall(rex,text) for the same effect.
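
A minimal sketch of that shorter variant, reusing the same text and rex as above. It works because the pattern contains exactly one capturing group, so re.findall returns that group's contents for each match, and because the lookahead itself consumes no characters, the scan resumes one character later and can pick up a pair that overlaps the previous one.

```python
import re

text = 'Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'

# The lookahead (?=...) is zero-width; the inner group captures the word pair.
rex = r'(?=([A-Z][a-z]+\s+[A-Z][a-z]+))'

# With a single capturing group, findall returns a list of that group's text.
results = re.findall(rex, text)
print(results)
# ['Moh Shai', 'This Is', 'Python Code', 'Needs Some', 'Some Expertise']
```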

Right Of Zen

I came to this question by its title and was disappointed that the solution wasn't what I expected.

The accepted answer only works for titles of exactly 2 words.

This code returns all of the tokens that are in title case, without assuming anything about the number of words in the title:

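import re, collections

def title_case_to_token(c):
    # Wrap a matched run of title-case words in <...>, joining the words with underscores.
    totoken = lambda s: s[0] + "<" + s[1:-2].replace(" ","_") + ">" + s[-2:]
    # Pad the text so runs at the very start or end can match, then strip the padding again.
    tokenized = re.sub(r"([\s\.\,;]([A-Z][a-z]+[\s\.\,;])+[^A-Z])", lambda m: totoken(m.group(0))," " + c + " x")[1:-2]
    tokens = collections.Counter(re.compile(r"<\w+>").findall(tokenized))
    return (tokens, tokenized)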

For example

text='Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'
tokens, tokenized = title_case_to_token(text)

Value of tokens

Counter({'<Hi>': 1, '<Moh_Shai>': 1, '<This_Is>': 1, '<Python_Code>': 1, '<Regex>': 1, '<Needs_Some_Expertise>': 1})

Note that Needs_Some_Expertise is also caught by this regex, even though it has 3 words.

Value of tokenized

<Hi> my name is <Moh_Shai> and <This_Is> a <Python_Code> with <Regex> and <Needs_Some_Expertise>
Uri Goren

If you can install a third-party module, the easiest way is with the regex module, which supports an overlapped=True flag on findall().
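
A rough sketch of what that might look like, using the original pattern from the question unchanged. The helper name overlapping_pairs is hypothetical, and the stdlib fallback for when the regex module is not installed is an assumption added for this sketch, not part of the suggestion itself.

```python
import re

text = 'Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'

def overlapping_pairs(text):
    """Hypothetical helper: return all pairs of consecutive title case words."""
    pattern = r'[A-Z][a-z]+\s+[A-Z][a-z]+'
    try:
        import regex  # third-party module: pip install regex
    except ImportError:
        # Assumed fallback for this sketch: emulate overlap with a stdlib lookahead.
        return re.findall('(?=(' + pattern + '))', text)
    # overlapped=True restarts the search one character after each match start.
    return regex.findall(pattern, text, overlapped=True)

print(overlapping_pairs(text))
```

The pattern stays exactly as written in the question; only the call changes, which is the main appeal over wrapping the pattern in a lookahead.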

kindall