I came to this question by its title and was disappointed that the solution wasn't what I expected.
The accepted answer only works for titles of exactly two words.
This code returns all of the tokens that are in title case, without assuming anything about the number of words in each title:
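import collections
import re

def title_case_to_token(c):
    # Wrap a matched run in <...>, keeping the leading separator and the two trailing characters.
    totoken = lambda s: s[0] + "<" + s[1:-2].replace(" ", "_") + ">" + s[-2:]
    # Pad the text so runs at the very start or end can match, then strip the padding again.
    tokenized = re.sub(r"([\s.,;]([A-Z][a-z]+[\s.,;])+[^A-Z])", lambda m: totoken(m.group(0)), " " + c + " x")[1:-2]
    tokens = collections.Counter(re.findall(r"<\w+>", tokenized))
    return (tokens, tokenized)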
For example:
text='Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'
tokens, tokenized = title_case_to_token(text)
Value of tokens:
Counter({'<Hi>': 1, '<Moh_Shai>': 1, '<This_Is>': 1, '<Python_Code>': 1, '<Regex>': 1, '<Needs_Some_Expertise>': 1})
Note that Needs_Some_Expertise is also caught by this regex, even though it has three words.
Value of tokenized:
<Hi> my name is <Moh_Shai> and <This_Is> a <Python_Code> with <Regex> and <Needs_Some_Expertise>
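
The padding with " " and " x" is needed because the pattern consumes the characters on both sides of each run. If you only need the counts, here is a minimal sketch of a variant that uses word boundaries instead. It is not the function above: TITLE_RUN and title_runs are names I made up for illustration, and it assumes the words of a run are separated by single spaces only (the pattern above also accepts '.', ',' and ';' as separators).

```python
import collections
import re

# Hypothetical pattern: one capitalized word, followed by any number of
# further capitalized words, each separated by exactly one space.
TITLE_RUN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b")

def title_runs(text):
    """Count title-case runs, joining the words of each run with underscores."""
    return collections.Counter(
        m.group(0).replace(" ", "_") for m in TITLE_RUN.finditer(text)
    )

text = 'Hi my name is Moh Shai and This Is a Python Code with Regex and Needs Some Expertise'
print(list(title_runs(text)))
# ['Hi', 'Moh_Shai', 'This_Is', 'Python_Code', 'Regex', 'Needs_Some_Expertise']
```

Since \b does not consume any characters, no padding is needed, and a run followed by an all-caps word (for example "Moh Shai USA") should still be found as Moh_Shai.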