How to separate an irregularly cased string to get the words? - Python

Question

I have the following word list.

Not all of my words are delimited by capital letters. The list also contains words such as 'USA', and I am not sure how to handle them: 'USA' should stay as one word and must not be separated.

myList=[u'USA',u'Chancellor', u'currentRank', u'geolocDepartment', u'populationUrban', u'apparentMagnitude', u'Train', u'artery',
       u'education', u'rightChild', u'fuel', u'Synagogue', u'Abbey', u'ResearchProject', u'languageFamily', u'building',
       u'SnookerPlayer', u'productionCompany', u'sibling', u'oclc', u'notableStudent', u'totalCargo', u'Ambassador', u'copilote',
       u'codeBook', u'VoiceActor', u'NuclearPowerStation', u'ChessPlayer', u'runwayLength', u'horseRidingDiscipline']

How can I edit the elements in the list?
I would like to change the elements in the list as shown below:

 updatemyList=[u'USA',u'Chancellor', u'current Rank', u'geoloc Department', u'population Urban', u'apparent Magnitude', u'Train', u'artery',
           u'education', u'right Child', u'fuel', u'Synagogue', u'Abbey', u'Research Project', u'language Family', u'building',
           u'Snooker Player', u'production Company', u'sibling', u'oclc', u'notable Student', u'total Cargo', u'Ambassador', u'copilote',
           u'code Book', u'Voice Actor', u'Nuclear Power Station', u'Chess Player', u'runway Length',  u'horse Riding Discipline']

That is, each element should be separated into its words.

The word "u'managerYearsEndYear'" is missing from the second list. Oversight? — Ukimiku, Oct 24 '16 at 09:12
Once again, nothing to do with `nltk` ;P But out of curiosity, for your list, are all words delimited by a capital letter? What happens when you have `u'USA'`? Should the output be `u' U S A'` or `u'USA'`? — alvas, Oct 24 '16 at 09:13
Also, see http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-snake-case — alvas, Oct 24 '16 at 09:14
And http://stackoverflow.com/questions/29916065/how-to-do-camelcase-split-in-python — alvas, Oct 24 '16 at 09:15
@alvas, you are right. Not all of my words are delimited by capital letters. The word list contains words such as 'USA', and I am not sure how to handle them. 'USA' should stay as one word and cannot be separated. — bob90937, Oct 24 '16 at 09:17
Then it's not an easy problem and it's not `camel-case`. You would need a character language model, see https://github.com/karpathy/char-rnn, have fun! — alvas, Oct 24 '16 at 09:20

score 0 · Accepted Answer · edited May 23 '17 at 12:33


You could use re.sub

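import re

# Any character followed by a capitalised word (one capital, then lower-case letters)
first_cap_re = re.compile(r'(.)([A-Z][a-z]+)')
# A lower-case letter or digit followed directly by a capital
all_cap_re = re.compile(r'([a-z0-9])([A-Z])')


def convert(word):
    # Insert a space at each boundary found by the two patterns, in turn
    s1 = first_cap_re.sub(r'\1 \2', word)
    return all_cap_re.sub(r'\1 \2', s1)


updated_words = [convert(word) for word in myList]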

Adapted from: Elegant Python function to convert CamelCase to snake_case?
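As a quick check of how the two patterns behave on the asker's input, the sketch below reruns the same `convert` function on a handful of words. The sample list is a subset picked for illustration, plus one mixed acronym case that is not in the question.

```python
import re

first_cap_re = re.compile(r'(.)([A-Z][a-z]+)')
all_cap_re = re.compile(r'([a-z0-9])([A-Z])')


def convert(word):
    s1 = first_cap_re.sub(r'\1 \2', word)
    return all_cap_re.sub(r'\1 \2', s1)


# Illustrative subset of the question's list, plus 'USAToday' (an assumption)
sample = ['USA', 'oclc', 'currentRank', 'NuclearPowerStation', 'USAToday']
for word in sample:
    print(repr(word), '->', repr(convert(word)))
# 'USA' -> 'USA'
# 'oclc' -> 'oclc'
# 'currentRank' -> 'current Rank'
# 'NuclearPowerStation' -> 'Nuclear Power Station'
# 'USAToday' -> 'USA Today'
```

An all-caps word such as 'USA' is left alone because both patterns need a lower-case letter or digit somewhere in the match. A word like 'USAtoday' still comes out as 'US Atoday', since nothing in the string marks where the acronym ends.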


answered Oct 24 '16 at 09:11

Jack Evans


Sven Marnach · Answer 2 · 2016-10-24T09:21:26.100

0

You can use a regular expression to prepend each upper-case letter that's not at the beginning of a word with a space:

re.sub(r"(?!\b)(?=[A-Z])", " ", your_string)

The bit in the first pair of parens means "not at the beginning of a word", and the bit in the second pair of parens means "followed by an uppercase letter". The regular expression matches the empty string at places where these two conditions hold, and replaces the empty string with a space, i.e. it inserts a space at these positions.
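A short check of what this pattern does in practice may help. The helper name `split_on_caps` and the sample words are only for illustration. It also shows why a run of capitals gets a space before every letter except the first.

```python
import re


def split_on_caps(text):
    # Insert a space at every position that is not a word boundary
    # and is immediately followed by an upper-case letter.
    return re.sub(r"(?!\b)(?=[A-Z])", " ", text)


for word in ["currentRank", "Chancellor", "NuclearPowerStation", "USA"]:
    print(repr(split_on_caps(word)))
# 'current Rank'
# 'Chancellor'
# 'Nuclear Power Station'
# 'U S A'
```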

edited Oct 24 '16 at 09:21

answered Oct 24 '16 at 09:12

Sven Marnach


It works for some elements. However, when I pass 'USA' the result is ' U S A', which is not what I want – bob90937 Oct 24 '16 at 09:53
Then you will have to specify how exactly the words should be split up. What should happen to `USAToday` and `USAtoday`, and how should a computer detect that? – Sven Marnach Oct 24 '16 at 10:56

score 0 · Answer 3 · answered Oct 24 '16 at 09:19


You could do this using a regex, but it is easier to comprehend as a small algorithm (ignoring corner cases such as abbreviations, e.g. NLTK):

def split_camel_case(string):
    new_words = []
    current_word = ""
    for char in string:
        if char.isupper() and current_word:
            new_words.append(current_word)
            current_word = ""
        current_word += char
    return " ".join(new_words + [current_word])


old_words = ["HelloWorld", "MontyPython"]
new_words = [split_camel_case(string) for string in old_words]
print(new_words)


Roger Thomas


With old_words = [u'Telecommunicationsfirms', u'KKKKKKKKKK', u'tattoo', u'EducationalInstitution'], however, the result is [u'Telecommunicationsfirms', u'K K K K K K K K K K', u'tattoo', u'Educational Institution'] – bob90937 Oct 24 '16 at 09:45
@bob90937 splitting 'Telecommunicationsfirms' into 'Telecommunications firms' is beyond the scope of the original question. – Roger Thomas Oct 24 '16 at 09:57

score 0 · Answer 4 · answered Oct 24 '16 at 09:25

The following code snippet separates the words as you want:

myList=[u'Chancellor', u'currentRank', u'geolocDepartment', u'populationUrban', u'apparentMagnitude', u'Train', u'artery', u'education', u'rightChild', u'fuel', u'Synagogue', u'Abbey', u'ResearchProject', u'languageFamily', u'building', u'SnookerPlayer', u'productionCompany', u'sibling', u'oclc', u'notableStudent', u'totalCargo', u'Ambassador', u'copilote', u'codeBook', u'VoiceActor', u'NuclearPowerStation', u'ChessPlayer', u'runwayLength', u'managerYearsEndYear', 'horseRidingDiscipline']

updatemyList = []


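for word in myList:
    # Start with the first letter so that no leading space is added
    phrase = word[0]

    for letter in word[1:]:
        # Put a space in front of every later upper-case letter
        if letter.isupper():
            phrase += " "
        phrase += letter

    updatemyList.append(phrase)

print(updatemyList)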

It works for some elements. However, with old_words = [u'Telecommunicationsfirms', u'KKKKKKKKKK', u'tattoo', u'EducationalInstitution'], the result is [u'Telecommunicationsfirms', u'K K K K K K K K K K', u'tattoo', u'Educational Institution'] — bob90937, Oct 24 '16 at 09:47
To quote Roger Thomas above, "@bob90937 to split 'Telecommunicationsfirms' into 'Telecommunications firms' is beyond the scope of the original question" — Ukimiku, Oct 24 '16 at 10:15

David Andrei Ned · Answer 5 · 2016-10-24T10:15:56.100

Can you simply check whether all letters in a word are capitals and, if so, leave the word alone, i.e. count it as a single word?

I've used similar code in the past. It looks a bit hard-coded, but it does the job (in my case I wanted to capture abbreviations up to 4 letters long):

def CapsSumsAbbv(words):
    # Collect the words that contain four consecutive upper-case characters.
    allcaps = []
    for word in words:
        for i in range(len(word) - 3):
            chunk = word[i:i + 4]
            # chunk == chunk.upper() also holds for digits, so skip pure numbers
            if chunk == chunk.upper() and not word.isdigit():
                if word not in allcaps:
                    allcaps.append(word)
                break
    return allcaps

To expand on this: if you have entries such as u'USAMilitarySpending', you can adapt the above code so that when a run of more than two capital letters is followed by lower-case letters, the space is added between the second-to-last and the last capital letter of the run, so the entry becomes u'USA Military Spending'.
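One way to express that rule without hard-coding the abbreviation length is a pair of regex lookarounds. This is a minimal sketch, not the code the answer refers to, and the function name `split_words` is made up for illustration. It splits where a lower-case letter or digit is followed by a capital, and where a capital is followed by a capital plus a lower-case letter, so the last capital of a run starts the next word.

```python
import re

# Boundary 1: lower-case letter or digit, then a capital ("t|R" in currentRank)
# Boundary 2: capital, then capital + lower-case letter ("A|Mi" in USAMilitary)
_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_words(word):
    # Both alternatives are zero-width, so sub() only inserts spaces
    return _BOUNDARY.sub(" ", word)


for word in ["USA", "KKKKKKKKKK", "horseRidingDiscipline", "USAMilitarySpending"]:
    print(repr(split_words(word)))
# 'USA'
# 'KKKKKKKKKK'
# 'horse Riding Discipline'
# 'USA Military Spending'
```

Words written entirely in capitals pass through unchanged, which covers the all-caps check as well. The rule is still a guess for inputs like 'USAtoday' (it yields 'US Atoday'), and a word with no case change, such as 'Telecommunicationsfirms', is returned as is.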
