How to extract meaningful words from messy strings?

Question

Consider strings like the following. How can I extract any meaningful words?

ctl00_PD_lblProductTitle ---> 'Product Title' (ctl00_PD_lbl is not a complete word, ignore)

prod-detail-info ---> 'Detail Info' (prod is not a complete word, ignore)

prodprice price-onsale ---> 'price on sale' (prod is not a complete word, ignore)

rating-score ---> 'Rating Score'

RatingScore ---> 'Rating Score'

I'm curious to know what this technique or process is called, and whether there are any libraries for it. Could regex be robust enough?

Example two: "prod-detail-info" ---> "Rod Detail Info". I guess you have to define the problem and expected result in more detail. — , Jan 05 '14 at 04:55

Justin O Barber · Accepted Answer · 2014-01-05T12:26:19.393

In short, the examples you provide cannot be assessed in the way you want them to be unless you skew your decision tree or classifier to your particular data. (Consider, for example, numbers 2 and 3. Prod is a word in any English dictionary, but info won't necessarily show up in most English dictionaries.)

If you train your classifier or decision tree on these specific data, only then will you get the result you want. But generally speaking, you might try tokenizing your text to begin with (as @user2314737 has suggested):

>>> import nltk
>>> t = '''ctl00_PD_lblProductTitle

prod-detail-info

prodprice price-onsale

rating-score

RatingScore'''
>>> nltk.tokenize.wordpunct_tokenize(t)
['ctl00_PD_lblProductTitle', 'prod', '-', 'detail', '-', 'info', 'prodprice', 'price', '-', 'onsale', 'rating', '-', 'score', 'RatingScore']
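
If NLTK is not at hand, a similar first pass can be sketched with the standard library alone. The helper name split_identifier below is hypothetical, and the sketch assumes ASCII identifiers: it treats digits and punctuation as separators, then breaks camelCase at each lower-to-upper boundary.

```python
import re

def split_identifier(identifier):
    """Split an ASCII identifier into candidate words (illustrative helper)."""
    # Treat every run of non-letters (digits, '_', '-', spaces) as a separator.
    letters_only = re.sub(r'[^A-Za-z]+', ' ', identifier)
    # Break camelCase: insert a space between a lowercase and an uppercase letter.
    spaced = re.sub(r'(?<=[a-z])(?=[A-Z])', ' ', letters_only)
    return spaced.split()

for s in ['ctl00_PD_lblProductTitle', 'prod-detail-info', 'RatingScore']:
    print(s, '->', split_identifier(s))
```

Fragments such as ctl, PD, lbl and prod still come through; deciding which pieces count as words is a separate step.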

Then you might be able to find further possible words with regular expressions such as this:

>>> import re
>>> re.findall(r'[A-Z][a-z]{2,}', 'ctl00_PD_lblProductTitle')  # also works for 'RatingScore'
['Product', 'Title']

This regular expression finds every sequence that begins with an upper-case ASCII letter followed by 2 or more lower-case ASCII letters. Regarding your comment: no, this regex will not work for Unicode text in general, because [A-Z] and [a-z] cover only ASCII. Unfortunately, Python's built-in re module does not presently offer character classes that distinguish upper- from lower-case letters across Unicode. In this case, you might need something like this (which I have thrown together fairly quickly without taking the time to make it pretty):

>>> def split_at_uppercase(text):
    result = []
    new_word = []
    for char in text:
        if char.isupper():
            # An upper-case letter starts a new word; flush the current one first.
            if new_word:
                result.append(''.join(new_word))
            new_word = [char]
        elif char == ' ':  # in more complicated scenarios, may need to use regex for white space
            # A space ends the current word, if there is one.
            if new_word:
                result.append(''.join(new_word))
            new_word = []
        else:
            new_word.append(char)
    # Flush the final word; skip it when the text is empty or ends with a space.
    if new_word:
        result.append(''.join(new_word))
    return result

>>> t = 'καὶ τοῦΠιλάτου εἰςἹεροσόλυμα'
>>> split_at_uppercase(t)
['καὶ', 'τοῦ', 'Πιλάτου', 'εἰς', 'Ἱεροσόλυμα']

Then, as @acarlon has suggested, you will need to start checking substrings against a dictionary (for example, via a spell-checking library such as PyEnchant). But even then, you will find 'meaning' where no one may have meant to impart it. And further, you will discover words that do not interest you (such as prod).
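
A minimal sketch of that dictionary check follows. The word set is a tiny hand-made stand-in, not a real dictionary, and the name keep_dictionary_words is hypothetical. With PyEnchant, the membership test would presumably be replaced by a call along the lines of enchant.Dict("en_US").check(word).

```python
# Stand-in for a real dictionary lookup (assumption: in practice a full
# word list or a spell checker would be used instead of this small set).
KNOWN_WORDS = {'product', 'title', 'detail', 'info', 'price', 'rating', 'score', 'prod'}

def keep_dictionary_words(tokens, known_words=KNOWN_WORDS):
    """Return the tokens found in known_words, compared case-insensitively."""
    return [tok for tok in tokens if tok.lower() in known_words]

tokens = ['ctl', 'PD', 'lbl', 'Product', 'Title', 'prod', 'detail', 'info']
print(keep_dictionary_words(tokens))
```

Note that prod passes the check here, which is exactly the caveat above: a dictionary alone cannot tell which real words are unwanted in your data.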

I used a hand-made very simple Markov chain for a problem like this. I'm curious, what specific classifier would you use for this problem? — dsign, Jan 05 '14 at 05:18
@dsign A Markov chain could work nicely here, since the words (or events) appear to be related and are perhaps characteristic of one (or perhaps two) genres, although I guess the success of that model (or any other!) depends on how much data the OP has. But perhaps you are thinking about classifying text on a more microscopic level (which might recognize common morphemes in the English language and present possible combos). Many of the words also seem to be nouns or adjectives, so a pos-tagger might also be a component of a decision tree. I haven't given this sort of problem too much thought. — Justin O Barber, Jan 05 '14 at 05:42
Thanks! The term pos-tagger was new for me, and POS classification can certainly help me. — dsign, Jan 05 '14 at 05:47
What does that regex do exactly? I think that regex solution might be good enough, followed by running the result through a spell checker, but does it work for other languages like Finnish or Chinese? — KJW, Jan 05 '14 at 06:30
@Kim Jong Woo See my edits above. I have explained the regex and offered an alternative for languages that use characters that go beyond ASCII. — Justin O Barber, Jan 05 '14 at 12:27

score 0 · Answer 2 · edited May 23 '17 at 10:25

I would suggest:

1. Extract all words (sequences of letters; this will include compound words such as ProductTitle), i.e. strip out everything that is not a letter and collect the remaining runs into groups. A regex something like ([a-zA-Z]+) would do. Keep the + greedy: a non-greedy +? would match one letter at a time.
2. Look up each word in an English dictionary database. If there is a match, move on to the next word. If not, go to step 3.
3. If there is no match, try to find multiple words in the compound word, e.g. ProductTitle -> Product Title. This is not a trivial task: it requires a decision-tree type of operation and the ability to roll back. Have a look at tries and at this answer.
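
The compound-word step (step 3) can be sketched as a recursive search that rolls back when a split leads to a dead end. This is only an illustration under stated assumptions: WORDS is a tiny stand-in word list, segment is a hypothetical name, and a plain set is used for lookups where a trie would make the prefix checks cheaper on a large dictionary.

```python
from functools import lru_cache

# Tiny stand-in word list (assumption); a real run would load a full dictionary.
WORDS = frozenset({'product', 'title', 'prod', 'price', 'on', 'sale', 'rating', 'score'})

def segment(text, words=WORDS):
    """Return one split of text into dictionary words, or None if none exists."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end means every character has been consumed.
        if start == len(text):
            return ()
        # Try the longest candidate first; shorter ones are the roll-back path.
        for end in range(len(text), start, -1):
            candidate = text[start:end]
            if candidate in words:
                rest = solve(end)
                if rest is not None:
                    return (candidate,) + rest
        return None  # no word starting at this position leads to a full split

    result = solve(0)
    return list(result) if result is not None else None

for s in ['ProductTitle', 'prodprice', 'onsale', 'lbl']:
    print(s, '->', segment(s))
```

Trying the longest candidate first is a design choice, not a requirement; it tends to prefer product over prod, but it will still return splits such as prod + price that may not be what you want.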

@Nabla. Yes, that is where step 3 comes in. — acarlon, Jan 05 '14 at 05:01
