How to handle slang words and short forms in tweets like luv, kool and brb?

Question

I am preprocessing tweets using Python. However, many of the words are short forms of other words, like luv and kool, and there are also abbreviations like brb and ttyl.

Right now, I can only think of keeping a huge hash map with the short forms as keys and the actual words or expansions as values. Is there a better way to approach this using NLP?

NOTE: I know the question seems vague, but please don't report it. I have asked it so that amateurs can benefit from this knowledge.

PS: Is there a nicely formatted text list that I can download and use? The linked lists are good, but when I copy and paste them, they are not in an easily parsable format.

score 3 · Accepted Answer · edited May 23 '17 at 12:01

The only way to decipher abbreviations is to use external resources; that is why there are so many abbreviation dictionaries for humans. Humans can sometimes guess a meaning from common-sense knowledge and abbreviations they already know, but even they do it badly, so there is little hope of NLP doing it automatically at this time.

Sometimes definitions of abbreviations can be found in the same text, but that is not the case for Twitter or for slang.

So yes, you have to store a mapping from acronyms to their expansions. To obtain one, search for acronym dictionaries, e.g. this slang dictionary, or that, or that, or that; the last one seems to be the easiest to parse.
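A minimal sketch of such a mapping in Python, assuming the table has already been built from one of those dictionaries. The three entries in `SLANG` and the helper name `expand_slang` are made up for illustration; a real table would be far larger.

```python
import re

# Hypothetical sample entries; a real table would be loaded from a
# downloaded slang/acronym dictionary.
SLANG = {
    "brb": "be right back",
    "ttyl": "talk to you later",
    "luv": "love",
}

# Runs of letters and apostrophes; everything else (spaces, punctuation)
# is left untouched by the substitution.
TOKEN_RE = re.compile(r"[A-Za-z']+")


def expand_slang(text, table=SLANG):
    """Replace every token found in `table` (case-insensitively) with its expansion."""
    def repl(match):
        word = match.group(0)
        # Unknown words are returned unchanged.
        return table.get(word.lower(), word)
    return TOKEN_RE.sub(repl, text)


print(expand_slang("brb, luv you"))
```

Matching on whole tokens rather than raw substrings keeps words such as "luvly" from being partially rewritten.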

As for other slang like 'kool', you can try spell-correction algorithms; see the related question.
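As a rough illustration of the spell-correction idea, the sketch below uses the standard library's `difflib.get_close_matches` as a stand-in for a real spell corrector. The tiny `VOCABULARY` list, the `correct_word` helper and the 0.7 cutoff are assumptions chosen for the example, not a recommended setup.

```python
import difflib

# Hypothetical vocabulary; in practice this would be a full word list.
VOCABULARY = ["cool", "love", "great", "look"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.7):
    """Return the closest vocabulary word, or the word itself if nothing is close enough."""
    lowered = word.lower()
    if lowered in vocabulary:
        return lowered
    # get_close_matches ranks candidates by similarity ratio, best first.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct_word("kool"))
```

Character-similarity matching like this handles respellings such as 'kool', but it will not recover acronyms, which still need the dictionary lookup.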

answered Feb 27 '15 at 16:55

Nikita Astrakhantsev


Is there a nicely formatted text list that I can download and use? – GokuShanth Feb 28 '15 at 09:18
As I said, the last one can be easily parsed: open the page's source code in any browser, copy and paste the needed fragment into a text file, and keep only what is inside the tags (or take a line with an acronym, skip 2 lines, take a line with a definition, skip 4 lines, and repeat). Regexps aren't even needed, so it is indeed nicely formatted text.
– Nikita Astrakhantsev Feb 28 '15 at 10:35
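The line-skipping recipe from this comment could be sketched as follows. It assumes the copied fragment starts exactly on an acronym line and repeats every 8 lines (1 acronym, 2 skipped, 1 definition, 4 skipped); the function name `parse_acronym_lines` and the sample lines are hypothetical, and the real page layout may differ.

```python
def parse_acronym_lines(lines):
    """Build an acronym -> definition table from lines laid out in blocks of 8."""
    table = {}
    # Each record: acronym at offset 0, definition at offset 3, the rest skipped.
    for start in range(0, len(lines) - 3, 8):
        acronym = lines[start].strip()
        definition = lines[start + 3].strip()
        if acronym and definition:
            table[acronym.lower()] = definition
    return table


# Made-up sample standing in for the copied page source.
sample = [
    "BRB", "...", "...", "be right back", "...", "...", "...", "...",
    "TTYL", "...", "...", "talk to you later", "...", "...", "...", "...",
]
print(parse_acronym_lines(sample))
```

With a real file, the lines would come from something like `open(path, encoding="utf-8").read().splitlines()`, and it is worth spot-checking a few entries in case the offsets drift.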
