
I am a newbie to NLP and related technologies. I have been researching how to decompose folksonomies such as hashtags into their individual terms (e.g. #harrypotterworld as "harry potter world") in order to carry out Named-Entity Recognition.

But I have not come across any available library or previous work I could use for this. Is this achievable, or am I following the wrong approach? If it is achievable, are there any available libraries or algorithmic techniques I could use?

Chiranga Alwis

2 Answers


What you are looking for is a compound splitter. As far as I know, this problem does have some existing implementations, some of which work reasonably well.

Unfortunately, most research I know of has been done on languages that tend to compound nouns (e.g. German). Fun fact: "hashtag" is a compound word itself.

I once used this one: http://ilps.science.uva.nl/resources/compound-splitter-nl/ It is an algorithm that works on Dutch. It basically uses a dictionary of uncompounded words and assumes a very simple grammar for compounding, along the lines of: infixes such as n and s are allowed, and compounded words are always a combination of 2 or more uncompounded words from the dictionary.

I think you could use the given implementation for compounded hashtags if you provided an English dictionary and adapted the assumed grammar somewhat (you might not want infixes, for example).
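For illustration, here is a minimal Python sketch of that dictionary-based idea (no infixes, words of two or more letters only). The word-list path and function names are my own assumptions, not code from the linked Dutch splitter:

```python
# Minimal dictionary-based compound splitter sketch.
# Assumes a plain word list, one word per line (path is illustrative).

def load_dictionary(path="/usr/share/dict/words"):
    with open(path) as f:
        return {line.strip().lower() for line in f if len(line.strip()) > 1}

def split_compound(tag, dictionary):
    """Return one decomposition of `tag` into dictionary words, or None."""
    tag = tag.lower().lstrip("#")
    if tag in dictionary:
        return [tag]
    # Try the longest prefix first, backtracking if the rest cannot be
    # split; skip single-letter prefixes.
    for i in range(len(tag) - 1, 1, -1):
        prefix, rest = tag[:i], tag[i:]
        if prefix in dictionary:
            tail = split_compound(rest, dictionary)
            if tail:
                return [prefix] + tail
    return None

words = load_dictionary()
print(split_compound("#harrypotterworld", words))
# -> ['harry', 'potter', 'world'], provided all three are in the word list
```

Note that this returns the first decomposition it finds; ranking alternative splits, e.g. by word frequency, is a natural refinement.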

S van Balen
  • First of all, thanks for the clear answer. So this effectively means that I should implement my own compound splitter for hashtags, probably using (slightly adapted) logic from the implementation you suggested? – Chiranga Alwis Dec 28 '16 at 12:07
  • Yes, that is what I would do. Depending on how good a job it must do, this can be a project that takes you 1 or 2 hours. – S van Balen Dec 28 '16 at 12:34

Have you tried the method suggested here?

https://stackoverflow.com/a/11642687/7337349

The issue is that the dictionary of words has to contain so-called proper nouns to work really well for named-entity recognition, which theoretically makes it a very large dictionary (plus the frequency distribution is probably hard to measure).

Incidentally, for the specific example you mentioned, harry potter world, I think the answer in that link would work: all the words are present in the dictionary of words linked in that answer.
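The core of that approach is a dynamic-programming segmentation that scores candidate splits by word frequency. The sketch below is a rough reimplementation of the idea rather than the linked code; the `word_freq` counts and the unknown-character penalty are illustrative assumptions:

```python
import math

def best_segmentation(text, word_freq):
    """Split `text` into the most probable word sequence under `word_freq`."""
    total = sum(word_freq.values())
    # best[i] = (cost, words) for the best segmentation of text[:i],
    # where cost is the negative log-probability of the split.
    best = [(0.0, [])]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - 20), i):  # assume words are <= 20 chars
            word = text[j:i]
            if word in word_freq:
                cost = best[j][0] - math.log(word_freq[word] / total)
                candidates.append((cost, best[j][1] + [word]))
        if not candidates:
            # Unknown character: heavy penalty so known words win.
            candidates.append((best[i - 1][0] + 25.0, best[i - 1][1] + [text[i - 1]]))
        best.append(min(candidates, key=lambda c: c[0]))
    return best[-1][1]

freq = {"harry": 500, "potter": 300, "world": 9000}  # toy counts
print(best_segmentation("harrypotterworld", freq))
# -> ['harry', 'potter', 'world']
```

With proper-noun-heavy hashtags, the practical work is building `word_freq` from a corpus that actually contains names like "harry" and "potter", which is exactly the dictionary-size problem mentioned above.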

Aravind M