
I am a newbie to NLP and related technologies. I have been researching how to decompose folksonomies such as hashtags into their individual terms (e.g. #harrypotterworld as "harry potter world") in order to carry out Named-Entity Recognition.

But I have not come across any available library or previous work I could use for this. Is this achievable, or am I following the wrong approach? If it is achievable, are there any available libraries or algorithmic techniques I could use?

Chiranga Alwis

2 Answers


What you are looking for is a compound splitter. As far as I know, this problem does have some existing implementations, some of which work reasonably well.

Unfortunately, most research I know of has been done on languages that tend to compound nouns (e.g. German). Fun fact: "hashtag" is a compound word itself.

I once used this one: http://ilps.science.uva.nl/resources/compound-splitter-nl/ It is an algorithm that works on Dutch. It basically uses a dictionary of uncompounded words and assumes a very simple grammar for compounding, along the lines of: infixes such as n and s are allowed, and compounded words are always a combination of 2 or more uncompounded words from the dictionary.

I think you could use the given implementation for compounded hashtags if you provided an English dictionary and adapted the assumed grammar somewhat (you might not want infixes, for example).
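For illustration, here is a minimal Python sketch of that dictionary-based idea (no infixes, words of two or more letters only). The word-list path and function names are my own assumptions, not code from the linked Dutch splitter:

```python
# Minimal dictionary-based compound splitter sketch.
# Assumes a plain word list, one word per line (path is illustrative).

def load_dictionary(path="/usr/share/dict/words"):
    with open(path) as f:
        return {line.strip().lower() for line in f if len(line.strip()) > 1}

def split_compound(tag, dictionary):
    """Return one decomposition of `tag` into dictionary words, or None."""
    tag = tag.lower().lstrip("#")
    if tag in dictionary:
        return [tag]
    # Try the longest prefix first, backtracking if the rest cannot be
    # split; skip single-letter prefixes.
    for i in range(len(tag) - 1, 1, -1):
        prefix, rest = tag[:i], tag[i:]
        if prefix in dictionary:
            tail = split_compound(rest, dictionary)
            if tail:
                return [prefix] + tail
    return None

words = load_dictionary()
print(split_compound("#harrypotterworld", words))
# -> ['harry', 'potter', 'world'], provided all three are in the word list
```

Note that this returns the first decomposition it finds; ranking alternative splits, e.g. by word frequency, is a natural refinement.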

S van Balen
  • First of all, thanks for the clear answer. So this effectively means that I should implement my own compound splitter for hashtags, probably using (slightly adapted) logic from the implementation you suggested? – Chiranga Alwis Dec 28 '16 at 12:07
  • Yes, that is what I would do. Depending on how good a job it must do, this can be a project that takes you 1 or 2 hours. – S van Balen Dec 28 '16 at 12:34

Have you tried the method suggested here?

https://stackoverflow.com/a/11642687/7337349

The issue is that the dictionary of words has to contain so-called proper nouns to work really well for named-entity recognition, which theoretically makes it a very large dictionary (plus the frequency distribution is probably hard to measure).

Incidentally, for the specific example you mentioned, harry potter world, I think the answer in that link would work: all the words are present in the dictionary of words linked in that answer.
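The core of that approach is a dynamic-programming segmentation that scores candidate splits by word frequency. The sketch below is a rough reimplementation of the idea rather than the linked code; the `word_freq` counts and the unknown-character penalty are illustrative assumptions:

```python
import math

def best_segmentation(text, word_freq):
    """Split `text` into the most probable word sequence under `word_freq`."""
    total = sum(word_freq.values())
    # best[i] = (cost, words) for the best segmentation of text[:i],
    # where cost is the negative log-probability of the split.
    best = [(0.0, [])]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - 20), i):  # assume words are <= 20 chars
            word = text[j:i]
            if word in word_freq:
                cost = best[j][0] - math.log(word_freq[word] / total)
                candidates.append((cost, best[j][1] + [word]))
        if not candidates:
            # Unknown character: heavy penalty so known words win.
            candidates.append((best[i - 1][0] + 25.0, best[i - 1][1] + [text[i - 1]]))
        best.append(min(candidates, key=lambda c: c[0]))
    return best[-1][1]

freq = {"harry": 500, "potter": 300, "world": 9000}  # toy counts
print(best_segmentation("harrypotterworld", freq))
# -> ['harry', 'potter', 'world']
```

With proper-noun-heavy hashtags, the practical work is building `word_freq` from a corpus that actually contains names like "harry" and "potter", which is exactly the dictionary-size problem mentioned above.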

Aravind M