I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research on various libraries to figure out how I should be doing this, but I'm not yet confident that I know what to do.

I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:

  • "starting tomorrow, we have 5 boxes of @hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "@hersheys"])
  • "Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (Coca-Cola is the product [aliased as "Coke"] from the Coca-Cola company and Pepsi is the product from the PepsiCo company)
  • "#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)

So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.

Some concerns I have are:

  • Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
  • Product names that are also English words or things would get caught, like mustang the horse versus Mustang the car.
  • Needing to keep a list of alternative names for products (e.g. "coke" for "Coca-Cola", etc.)

I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point: Java preferably, but Python and Scala are acceptable.

loesak
  • Answered a similar question just yesterday. Focus on techniques - not tools. http://stackoverflow.com/questions/30585228/how-to-detect-features-of-a-product-in-an-english-sentence-nlp/30627873#30627873 See if it helps or else I'll write detailed answer. – Aditya Jun 04 '15 at 06:22
  • @AdityaJoshi, thank you. I'll look into this. In the meantime, I found something called Lexical Level Matching (http://cogcomp.cs.illinois.edu/page/demo_view/LLM), which for the most part does the minimum of what I need. It'll take me a while to evaluate this and your suggestion and to provide feedback, as this is a whole new area for me. – loesak Jun 10 '15 at 19:00

2 Answers

The answer that you chose does not really answer your question.

The best approach you can take is to use a named entity recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The database there might be missing some new brands like Lyft (Uber's rival), but without developing your own prop database, the Stanford tagger will solve half of your immediate needs.

If you have time, I would build a dictionary that has every brand name and simply extract matches from the tweet strings. http://www.namedevelopment.com/brand-names.html If you know how to crawl, it's not a hard problem to solve.
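
A dictionary lookup of this kind might look roughly like the sketch below. It uses only the Python standard library; the brand list and the names `build_matcher` and `find_brands` are hypothetical placeholders for illustration, and a real list would be loaded from a crawled file or a database.

```python
import re

# Hypothetical brand dictionary; in practice, load it from a file or database.
BRANDS = ["Snickers", "Pepsi", "Coca-Cola", "Mustang", "Lyft"]


def build_matcher(names):
    """Compile one case-insensitive pattern that matches any whole brand name."""
    # Longest names first, so a longer name wins over a shorter one it contains.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    # The lookarounds reject matches embedded in longer words (e.g. "mustangs").
    return re.compile(r"(?<![\w-])(?:" + alternatives + r")(?![\w-])", re.IGNORECASE)


MATCHER = build_matcher(BRANDS)
CANONICAL = {name.lower(): name for name in BRANDS}


def find_brands(text):
    """Return the sorted, de-duplicated brand names mentioned in the text."""
    return sorted({CANONICAL[m.group(0).lower()] for m in MATCHER.finditer(text)})


print(find_brands("#OMG, i just bought my dream car. a mustang!!!!"))
print(find_brands("Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri."))
```

With this list the second tweet yields only Pepsi: "Coke" is missed because it is an alias rather than a dictionary entry, so exact lookup alone does not cover alternative names or misspellings.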

aerin

It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they are nevertheless independent tasks.

Misspellings

In order to deal with potential misspellings of words, you need to associate these possible misspellings with their canonical (i.e. correct) form.

  • Phonetic similarity: A common cause of "misspellings" is opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
  • Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
  • Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
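
The three approaches above could be layered roughly as follows. This is a minimal sketch using only the Python standard library: the product list, the hand-made mapping, the 0.8 similarity cutoff, and the function names are all illustrative assumptions, and Soundex merely stands in for whatever phonetic scheme you prefer. Soundex is coarse (it gives innovashun and innovation different codes), so a richer phonetic algorithm may be needed in practice.

```python
import difflib

# Hypothetical canonical product names and hand-made exception mappings.
PRODUCTS = ["snickers", "pepsi", "mustang", "coca-cola"]
HAND_MADE = {"coke": "coca-cola", "coca cola": "coca-cola"}

# Soundex digit for each consonant; vowels, h, w and y have no digit.
_CODES = {ch: digit
          for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                                 ("l", "4"), ("mn", "5"), ("r", "6"))
          for ch in letters}


def soundex(word):
    """Return the 4-character American Soundex code of a word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    digits = []
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (word[0].upper() + "".join(digits) + "000")[:4]


# Phonetic index: Soundex code -> canonical product name.
PHONETIC_INDEX = {soundex(name): name for name in PRODUCTS}


def canonicalize(token, cutoff=0.8):
    """Map a token to a canonical product name, or None if nothing fits."""
    token = token.lower().strip()
    if token in HAND_MADE:                      # 1. hand-made mappings
        return HAND_MADE[token]
    if token in PRODUCTS:                       # 2. exact match
        return token
    by_sound = PHONETIC_INDEX.get(soundex(token))
    if by_sound is not None:                    # 3. phonetic similarity
        return by_sound
    close = difflib.get_close_matches(token, PRODUCTS, n=1, cutoff=cutoff)
    return close[0] if close else None          # 4. form similarity


for token in ("snikers", "mustagn", "coke", "banana"):
    print(token, "->", canonicalize(token))
```

Here "snikers" is resolved through the phonetic index, while the transposition "mustagn" has a different Soundex code and is only caught by the string-similarity fallback. Lowering the cutoff catches more typos but also lets in more noise of the chic vs. chick kind.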

Word-sense disambiguation

Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
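
As a toy illustration of the idea, the sketch below compares the words around a mention with the words that surround each known sense in a few labelled sentences. Everything here is an assumption made for the example: the sample sentences, the tiny stopword list, the window size, and the function names. A real system would build its context profiles from a large corpus.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "i", "my", "to", "with", "of", "in", "on"}

# Hypothetical labelled sentences for each sense of "mustang".
SENSE_EXAMPLES = {
    "car": ["i just bought a ford mustang with a v8 engine",
            "drove my new mustang to the dealer"],
    "horse": ["a wild mustang galloped across the plains",
              "the rancher tamed a mustang horse"],
}


def tokenize(text):
    """Lower-case the text, split it into word tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]


def context_vector(text, target, window=4):
    """Count the words within `window` tokens before and after each target mention."""
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_profiles(examples, target):
    """Sum the context vectors of all example sentences for each sense."""
    profiles = {}
    for sense, sentences in examples.items():
        profile = Counter()
        for sentence in sentences:
            profile.update(context_vector(sentence, target))
        profiles[sense] = profile
    return profiles


def disambiguate(text, target, profiles):
    """Return (best sense, score); a score of 0.0 means there was no evidence."""
    query = context_vector(text, target)
    scores = {sense: cosine(query, profile) for sense, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


PROFILES = build_profiles(SENSE_EXAMPLES, "mustang")
print(disambiguate("#OMG, i just bought my dream car. a mustang!!!!", "mustang", PROFILES))
print(disambiguate("saw a wild mustang on the plains", "mustang", PROFILES))
```

The returned score can double as the confidence number the question asks for, though with so little training text it is only meaningful relative to the scores of the other senses.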

Linguistic aliases (synonyms)

As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. Simple aliases like these, which refer to generic entities, can often be treated as variant spellings in the same way that "common misspellings" are, and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. "I'm in the throne room just now!" to "The throne room of Buckingham Palace is beautiful."
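
A list-based alias mapping that also flags ambiguous forms for a later disambiguation step might be sketched like this. The alias table, the canonical label "toilet", and the function name `resolve_aliases` are assumptions made for illustration only.

```python
import re

# Hypothetical alias table: surface form -> (canonical entity, is_ambiguous).
ALIASES = {
    "toilet": ("toilet", False),
    "bathroom": ("toilet", False),
    "washroom": ("toilet", False),
    "restroom": ("toilet", False),
    "water closet": ("toilet", False),
    "wc": ("toilet", False),
    "loo": ("toilet", False),
    "throne room": ("toilet", True),  # may also mean a literal throne room
}

# Longest aliases first, so multi-word forms are tried before shorter ones.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(alias)
                        for alias in sorted(ALIASES, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def resolve_aliases(text):
    """Return (canonical, alias, is_ambiguous) for each alias found in the text."""
    results = []
    for match in _PATTERN.finditer(text):
        alias = match.group(0).lower()
        canonical, ambiguous = ALIASES[alias]
        results.append((canonical, alias, ambiguous))
    return results


print(resolve_aliases("Where is the loo?"))
print(resolve_aliases("I'm in the throne room just now!"))
```

Mentions flagged as ambiguous would then be passed to a context-based check before being accepted, while unambiguous aliases can be mapped directly.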

Conclusion

You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

errantlinguist
  • Thank you very much. A very comprehensive answer. Unfortunately, I will not be able to get to this for a while to verify if you are correct. Hopefully sooner rather than later. Thank you very much. – loesak Mar 30 '16 at 19:22