I'm part of a group working on a big data course project, and we've run into a problem that we think calls for NLP. Currently our data is formatted in JSON like so:
"wine": {
"category": "socializing",
"category_id": 31,
"score": 0.0,
"topic_id": 611
}
"dragons": {
"category": "lifestyle",
"category_id": 17,
"score": 0.279108277990115,
"topic_id": 2137
},
"furry-fandom": {
"category": "lifestyle",
"category_id": 17,
"score": 0.279108277990115,
"topic_id": 48595
},
"legendarycreatures": {
"category": "lifestyle",
"category_id": 17,
"score": 0.279108277990115,
"topic_id": 10523
}
Each key is a topic, paired with relevant info: a category, a popularity score, and category/topic IDs. The API we're pulling from already assigns a category to every topic, so that part is handled. Our problem is that the categories (only 33 of them) are too broad to reveal any meaningful trends, while the topics (approximately 22,000 of them) are too specific and overlap heavily (e.g. dragons vs. legendarycreatures).
This is where NLP comes in: we want to create a set of super-topics that are narrower than the categories but broader than the current topics. Using "dragons" and "legendarycreatures" again, both (along with other related topics) would fit into a super-topic like "fantasy".
A little more background: we're using Python to grab and process our data, we'd like to keep using it for this step, and none of us has any practical experience with NLP.
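To make the goal concrete, here's the rough shape of the only pipeline we could imagine: embed each topic name with pretrained word vectors, then cluster the embeddings so each cluster becomes a candidate super-topic. This is a minimal, untested sketch; the GloVe model, the `topics.json` file name, and the cluster count are all placeholders, not things we've settled on.

```python
# Rough sketch (not validated): embed topic names with pretrained GloVe
# vectors via gensim, then cluster them with k-means so each cluster
# becomes a candidate super-topic.
import json
import re

import numpy as np
import gensim.downloader
from sklearn.cluster import KMeans

word_vectors = gensim.downloader.load("glove-wiki-gigaword-100")

with open("topics.json") as f:  # hypothetical dump of the JSON above
    topics = json.load(f)

def embed(name):
    """Average the vectors of a topic name's tokens; None if all are OOV."""
    tokens = [t for t in re.split(r"[-_]", name.lower()) if t in word_vectors]
    # Run-together names like "legendarycreatures" won't split on hyphens
    # and may be out of vocabulary; those would need a segmentation step.
    if not tokens:
        return None
    return np.mean([word_vectors[t] for t in tokens], axis=0)

names, vecs = [], []
for name in topics:
    v = embed(name)
    if v is not None:
        names.append(name)
        vecs.append(v)

# 200 clusters is an arbitrary middle ground between 33 categories and
# ~22,000 topics; we'd have to tune this.
kmeans = KMeans(n_clusters=200, random_state=0)
labels = kmeans.fit_predict(np.array(vecs))

super_topics = {}
for name, label in zip(names, labels):
    super_topics.setdefault(int(label), []).append(name)
```

Even if this is a sensible direction, we'd presumably still need some way to label each cluster with a human-readable super-topic name.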
With all this in mind, we'd love some suggestions and help with this area of struggle. If there are better ways, or if it simply isn't feasible with NLP, we're open to that too. What we're trying to avoid, though, is hard-coding some sort of lookup table for the categorization.
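For instance (purely speculative on our part), maybe a lexical resource like WordNet could stand in for a hand-made table by walking each topic word up its hypernym chain to some mid-level ancestor:

```python
# Speculative alternative to a hard-coded table: climb WordNet's hypernym
# hierarchy (via NLTK) to find a mid-level ancestor for each topic word.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def mid_level_ancestor(word, steps=2):
    """Return the synset a few hypernym steps above a word's first sense."""
    synsets = wn.synsets(word)
    if not synsets:
        return None  # words WordNet doesn't know get no ancestor
    node = synsets[0]
    for _ in range(steps):
        parents = node.hypernyms()
        if not parents:
            break
        node = parents[0]
    return node.name()

# "dragon" should land somewhere around a mythical-creature synset, which
# is roughly the "fantasy"-style grouping we're after.
print(mid_level_ancestor("dragon"))
```

We have no idea whether first-sense hypernyms are reliable enough across 22,000 topic names, though.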
TL;DR: We're trying to map ~22,000 topics onto "super-topics" that are more specific than our current 33 categories but broader than the raw topics. We'd like to do this with NLP in Python, don't know how to go about it, and are open to other suggestions as well.