I'm part of a group working on a big data course project, and we've run into a problem that we think calls for NLP. Currently our data is formatted in JSON like so:
"wine": {
"category": "socializing",
"category_id": 31,
"score": 0.0,
"topic_id": 611
}
"dragons": {
"category": "lifestyle",
"category_id": 17,
"score": 0.279108277990115,
"topic_id": 2137
},
"furry-fandom": {
"category": "lifestyle",
"category_id": 17,
"score": 0.279108277990115,
"topic_id": 48595
},
"legendarycreatures": {
"category": "lifestyle",
"category_id": 17,
"score": 0.279108277990115,
"topic_id": 10523
}
Each key is a topic, paired with relevant info: a category, a popularity score, and category/topic IDs. The API we're pulling from already assigns a category to every topic, so that part is handled. Our problem is that the categories (only 33 of them) are too broad to reveal any meaningful trends, while the topics (approximately 22,000 of them) are too specific and overlap heavily (e.g. dragons vs. legendarycreatures).
This is where NLP comes in: we want to create a set of super-topics that are narrower than the categories but broader than the current topics. Using "dragons" and "legendarycreatures" again, both (along with other related topics) would fit into a super-topic like "fantasy".
A little more background: we're using Python to grab and process our data, we'd like to keep using it for this step, and none of us has any practical experience with NLP.
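To make the goal concrete, here's the rough shape of the only pipeline we could imagine: embed each topic name with pretrained word vectors, then cluster the embeddings so each cluster becomes a candidate super-topic. This is a minimal, untested sketch; the GloVe model, the `topics.json` file name, and the cluster count are all placeholders, not things we've settled on.

```python
# Rough sketch (not validated): embed topic names with pretrained GloVe
# vectors via gensim, then cluster them with k-means so each cluster
# becomes a candidate super-topic.
import json
import re

import numpy as np
import gensim.downloader
from sklearn.cluster import KMeans

word_vectors = gensim.downloader.load("glove-wiki-gigaword-100")

with open("topics.json") as f:  # hypothetical dump of the JSON above
    topics = json.load(f)

def embed(name):
    """Average the vectors of a topic name's tokens; None if all are OOV."""
    tokens = [t for t in re.split(r"[-_]", name.lower()) if t in word_vectors]
    # Run-together names like "legendarycreatures" won't split on hyphens
    # and may be out of vocabulary; those would need a segmentation step.
    if not tokens:
        return None
    return np.mean([word_vectors[t] for t in tokens], axis=0)

names, vecs = [], []
for name in topics:
    v = embed(name)
    if v is not None:
        names.append(name)
        vecs.append(v)

# 200 clusters is an arbitrary middle ground between 33 categories and
# ~22,000 topics; we'd have to tune this.
kmeans = KMeans(n_clusters=200, random_state=0)
labels = kmeans.fit_predict(np.array(vecs))

super_topics = {}
for name, label in zip(names, labels):
    super_topics.setdefault(int(label), []).append(name)
```

Even if this is a sensible direction, we'd presumably still need some way to label each cluster with a human-readable super-topic name.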
With all this in mind, we'd love some suggestions and help with this area of struggle. If there are better ways, or if it simply isn't feasible with NLP, we're open to that too. What we're trying to avoid, though, is hard-coding some sort of lookup table for the categorization.
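For instance (purely speculative on our part), maybe a lexical resource like WordNet could stand in for a hand-made table by walking each topic word up its hypernym chain to some mid-level ancestor:

```python
# Speculative alternative to a hard-coded table: climb WordNet's hypernym
# hierarchy (via NLTK) to find a mid-level ancestor for each topic word.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

def mid_level_ancestor(word, steps=2):
    """Return the synset a few hypernym steps above a word's first sense."""
    synsets = wn.synsets(word)
    if not synsets:
        return None  # words WordNet doesn't know get no ancestor
    node = synsets[0]
    for _ in range(steps):
        parents = node.hypernyms()
        if not parents:
            break
        node = parents[0]
    return node.name()

# "dragon" should land somewhere around a mythical-creature synset, which
# is roughly the "fantasy"-style grouping we're after.
print(mid_level_ancestor("dragon"))
```

We have no idea whether first-sense hypernyms are reliable enough across 22,000 topic names, though.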
TL;DR: We're trying to map ~22,000 topics onto "super-topics" that are more specific than our current 33 categories but broader than the raw topics. We'd like to do this with NLP in Python, don't know how to go about it, and are open to other suggestions as well.