
I have two columns in my data frame: Text and Category.

Sample data (the real data set is much bigger). The two columns are separated by |:

Text|Category
I want to get financial advise|financial advise
can I get my loan approved?| loan query
how many years of credit history required?|credit card query

I want to analyze the Text column and predict the Category. In the real data, there are hundreds of such categories. What would be the best approach to do this? I am working in R.

James Z
raza14
    Please read [(1)](http://stackoverflow.com/help/how-to-ask) how to ask a good question, [(2)](http://stackoverflow.com/help/mcve) how to create an MCVE, and [(3)](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#answer-5963610) how to provide a minimal reproducible example in R. Then edit and improve your question accordingly. I.e., abstract from your real problem... – Christoph Nov 26 '17 at 10:31

1 Answer


Your task can be split into subtasks:

  1. Convert the values of the "category" variable into integers.

  2. Turn the "text" variable into numeric features (for example, per-document word counts) using the "tidy text" approach.

  3. Apply a multiclass classification model, for example as described in "Multiclass Classification with XGBoost in R".

This is a very general approach to your task.
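The first two steps can be sketched in base R on the three sample rows from the question. This is only an illustration under assumptions: the names `labels`, `tokens`, and `dtm` are made up for the example, and the tokenizer is a crude base R stand-in for what `tidytext::unnest_tokens()` would normally do in a tidy text workflow.

```r
# Toy data copied from the question; the real data has hundreds of categories.
df <- data.frame(
  Text = c("I want to get financial advise",
           "can I get my loan approved?",
           "how many years of credit history required?"),
  Category = c("financial advise", "loan query", "credit card query"),
  stringsAsFactors = FALSE
)

# Step 1: encode the categories as integers.
# XGBoost expects class labels in 0..(k - 1), hence the "- 1L".
category <- factor(df$Category)
labels <- as.integer(category) - 1L

# Step 2: one row per (document, word) pair, i.e. the "tidy text" shape.
# Punctuation is stripped and text is lowercased before splitting on spaces.
tokens_list <- strsplit(tolower(gsub("[^A-Za-z ]", "", df$Text)), " +")
tokens <- data.frame(
  doc  = rep(seq_along(tokens_list), lengths(tokens_list)),
  word = unlist(tokens_list),
  stringsAsFactors = FALSE
)

# Count words per document and spread the counts into a
# document-term matrix (rows = documents, columns = words).
dtm <- unclass(table(tokens$doc, tokens$word))
storage.mode(dtm) <- "double"  # numeric matrix, ready for a model

dim(dtm)
levels(category)  # keep these to translate predicted integers back to names
```

The matrix `dtm` and the vector `labels` are what a multiclass model in step 3 would take as features and target. For hundreds of categories and a large vocabulary, a sparse matrix would likely be a better fit than the dense one used here.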

Andrii
  • Will XGBoost be able to handle 100 levels in the dependent variable? Also, if you convert the levels into numbers like 1, 2, 3, 4, etc., algorithms like RandomForest will start predicting decimal numbers, which makes no sense here, since each prediction should be a number that denotes a level. – raza14 Nov 28 '17 at 18:07
  • XGBoost can handle 100+ levels. You can encode variable values into categorical or numeric formats; it's standard practice. Also, I would be happy to have a "+1" on my answer :) – Andrii Nov 28 '17 at 18:49
  • Andrii, if my dependent variable has 100 levels, with names 'big', 'medium', 'large', 'xralarge', etc., you are suggesting that I first assign a number (numeric data type) to each level and then train the XGBoost model to predict these numbers. I'm not sure whether that will produce only whole numbers or also numbers with decimals as predictions. What if I train XGBoost directly on these 100 levels without any conversion? As you said, it can deal with many levels, so it should work with my existing 100 levels with no problems. Please correct me if I'm wrong. – raza14 Nov 28 '17 at 20:01
  • I'm a new user, so my upvote on the answer is not reflected here. Is that how it works? I really appreciate the answers; I'm just not sure how to show it. I will figure it out soon. – raza14 Nov 28 '17 at 20:03
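On the question of decimals raised in the comments, a small sketch may help. It assumes the `xgboost` package with its `xgb.DMatrix()` and `xgb.train()` functions; the data, the one-hot feature matrix `x`, and the helper `decode()` are invented for illustration. As far as I know, the label passed to `xgb.DMatrix()` has to be numeric, so the level names do need to be converted to integer codes first. With the `multi:softmax` objective the predictions are class codes rather than values on a continuous scale, and they can be mapped back to the level names.

```r
# Hypothetical target with three of the level names from the comment.
sizes <- factor(c("big", "medium", "large", "big", "medium", "large"))
y <- as.integer(sizes) - 1L  # integer codes in 0..(k - 1)
k <- nlevels(sizes)

# Map integer codes back to the original level names.
decode <- function(codes) levels(sizes)[codes + 1L]
decode(y)  # same names as in `sizes`

# The model part only runs if xgboost is installed.
if (requireNamespace("xgboost", quietly = TRUE)) {
  # Toy feature matrix: one indicator column per row pattern.
  x <- matrix(c(1, 0, 0,
                0, 1, 0,
                0, 0, 1,
                1, 0, 0,
                0, 1, 0,
                0, 0, 1), ncol = 3, byrow = TRUE)
  dtrain <- xgboost::xgb.DMatrix(data = x, label = y)
  fit <- xgboost::xgb.train(
    params = list(objective = "multi:softmax", num_class = k),
    data = dtrain,
    nrounds = 5
  )
  pred <- predict(fit, dtrain)  # class codes, treated as classes, not as a scale
  decode(pred)                  # back to "big", "medium", "large"
}
```

Treating the target as a set of classes is what avoids the fractional predictions: a regression objective on the codes 1, 2, 3, ... would indeed return decimals, while a classification objective picks one of the `k` codes. If per-class probabilities are wanted instead, the `multi:softprob` objective is the usual alternative.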