I'm analysing a dataset from this website:

 https://www.kaggle.com/c/predicting-loan-default/data 

My variable emp_length takes about 3000 distinct values. Many of them are variants of the same thing or share a keyword (for example account, accountant, accounting, account specialist, acct.), and some contain typos or abbreviations. I want to reduce the number of distinct values by merging such variants, and then encode the result as numeric values. I tried to find keywords with text mining in R, but I'm not convinced that this is the right approach. Does anyone have a better idea?

anka0501

1 Answer

Try to adapt this "data science" approach:

Example input data:

emp_length<-c("account","accountant","accounting","account specialist","Data Scientist","Data Science Expert")

String distance + clustering

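library(stringdist)  # provides stringdistmatrix()
# k-means on the rows of the pairwise distance matrix; cluster numbering can vary between runs
cluster<-kmeans(stringdistmatrix(emp_length,emp_length,method="jw"),centers=2)
cluster_n<-cluster$cluster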

A possible grouping of the labels

cbind(emp_length,cluster_n)
     emp_length            cluster_n
[1,] "account"             "2"      
[2,] "accountant"          "2"      
[3,] "accounting"          "2"      
[4,] "account specialist"  "2"      
[5,] "Data Scientist"      "1"      
[6,] "Data Science Expert" "1" 

This can help you detect which labels belong together. The cluster number can then serve as the numeric encoding of each group.
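
With about 3000 raw values it may help to normalise and deduplicate the labels before clustering, and then map the cluster number back onto every row. Below is a minimal sketch of that idea, not a definitive implementation. The `raw` vector is made-up example data, and base R's `adist()` with hierarchical clustering is swapped in for `stringdistmatrix()` and `kmeans()` only so that the sketch runs without extra packages.

```r
# Hypothetical raw column with repeated and inconsistently written values
raw <- c("Account", "accountant ", "ACCOUNTING", "account",
         "Data Scientist", "data scientist")

# Normalise case and whitespace, then keep only the distinct labels
clean  <- tolower(trimws(raw))
labels <- unique(clean)

# Levenshtein distance (base R adist), divided by the longer string's length
d  <- adist(labels) / outer(nchar(labels), nchar(labels), pmax)
hc <- hclust(as.dist(d), method = "average")

# k = 2 is an assumption that suits this toy data only
grp <- cutree(hc, k = 2)

# Map every original row to the cluster number of its cleaned label
code <- unname(grp[match(clean, labels)])
data.frame(raw, code)
```

Clustering only the distinct labels keeps the distance matrix as small as possible, which matters because its size grows with the square of the number of labels.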

Terru_theTerror
  • OK, thank you very much. One question: how do I choose centers when I don't know how many levels I have? – anka0501 Apr 12 '18 at 19:15
  • You can apply one or more of the techniques described in this topic https://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters to the stringdistmatrix output to find the optimal number of clusters – Terru_theTerror Apr 13 '18 at 08:32
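
As an illustration of the last comment, the average silhouette width is one of the techniques discussed in the linked topic. The sketch below tries several candidate numbers of clusters and keeps the one with the highest average silhouette width. It rests on some assumptions: the `cluster` package is installed (it normally ships with R), and base R's `adist()` with `hclust()` stands in for `stringdistmatrix()` and `kmeans()`. The same loop should also work on a distance object built from the `stringdistmatrix()` output.

```r
library(cluster)  # assumed to be installed; provides silhouette()

emp_length <- c("account", "accountant", "accounting", "account specialist",
                "Data Scientist", "Data Science Expert")

# Levenshtein distance (base R adist), divided by the longer string's length
d  <- as.dist(adist(emp_length) /
              outer(nchar(emp_length), nchar(emp_length), pmax))
hc <- hclust(d, method = "average")

# Average silhouette width for each candidate number of clusters
ks  <- 2:(length(emp_length) - 1)
sil <- sapply(ks, function(k) mean(silhouette(cutree(hc, k), d)[, "sil_width"]))

# Keep the candidate with the highest average silhouette width
best_k    <- ks[which.max(sil)]
cluster_n <- cutree(hc, best_k)
cbind(emp_length, cluster_n)
```

For 3000 labels you would probably restrict `ks` to a plausible range rather than trying every value, since each candidate needs its own silhouette computation.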