I have a CSV file with three columns: Id, Main_user and Users. Id is the label and the other two columns are features. I want to load the two features (Main_user and Users) from the CSV, vectorize them, and assemble them into one vector. After using HashingTF as described in the documentation, how do I add a second feature, "Main_user", in addition to the feature "Users"?

DataFrame df = (new CsvParser()).withUseHeader(true).csvFile(sqlContext, csvFile);
Tokenizer tokenizer = new Tokenizer().setInputCol("Users").setOutputCol("words");        
DataFrame wordsData = tokenizer.transform(df);
int numFeatures = 20;
HashingTF hashingTF = new HashingTF().setInputCol("words")
                .setOutputCol("rawFeatures").setNumFeatures(numFeatures);

1 Answer

OK, I found a solution: process the columns one after another (tokenize, then apply HashingTF), and assemble the resulting vectors at the end. I would appreciate any improvements to this.

DataFrame df = (new CsvParser()).withUseHeader(true).csvFile(sqlContext, csvFile);

// The same Tokenizer and HashingTF instances are reconfigured for each column.
Tokenizer tokenizer = new Tokenizer();
HashingTF hashingTF = new HashingTF();
int numFeatures = 35;

// Tokenize and hash the "Users" column.
tokenizer.setInputCol("Users")
        .setOutputCol("Users_words");
DataFrame df1 = tokenizer.transform(df);
hashingTF.setInputCol("Users_words")
        .setOutputCol("rawUsers").setNumFeatures(numFeatures);
DataFrame featurizedData1 = hashingTF.transform(df1);

// Tokenize and hash the "Main_user" column, keeping the columns added above.
tokenizer.setInputCol("Main_user")
        .setOutputCol("Main_user_words");
DataFrame df2 = tokenizer.transform(featurizedData1);
hashingTF.setInputCol("Main_user_words")
        .setOutputCol("rawMain_user").setNumFeatures(numFeatures);
DataFrame featurizedData2 = hashingTF.transform(df2);

// Now assemble both hashed vectors into a single feature vector
VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"rawUsers", "rawMain_user"})
        .setOutputCol("assembledVector");

DataFrame assembledFeatures = assembler.transform(featurizedData2);
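
To make it easier to see what the assembled column ends up containing, here is a rough plain-Java sketch (no Spark involved) of the same idea: tokenize each column, hash the terms into a fixed-length count vector, and concatenate the two vectors. The class and method names are hypothetical, and String.hashCode() is only an assumption standing in for whatever hash function HashingTF uses internally, so the bucket positions will not necessarily match Spark's output.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration only; this is not Spark code.
final class HashedFeatureSketch {

    // Lowercase the text and split on whitespace, roughly like a simple tokenizer.
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Hashing trick: each term increments the bucket selected by its hash code.
    // String.hashCode() is an assumption, not necessarily the hash Spark uses.
    static double[] hashingTf(String[] terms, int numFeatures) {
        double[] counts = new double[numFeatures];
        for (String term : terms) {
            int bucket = Math.floorMod(term.hashCode(), numFeatures);
            counts[bucket] += 1.0;
        }
        return counts;
    }

    // Concatenate the per-column vectors in the order they are given.
    static double[] assemble(double[]... parts) {
        int total = 0;
        for (double[] part : parts) {
            total += part.length;
        }
        double[] out = new double[total];
        int offset = 0;
        for (double[] part : parts) {
            System.arraycopy(part, 0, out, offset, part.length);
            offset += part.length;
        }
        return out;
    }

    public static void main(String[] args) {
        int numFeatures = 35;
        // Made-up values for one row of the "Users" and "Main_user" columns.
        double[] users = hashingTf(tokenize("alice bob alice"), numFeatures);
        double[] mainUser = hashingTf(tokenize("carol"), numFeatures);
        double[] assembled = assemble(users, mainUser);
        System.out.println(assembled.length); // 2 * numFeatures = 70
        System.out.println(Arrays.toString(assembled));
    }
}
```

With numFeatures = 35 for each column, the assembled vector has 70 entries: the first 35 come from "Users" and the last 35 from "Main_user", which mirrors the order of the input columns passed to the assembler.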