
In R, we have the function factor(). I would like to use this function in a parallelized way, with SparkR.

My version of Spark is 1.6.2, and I cannot find an equivalent in the documentation. I thought I could do it with a map, but I am not certain I understand this answer, and there should be an easier way.

So to put it simply: what is the equivalent of factor() in SparkR?

Béatrice Moissinac
  • Is [this answer](http://stackoverflow.com/questions/25038294/how-do-i-run-the-spark-decision-tree-with-a-categorical-feature-set-using-scala) helpful? – Andrew Taylor Jul 19 '16 at 18:13
  • 1
    Or using Spark's [one-hot encoder](http://stackoverflow.com/questions/32277576/spark-ml-categorical-features) to create dummy variables – Andrew Taylor Jul 19 '16 at 18:18
  • I understand from the 2nd answer that it is pointing toward a map-like solution, but it is not clear to me how to achieve this in SparkR. I am not sure how I would have the map function return 3 columns in SparkR. – Béatrice Moissinac Jul 19 '16 at 18:18
  • OneHotEncoder seems like the way to go - From the documentation, it is available only in Scala, Java, and Python. So a solution to my problem is to prep my data in Scala, and then load it in R. – Béatrice Moissinac Jul 19 '16 at 18:22
  • I will say I recommend switching over to RStudio's sparklyr instead of Spark's SparkR. I've found it more intuitive and reliable. From there, there seems to be a way to inject 'raw Scala'. Or just one-hot encode your variable manually through the sparklyr connection – Andrew Taylor Jul 19 '16 at 18:25
  • Thank you, I had never heard of sparklyr, and quickly going through the tutorial, it looks great! Quick follow-up question: do you know if it is easily installed on AWS? – Béatrice Moissinac Jul 19 '16 at 18:32
  • How do you "inject" Scala with sparklyr? – Béatrice Moissinac Jul 19 '16 at 18:56

2 Answers


There is no direct equivalent. Spark encodes every type of variable using double-precision numbers and uses column metadata to distinguish between the different types. For ML algorithms you can use R formulas, which automatically encode string columns.
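To make that encoding concrete, here is a minimal pure-Python sketch (not SparkR API) of what Spark's string indexing does under the hood: each distinct label is mapped to a double, with the most frequent label getting index 0.0, and the ordered label list is kept as the "metadata" needed to decode values later. The function name `index_labels` is hypothetical.

```python
from collections import Counter

def index_labels(values):
    # Order labels by descending frequency, mirroring StringIndexer's default,
    # and map each label to a double-precision index.
    labels = [lab for lab, _ in Counter(values).most_common()]
    mapping = {lab: float(i) for i, lab in enumerate(labels)}
    # Return the encoded column plus the label list (the "metadata").
    return [mapping[v] for v in values], labels

encoded, labels = index_labels(["a", "b", "a", "c", "a", "b"])
# "a" is most frequent -> 0.0, then "b" -> 1.0, "c" -> 2.0
```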

  • Actually, what I did was to use a SQL query like this one: http://stackoverflow.com/questions/13309947/categorizing-data-based-on-the-value-of-a-field :) – Béatrice Moissinac Jul 20 '16 at 15:35

There are two ways of converting categorical variables:

  1. StringIndexer(): This will convert string values to numeric, and you can get back the original values using IndexToString(). StringIndexer is an Estimator, so we need to call fit() and then transform() to get the converted values.

  2. OneHotEncoder(): This will convert the categories into a sparse vector. You can control whether to drop the last category by setting dropLast to false. This is a Transformer, hence transform() alone is sufficient.

Refer to this link for more details: http://spark.apache.org/docs/latest/ml-features.html#stringindexer
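The dropLast behavior above is worth illustrating. Below is a pure-Python sketch (the helper `one_hot` is hypothetical, not Spark API) of the dummy vector OneHotEncoder produces for an already-indexed category: with dropLast enabled (Spark's default) the last category is represented by the all-zeros vector, which avoids collinearity in linear models.

```python
def one_hot(index, num_categories, drop_last=True):
    # With drop_last, the output vector has one fewer slot and the
    # last category maps to all zeros (like Spark's dropLast=true).
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

one_hot(0, 3)                   # [1.0, 0.0]
one_hot(2, 3)                   # [0.0, 0.0]  (last category dropped)
one_hot(2, 3, drop_last=False)  # [0.0, 0.0, 1.0]
```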

Aanish