Does dplyr mutate support runif?

Question

I want to generate normally distributed random numbers as a column using mutate. I tried runif(), but it throws an error when run against large-scale data.

    extract_grp <- extract_grp %>%
        mutate(rand = runif(sdf_nrow(extract_grp)))
    glimpse(extract_grp)

The error I am getting is:

    Error: org.apache.spark.sql.AnalysisException: Undefined function: 'RUNIF'. This function is neither a registered temporary function nor a permanent function registered in the database 'temp_data'.; line 1 pos 101
        at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:999)
        at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:202)
        at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:174)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:897)

Including a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your question will increase your chances of getting an answer. — Samuel, Oct 04 '17 at 15:00
Thanks jsb for the response. But I doubt it works with dplyr and a Spark data frame. Moreover, I want to generate normally distributed random numbers. I did not find any helpful content, or I did not completely understand what you are trying to say. Please explain if I need a correction. — Anil Kumar, Oct 04 '17 at 15:12
If your question consists of two different questions, please separate them and ask two questions instead of one nested question. — Samuel, Oct 04 '17 at 15:13
If you are using `dplyr` (and `sparklyr`?) to connect to a Spark cluster, you should mention that in your question. The problem isn't that your data is large-scale, the problem is that it is stored in Spark and dplyr doesn't know how to translate `runif` to a Spark command. — Gregor Thomas, Oct 04 '17 at 16:21
Apologies for that. Thanks, Gregor, for taking up my concern. Yes, you are right, it's a problem with Spark and dplyr. Is there any alternative for this? Thanks in advance. — Anil Kumar, Oct 05 '17 at 07:06
rand() solved my issue to an extent. I am able to generate a random sequence for my Hive table. But what I am stuck on is seeding. set.seed() works for local R but does not work on sparklyr, i.e. on an R Spark Hive cluster. Any alternative you can think of would be helpful. — Anil Kumar, Oct 06 '17 at 07:40

score 1 · Answer 1 · answered Oct 06 '17 at 07:40

rand() solved my issue to an extent.

    extract_grp <- extract_grp %>%
        mutate(rand = rand())
    glimpse(extract_grp)

I am able to generate a random sequence for my Hive table. But what I am stuck on is seeding: set.seed() works for local R but does not work on sparklyr.
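On the seeding point, one possible approach (a sketch, not something confirmed in this thread) is to pass an integer seed directly to Spark SQL's rand() and use randn() for normally distributed values. sparklyr hands functions that dplyr does not recognise through to Spark SQL, which is why rand() works at all. The local connection, the small stand-in table, and the seed value 42L below are assumptions for illustration only. Note that Spark describes these functions as non-deterministic in the general case, so the same seed should only be expected to reproduce values while the partitioning of the data stays the same.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark install; replace with your own cluster settings.
sc <- spark_connect(master = "local")

# Hypothetical stand-in for the Hive-backed table from the question.
extract_grp <- copy_to(sc, data.frame(id = 1:10), "extract_grp", overwrite = TRUE)

extract_grp <- extract_grp %>%
    mutate(
        rand_unif = rand(42L),  # Spark SQL rand(seed): uniform on [0, 1)
        rand_norm = randn(42L)  # Spark SQL randn(seed): standard normal
    )

glimpse(extract_grp)

spark_disconnect(sc)
```

The seed is written as an integer literal (42L) because Spark expects an integer seed; a plain 42 may be sent as a decimal and rejected.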
