I want to generate normally distributed random numbers as a new column using mutate. I tried using runif(), but it throws an error on large-scale data.

extract_grp <- extract_grp %>%
mutate(rand = runif(sdf_nrow(extract_grp)))
glimpse(extract_grp)

The error I am getting is:

Error: org.apache.spark.sql.AnalysisException: Undefined function: 'RUNIF'. This function is neither a registered temporary function nor a permanent function registered in the database 'temp_data'.; line 1 pos 101 at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:999) at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:202) at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:174) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:897)

Gregor Thomas
Anil Kumar
  • Including a [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) in your question will increase your chances of getting an answer. – Samuel Oct 04 '17 at 15:00
  • Thanks, jsb, for the response. But I doubt it works with dplyr and a Spark data frame. Moreover, I want to generate normally distributed random numbers. I did not find any helpful content, or I did not completely understand what you are trying to say. Please explain if I need a correction. – Anil Kumar Oct 04 '17 at 15:12
  • If your question consists of two different questions, please separate them and ask two questions instead of one nested question. – Samuel Oct 04 '17 at 15:13
  • If you are using `dplyr` (and `sparklyr`?) to connect to a Spark cluster, you should mention that in your question. The problem isn't that your data is large-scale, the problem is that it is stored in Spark and dplyr doesn't know how to translate `runif` to a Spark command. – Gregor Thomas Oct 04 '17 at 16:21
  • Apologies for that. Thanks, Gregor, for taking up my concern. Yes, you are right, the problem is with Spark and dplyr. Is there any alternative for this? Thanks in advance. – Anil Kumar Oct 05 '17 at 07:06
  • rand() solved my issue to an extent. I am able to generate a random sequence for my Hive table. But what I am stuck on is seeding. set.seed() works in local R, but it does not work with sparklyr, i.e. on an R Spark Hive cluster. Any alternative you can think of would be helpful. – Anil Kumar Oct 06 '17 at 07:40

1 Answer

rand() solved my issue to an extent.

extract_grp <- extract_grp %>%
    mutate(rand = rand())
glimpse(extract_grp)

I am able to generate a random sequence for my Hive table. But what I am stuck on is seeding: set.seed() works in local R, but it does not work with sparklyr.
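One possible approach to the seeding problem, offered as a sketch rather than a confirmed fix: set.seed() only affects R's local random number generator, so it has no effect on values computed inside Spark. Spark SQL's own rand() and randn() functions accept an optional integer seed, and randn() draws from a standard normal distribution, which is what the question originally asked for. The example below assumes a local Spark installation and uses a table built with sdf_len() as a hypothetical stand-in for extract_grp; the column names unif and norm are also illustrative.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; use your cluster's master instead.
sc <- spark_connect(master = "local")

# Hypothetical stand-in for extract_grp: a Spark table with 1000 rows.
extract_grp <- sdf_len(sc, 1000)

# rand() and randn() are not R functions. dplyr passes calls it does not
# recognise through to Spark SQL unchanged, so they run inside Spark.
# 42L is an integer literal, so Spark receives it as the seed 42.
extract_grp <- extract_grp %>%
  mutate(
    unif = rand(42L),   # uniform on [0, 1)
    norm = randn(42L)   # standard normal (mean 0, sd 1)
  )

glimpse(extract_grp)

# Bring the result into local R to inspect it.
local_df <- collect(extract_grp)

spark_disconnect(sc)
```

For a different mean and standard deviation, the column can be scaled inside mutate(), for example norm = 10 + 2 * randn(42L). Note that a seed makes the output repeatable only while the data's partitioning stays the same; if the table is repartitioned, the same seed can produce different values per row.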

Anil Kumar