Questions tagged [gapply]

6 questions
3 votes, 0 answers

Databricks - Dplyr on SparkDataframe

I am looking to run dplyr functions on a Spark DataFrame in Databricks. No matter how I modify my code, it always fails with the same error for each dplyr function I try. HDEF_df_test is a Spark…
nak5120
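A likely cause, assuming the error is something like "no applicable method for ... applied to an object of class SparkDataFrame": dplyr verbs do not dispatch on SparkR's SparkDataFrame class. A minimal sketch of two workarounds, using the asker's `HDEF_df_test` name and hypothetical `id`/`value` columns:

```r
library(SparkR)
library(dplyr)

# Option 1: use SparkR's own verbs, which mirror dplyr but work on
# a SparkDataFrame (note the explicit SparkR:: to avoid masking):
result <- SparkR::select(
  SparkR::filter(HDEF_df_test, HDEF_df_test$value > 0),
  "id", "value"
)

# Option 2: collect() to a local data.frame first, then use dplyr —
# only viable if the data fits in driver memory:
local_df <- collect(HDEF_df_test)
summary_df <- local_df %>%
  dplyr::filter(value > 0) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(n = n())
```

A third option, not shown, is sparklyr, whose tbl_spark objects are designed to work with dplyr verbs directly.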
2 votes, 1 answer

SparkR gapply - function returns a multi-row R dataframe

Let's say I want to execute something as follows: library(SparkR) ... df = spark.read.parquet() df.gapply( df, df$column1, function(key, x) { return(data.frame(x, newcol1=f1(x), newcol2=f2(x)) } ) where the…
Matt Anthony
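Returning a multi-row data.frame per group is supported by `gapply`; each group may contribute any number of rows as long as every row matches the declared schema. A hedged sketch, with hypothetical column names/types and a placeholder parquet path (the asker's `f1`/`f2` stand in for real functions):

```r
library(SparkR)

df <- read.parquet("path/to/data.parquet")  # hypothetical path

# The schema must describe every column of the returned data.frame,
# including the pass-through columns of x:
schema <- structType(
  structField("column1", "string"),
  structField("x",       "double"),
  structField("newcol1", "double"),
  structField("newcol2", "double")
)

out <- gapply(
  df,
  "column1",
  function(key, x) {
    # any number of rows is fine; only the column layout is fixed
    data.frame(x, newcol1 = f1(x), newcol2 = f2(x))
  },
  schema
)
```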
1 vote, 1 answer

Bizdays doesn't exclude weekends

I am trying to calculate utilization rates by relative employee lifespans. I need to assign a total number of hours available to each employee between the earliest and latest dates on which time was recorded. From there I will use this as the…
Rory
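A common cause of this symptom: `bizdays()` only excludes weekends if the calendar it is given lists them. With no calendar (or one created without `weekdays =`), every calendar day is counted. A minimal sketch with hypothetical dates:

```r
library(bizdays)

# The calendar must explicitly name the weekdays to exclude;
# omit this argument and bizdays() counts Saturdays and Sundays too.
create.calendar("workdays", weekdays = c("saturday", "sunday"))

# Business days between two hypothetical dates, using that calendar:
bizdays("2024-01-01", "2024-01-12", "workdays")
```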
1 vote, 0 answers

gapply sometimes returning duplicated groups?

I'm running some code, the relevant essence of which is: library(SparkR) library(magrittr) sqlContext %>% sql("select * from tmp") %>% gapply("id", function(key, x) { data.frame( id = key, n = nrow(x) ) }, schema =…
MichaelChirico
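If a per-group row count is all that is needed, SparkR's built-in grouped aggregation sidesteps `gapply` (and any duplicate-group behavior) entirely. A sketch assuming the same `tmp` table from the question:

```r
library(SparkR)
library(magrittr)

# Equivalent per-group row count without a user-defined function:
counts <- sql("select * from tmp") %>%
  groupBy("id") %>%
  count()

head(counts)
```

This pushes the aggregation into Spark's native engine rather than serializing each group out to an R worker.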
0 votes, 1 answer

Number of weekdays between two dates applied to groups in a grouped dataframe

I am trying to use gapply on a grouped df to get a timeline for time entry on projects. Below I want to get a column that will have available working time for a person based on working hours between the earliest date they booked time and the latest…
Rory
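One way to compute weekday counts per group inside `gapply` is to do the date arithmetic in base R within the UDF, so no extra package needs to be installed on the workers. A hedged sketch; `timesheet_df` and the `person`/`start`/`end` columns are assumptions for illustration:

```r
library(SparkR)

# One output row per person: their count of available weekdays.
schema <- structType(
  structField("person", "string"),
  structField("weekdays_available", "integer")
)

result <- gapply(
  timesheet_df,           # hypothetical SparkDataFrame of time entries
  "person",
  function(key, x) {
    days <- seq(min(as.Date(x$start)), max(as.Date(x$end)), by = "day")
    # POSIXlt$wday: 0 = Sunday ... 6 = Saturday, so 1:5 is Mon-Fri
    n_weekdays <- sum(as.POSIXlt(days)$wday %in% 1:5)
    data.frame(person = key[[1]], weekdays_available = as.integer(n_weekdays))
  },
  schema
)
```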
0 votes, 1 answer

Declaring output schema when using gapply in SparkR

I would like to use gapply according to https://spark.apache.org/docs/latest/sparkr.html#gapply. The problem is that I am returning a list of two dataframes: return(list(df1, df2)). How do I declare the output schema in this case?
bhomass
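The `gapply` UDF must return a single data.frame per group, so `list(df1, df2)` cannot be described by one schema. One workaround is to stack both frames with a tag column and declare the combined layout (the other is to run `gapply` twice, once per output). A sketch with hypothetical column names/types:

```r
library(SparkR)

# Combined schema: a "source" tag distinguishes rows from df1 vs df2.
schema <- structType(
  structField("source", "string"),
  structField("id",     "string"),
  structField("value",  "double")
)

out <- gapply(df, "id", function(key, x) {
  df1 <- data.frame(id = key[[1]], value = mean(x$value))
  df2 <- data.frame(id = key[[1]], value = sd(x$value))
  # rbind into one data.frame whose columns match `schema`:
  rbind(
    cbind(source = "df1", df1),
    cbind(source = "df2", df2)
  )
}, schema)
```

Downstream, the two logical outputs can be recovered with `SparkR::filter(out, out$source == "df1")` and likewise for `"df2"`.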