9

I have a SparkSQL DataFrame.

Some entries in this data are empty but they don't behave like NULL or NA. How could I remove them? Any ideas?

In R I can easily remove them but in sparkR it say that there is a problem with the S4 system/methods.

Thanks.

zero323
  • 322,348
  • 103
  • 959
  • 935
Ole Petersen
  • 670
  • 9
  • 21

2 Answers2

14

SparkR Column provides a long list of useful methods including isNull and isNotNull:

> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)

  Id Age
1  1  21
2  2  18
3  3  NA

> filter(people, isNotNull(people$Age)) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30

> filter(people, isNull(people$Age)) %>% head()
  Id Age
1  4  NA

Please keep in mind that there is no distinction between NA and NaN in SparkR.

If you prefer operations on a whole data frame there is a set of NA functions including fillna and dropna:

> fillna(people, 99) %>% head()
 Id Age
1  1  21
2  2  18
3  3  30
4  4  99

> dropna(people) %>% head()
 Id Age
1  1  21
2  2  18
3  3  30

Both can be adjusted to consider only some subset of columns (cols), and dropna has some additional useful parameters. For example you can specify minimal number of not null columns:

> people_with_names_local <- data.frame(
    Id=1:4, Age=c(21, 18, 30, NA), Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
4  4  NA  <NA>

> dropna(people_with_names, minNonNulls=2) %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
zero323
  • 322,348
  • 103
  • 959
  • 935
2

It is not the nicest workaround, but if you cast them as strings, they are stored as "NaN" and then you can filter them, a short example:

testFrame   <- createDataFrame(sqlContext, data.frame(a=c(1,2,3),b=c(1,NA,3)))
testFrame$c <- cast(testFrame$b,"string")

resultFrame <- collect(filter(testFrame, testFrame$c!="NaN"))
resultFrame$c <- NULL

This omits the entire row where the element in column b is missing.

Wannes Rosiers
  • 1,680
  • 1
  • 12
  • 18
  • It would require separate code for different types and will fail completely with booleans: `df <- createDataFrame(sqlContext, data.frame(x=c(FALSE, NA, TRUE))); df$chx <- cast(df$x,"string"); df %>% head()` – zero323 Jul 25 '15 at 14:56
  • Well the problem is that the entry is empty. Let me give an example: ID = 1 2 3. AGE = 21 EMPTY 20. So enry 2th in this data is empty but in sparkR it does not have the 'value' NAN, NA or NULL. The idea is maybe to give this empty entry a value? – Ole Petersen Jul 26 '15 at 11:34
  • Could you give an initialization example? Just to understand if you have data.frame(ID=c(1,2,3),AGE=c(21,"EMPTY",20)) or data.frame(ID=c(1,2,3),AGE=c(21,,20)) or still something else. – Wannes Rosiers Jul 27 '15 at 06:40
  • I have a dataset from a csv-file. I call it 'pgz'. When I type pgz in sparkR I get "DataFrame[id:string, age:string]" as output. Lets look at first data. ID = 1 2 3. Lets say AGE for ID=1 is 21 and AGE for ID=3 is 20. AGE for ID=2 is empty. This gives me a problem in sparkR when I will run a function on the data. Example: Take the AGE for each ID and add 1 to it. This works fine for ID=1 and ID=3 but it crashes for ID=2 because this ID is empty. Therefore I wish to delete this ID. – Ole Petersen Jul 27 '15 at 07:54