Split R dataset containing a column with 3 string values into 2 datasets containing 2 string values

Question

I'm new to R. I have searched here for this specific problem but haven't found an answer yet. Consider the following simplified CSV file, r_split.csv, which represents my dataset:

id,v1,v2,v3,v4,str
1,2.4,2.4,345.5,234.2,gbbc
2,4.5,2.56,7.45,34.6,ebird
3,3.4,5.6,4.45,6.3,ebird_can

The first row contains the header names. The column str contains 3 different string values: gbbc, ebird, and ebird_can. My objective is to split this big dataset into 2 datasets. The first will contain only the rows where str is gbbc, and the second will contain all the rows where str is ebird or ebird_can, with both values renamed to allebird.

I can split the dataset into 3 distinct datasets by using the following command:

splitted<-split(rsplit,rsplit$str)

However, I cannot figure out how to combine 2 distinct values of the str column into a single new value before splitting. Can someone help me out please?

Thanks.

score 2 · Accepted Answer · edited May 23 '17 at 10:28

First, make sure the str column is not a factor. Use the stringsAsFactors = FALSE option in read.csv() to load all strings as characters rather than as factors.
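
As a minimal sketch of that first step, the sample data from the question can be read from an inline string instead of a file, so the snippet runs on its own. The name csv_text is only illustrative. Since R 4.0.0, stringsAsFactors = FALSE is already the default, so the explicit argument mainly matters on older versions.

```r
# Inline copy of the sample r_split.csv shown in the question
csv_text <- "id,v1,v2,v3,v4,str
1,2.4,2.4,345.5,234.2,gbbc
2,4.5,2.56,7.45,34.6,ebird
3,3.4,5.6,4.45,6.3,ebird_can"

# Keep strings as plain character vectors instead of factors
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Check how the str column was loaded
class(df$str)
```

With a real file, read.csv("r_split.csv", stringsAsFactors = FALSE) would take the place of the text = argument.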

Second, it's fine to use subset during an interactive session. However, as this post (or the direct link to Hadley's wiki) nicely explains, it is not wise to use it within your own functions.

I'd recommend direct subsetting with [.

df1 <- df[df$str == "gbbc", ]
df2 <- df[df$str != "gbbc", ]
df2$str <- "allebird"
> df1
#   id  v1  v2    v3    v4  str
# 1  1 2.4 2.4 345.5 234.2 gbbc
> df2
#   id  v1   v2   v3   v4      str
# 2  2 4.5 2.56 7.45 34.6 allebird
# 3  3 3.4 5.60 4.45  6.3 allebird

Alternatively, if you only need the two groups "gbbc" and "allebird", you can first replace every value other than "gbbc" with "allebird" and then, as you mention, use split.

df3 <- df
df3$str[df3$str != "gbbc"] <- "allebird"
split(df3, df3$str)
# $allebird
#   id  v1   v2   v3   v4      str
# 2  2 4.5 2.56 7.45 34.6 allebird
# 3  3 3.4 5.60 4.45  6.3 allebird
# 
# $gbbc
#   id  v1  v2    v3    v4  str
# 1  1 2.4 2.4 345.5 234.2 gbbc
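
Note that the != "gbbc" test would also sweep any other value that might appear in str into allebird. If that matters for your real data, a variation is to name the values to merge explicitly with %in%. This is only a sketch on a cut-down copy of the sample data, and ebird_values, df4 and parts are illustrative names.

```r
# Cut-down copy of the sample data, with str kept as character
df <- data.frame(
  id  = 1:3,
  v1  = c(2.4, 4.5, 3.4),
  str = c("gbbc", "ebird", "ebird_can"),
  stringsAsFactors = FALSE
)

# Illustrative name: the values that should be merged into "allebird"
ebird_values <- c("ebird", "ebird_can")

# Work on a copy so the original str values are preserved
df4 <- df
df4$str[df4$str %in% ebird_values] <- "allebird"

# One data frame per remaining value of str
parts <- split(df4, df4$str)
names(parts)
```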

Thank you for the excellently explained answer ! I tried it out on my data. It works perfectly. — Shion, Mar 18 '13 at 21:19

score 2 · Answer 2 · answered Mar 18 '13 at 20:50

You can use the levels function to change and merge the levels of a factor. In your case (assuming that str is already a factor with the default alphabetical ordering of levels, i.e. ebird, ebird_can, gbbc) you could do something like:

levels(rsplit$str) <- c('allebird','allebird','gbbc')
splitted<-split(rsplit,rsplit$str)

You may want to make a copy of rsplit first and modify the copy rather than the original (if you want to keep the original with the original levels).
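
A small sketch of that copy-first approach, using a reduced stand-in for rsplit with only the id and str columns (rsplit2 is an illustrative name for the copy):

```r
# Reduced stand-in for the question's data; str is a factor here
rsplit <- data.frame(
  id  = 1:3,
  str = factor(c("gbbc", "ebird", "ebird_can"))
)

# Default level order is alphabetical: ebird, ebird_can, gbbc
levels(rsplit$str)

# Modify a copy so the original keeps its three levels
rsplit2 <- rsplit
levels(rsplit2$str) <- c("allebird", "allebird", "gbbc")

# Duplicate level names are merged, leaving two groups to split on
splitted <- split(rsplit2, rsplit2$str)
names(splitted)
```

Checking levels() before the assignment is worthwhile, because the replacement vector is matched to the existing levels purely by position.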

For a more complicated example you can use tools like grep, gsub, or the gsubfn package to create the new vector of factor levels.
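
For instance, a pattern-based version with gsub could look like the sketch below. The regular expression is an assumption that fits the three sample values (anything starting with "ebird" is merged), and str_f and new_levels are illustrative names.

```r
# Stand-in factor holding the three values from the question
str_f <- factor(c("gbbc", "ebird", "ebird_can"))

# Build the new level names from the old ones: every level that
# starts with "ebird" becomes "allebird", other levels are untouched
new_levels <- gsub("^ebird.*$", "allebird", levels(str_f))

# Assigning repeated names merges the corresponding levels
levels(str_f) <- new_levels
as.character(str_f)
```

This avoids typing the replacement vector by hand, which helps when there are many levels or their order is not known in advance.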
