
I have a large data frame like this:

df:

col_1   col_2  col_3
  1       2       1

and I want to divide it into these subgroups very quickly:

df_1:

 col_1   col_3
  1       1

df_2:

  col_2
    2

I know one way to do it, which is like this:

df_1 <- df[df == 1]
df_2 <- df[df == 2]

but it's not fast. What should I do?

thanks

– MMRA

3 Answers


An option with dplyr and tidyr:

library(dplyr)

df %>% 
  tidyr::gather(key, val) %>% 
  group_split(val)   # thanks to @agila for pointing out the unnecessary call to group_by that I missed initially
[[1]]
# A tibble: 2 x 2
  key     val
  <chr> <int>
1 col_1     1
2 col_3     1

[[2]]
# A tibble: 1 x 2
  key     val
  <chr> <int>
1 col_2     2

attr(,"ptype")
– NelsonGon
  • I think it should be something like `df %>% gather(key, val) %>% group_split(val)` – agila Jul 28 '19 at 13:41
  • Still doesn't achieve OP's desired output, at least on my machine. – NelsonGon Jul 28 '19 at 13:48
  • @agila Fixed it (or so I think). It seems OP is more interested in grouping by val rather than key, which on second look is what you wrote there and I missed. Corrected it, thanks. – NelsonGon Jul 28 '19 at 13:50

Welcome to SO.

I'd suggest looking at the dplyr and data.table packages, which are focused on fast and memory-efficient implementations. In particular, I'd suggest the amazing answers to this question, which will give a good understanding of what these two packages are capable of.

data.table does tend to outperform dplyr as the number of groups and repeated subsets grows, as it utilizes indexed and keyed subsets, but for most uses it comes down to preference. Focusing on subsetting, I'll provide a reproducible example and some speed comparisons.

Reproducible example

set.seed(1)
df <- data.frame(group = sample(LETTERS, 1e7, TRUE), 
                 random_numbers = rnorm(1e7), 
                 random_binaries = rbinom(1e7, 1, 0.3))
# size = 152.6 MiB
format(object.size(df), units = "MiB")

Methods:

Base-R methods

In base R, subsetting can be performed in a myriad of ways; one is the one you have shown yourself, df[df == ..]. An alternative is the subset function; however, this is a utility function focused on readability rather than speed, and it will usually perform worse. Examples of both are shown below. One may also use the which function to convert a logical vector into indices, which may improve performance.

df[df$group == "C",]
#Equivalent
df[which(df$group == "C"),]
#Equivalent
subset(df, group == "C")

dplyr methods

An alternative is the dplyr package. dplyr is largely syntactic sugar, providing piping options not unlike a few other packages (for example the magrittr package), but various benchmarks (shown in the first link) show that it can be used to improve performance in several areas. However, I am no expert on this package, as I tend to use data.table. The package provides the %>% pipe and utility functions such as filter, which can be used for subsetting data.

library(dplyr)
df %>% filter(group == "C")
# subsetting two columns
df %>% filter(group == "C", random_binaries == TRUE) #Equivalent to group == "C" & random_binaries == TRUE

data.table methods

Lastly, another popular package is data.table. This package is designed for performance and memory efficiency, like dplyr. The syntax is designed to resemble SQL statements (select, from, where, group by), but it can be a bit confusing when starting out. The package provides a new data.table class, to be used instead of the data.frame class, which is notoriously slow for subsetting.

However, one can almost completely ignore the package's syntax, as a data.table accepts data.frame syntax in most cases and can be used as a data.frame in every circumstance.

library(data.table)
#Convert the data.frame to data.table
setDT(df) 
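
For instance, as a quick sketch of the point above, ordinary data.frame syntax keeps working on the converted object:

# data.frame-style subsetting still works after setDT()
df[df$group == "C", ]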

The data.table package has two standard subsetting mechanisms: using indices and using keys. Indices are used when one subsets with expressions similar to the data.frame methods:

df1 <- df[random_binaries == TRUE]
df2 <- df[group == "C"]

Indices give roughly the same speed on the first use, but performance improves on every subsequent use, since the index built during the first subset is reused.
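
If the same column is going to be subset repeatedly, the index can also be created explicitly up front instead of waiting for the first subset to build it. A minimal sketch using data.table's setindex():

# build a secondary index on group ahead of time; indices() lists the existing indices
setindex(df, group)
indices(df)
# subsequent subsets on group can reuse the index instead of scanning the whole column
df[group == "C"]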

Keys pre-sort the data.table, which allows for smart subsetting. Setting the key does take some time and has a slightly different syntax, but it outperforms the other methods (although indices are similar in speed).

#Set the key using either setkey, or setkeyv (multiple columns)
setkeyv(df, c("group", "random_numbers", "random_binaries")[c(1, 3)])
#Subset on group (the first key column)
df[.("C")]
#Subset on random_binaries (the second key column) by crossing it with all groups
df[CJ(group, TRUE, unique = TRUE)]
df[.(unique(group), TRUE)]
#Subset on multiple conditions
df[.(c("C", "H"), c(TRUE, TRUE))]

The syntax may be confusing, but one can check out the package's useful wiki page, or the many Stack Overflow posts (8,968 as of today), which answer most questions.

Performance comparison

I've checked the performance of the subsetting methods presented, visualized below. The visualization shows the various methods for a subset on group == "C" and for a subset on group == "H" & random_binaries == TRUE. The x-axis indicates the run time in milliseconds, and the y-axis shows the methods. The width of each blob indicates the range of run times, while its size illustrates the density of times within that range.

From the visualization one can see that, for subsets on both one and two columns, the data.table method using keys is much, much faster (marked as data.table_.._keyed), while using indices slightly outperforms the remaining methods. Using subset is slower than the standard methods and, surprisingly, for this illustration dplyr is slower than base R; however, this might be due to my inexperience with the package.

[Benchmark plot: run times in milliseconds for each subsetting method]
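
A comparison along these lines can be reproduced with, for example, the microbenchmark package; the following is only a rough sketch (assuming df has been converted and keyed as above), not the exact setup behind the plot:

library(microbenchmark)
library(dplyr)
library(data.table)

# plain data.frame copy for the base-R and dplyr methods
df_base <- as.data.frame(df)

microbenchmark(
  base     = df_base[df_base$group == "C", ],
  which    = df_base[which(df_base$group == "C"), ],
  subset   = subset(df_base, group == "C"),
  dplyr    = df_base %>% filter(group == "C"),
  dt       = df[group == "C"],
  dt_keyed = df[.("C")],
  times = 25
)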

– Oliver

Here's one way using lapply from base R which gives you a list of your desired data frames -

df <- data.frame(col_1 = 1, col_2 = 2, col_3 = 1)

lapply(unique(unlist(df)), function(x) {
  # keep only the columns whose value equals x
  df[, df == x, drop = FALSE]
})

# output

[[1]]
  col_1 col_3
1     1     1

[[2]]
  col_2
1     2
– Shree