I have a large dataframe (just over 8,500,000 cells in total) and I need to create some subsets of this dataframe based on the values in a specific column.

I am aware that I can create said subsets by hand and am happy doing this when there are only a few values. At present, I obtain the unique values:

table(df$ColumnX)

and then construct the individual dataframes from there as there are only a few values:

df.subset1 <- df[df$ColumnX == "Subset1", ]
df.subset2 <- df[df$ColumnX == "Subset2", ]
...
df.subsetX <- df[df$ColumnX == "SubsetX", ]

But when there are significantly more unique values, creating each subset by hand becomes impractical, and I would need to let the computer do the work to achieve my goal in a timely manner.

What I want to know is if this process can be automated.

Something like this is what I am hoping to achieve:

- List values in Column X
- Create a new dataframe/subset for each value in Column X

Or:

for(all unique values in Column X)
    create a new dataframe
end for

Therefore, I would have something like this based on the values of ColumnX:

df.subset1
df.subset2
...
df.subsetX
Mus
  • Just use `split` i.e. `dflist <- split(df, df$ColumnX)`. This will give you a list of data.frames – talat Sep 01 '17 at 08:34
  • I see. And how do I extract the data.frames from the list? – Mus Sep 01 '17 at 08:38
  • If you take the example of `x <- split(iris, iris$Species)`, you can extract the list elements using either `x$setosa` or `x[[1]]` or `x[["setosa"]]` – talat Sep 01 '17 at 08:44
  • Some additional context is available here: https://stackoverflow.com/a/24376207/3521006 – talat Sep 01 '17 at 08:48
  • Don't extract them at all. Keeping them in a list is a much better option and will streamline your further analysis. – talat Sep 01 '17 at 10:19
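
To illustrate the suggestion in the comments above, here is a minimal sketch of working with the subsets while they stay in the list. It uses the built-in `iris` data set from the comments; the names `species_list`, `row_counts` and `mean_sepal` are made up for this example.

```r
# Split the built-in iris data by Species; the result is a named list
# with one data.frame per unique value of the splitting column
species_list <- split(iris, iris$Species)

# The list names are the unique values of the splitting column
names(species_list)

# Apply the same operation to every subset without extracting them
row_counts <- sapply(species_list, nrow)
mean_sepal <- sapply(species_list, function(d) mean(d$Sepal.Length))

# A single subset is still available by name when needed
head(species_list[["setosa"]])
```

Because every subset lives in one object, the same `lapply`/`sapply` call covers all of them, however many unique values the column has.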

2 Answers

Sample Dataset:

zz <- "A1   A2   A3   A4   A5
Z    Z    1    10   12
E    Y    10   12    8
D    X    2    12   15
Z    Z    1    10   12
D    X    2    14   16"
df <- read.table(text=zz, header = TRUE)

s1 <- split(df, df$A1)
list2env(s1,envir=.GlobalEnv)

Each element of the list is then stored as a separate data frame in your global environment, named after the corresponding value of A1:

> D
  A1 A2 A3 A4 A5
3  D  X  2 12 15
5  D  X  2 14 16
> E
  A1 A2 A3 A4 A5
2  E  Y 10 12  8
> Z
  A1 A2 A3 A4 A5
1  Z  Z  1 10 12
4  Z  Z  1 10 12
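
If the objects should carry a prefix, as in the `df.subset1` style from the question, one possible variation is to rename the list elements before calling `list2env`. This is only a sketch: the `df.` prefix and the reduced two-column sample data are assumptions made for illustration.

```r
# Small sample data with the same A1 values as above (reduced for brevity)
df <- data.frame(A1 = c("Z", "E", "D", "Z", "D"),
                 A3 = c(1, 10, 2, 1, 2))

# One data.frame per unique value of A1
s1 <- split(df, df$A1)

# Prefix each list name so the created objects are easy to recognise
names(s1) <- paste0("df.", names(s1))

# Creates df.D, df.E and df.Z in the global environment
list2env(s1, envir = .GlobalEnv)
```

Note that `list2env` silently overwrites any existing objects with the same names, which is one reason a prefix can be useful.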
Prasanna Nandakumar

I agree with @docendo that keeping the data frames in a list is generally more efficient.

But for the record, you could also use `assign`:

# Row indices for each subset (illustrative)
list_index <- list(1:5, 6:8, 10:13)

for (i in seq_along(list_index)) {
  # Creates df_1, df_2 and df_3 in the current environment
  assign(paste0("df_", i), mtcars[list_index[[i]], ])
}
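
The same idea can be adapted to the question's case, where the subsets are defined by the unique values of a column rather than by row indices. The following is a sketch using the `cyl` column of the built-in `mtcars` data; the `df_cyl_` prefix is an arbitrary name chosen for this example.

```r
# Loop over the unique values of the column and create one
# data.frame per value, named after that value
for (value in unique(mtcars$cyl)) {
  assign(paste0("df_cyl_", value), mtcars[mtcars$cyl == value, ])
}

# The created objects: df_cyl_4, df_cyl_6 and df_cyl_8
ls(pattern = "^df_cyl_")
```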
YCR