I have a large list in R. It comprises 20 different dataframes, each with two variables (columns). The variables are the same in every dataframe in the list. I am collapsing the list with the `rbindlist` function from the data.table package. This works, yielding a single dataframe with all of the observations of the two variables. However, I would like to add a third variable to each list (and to the final dataframe) that contains the number of the list that each unit/observation is contained in.

For example: Unit 1 is (1, 235) and is located in the first list, so it should now be (1, 235, 1) in the new dataframe. Unit 2 is (2, 248) and is also located in the first list, so its three column values should now be (2, 248, 1). Unit 3 is (3, 78), but it is located in the second list, so its three column values should now be (3, 78, 2).

Is it possible to preserve the number/placement of the individual unit within the large list when collapsing it into one dataframe?

flâneur
  • Please provide a minimal reproducible example: [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – PaulS Jul 07 '22 at 18:14

1 Answer

Are you trying to do this?

library(dplyr)

df1 <- data.frame(v1 = 1:5, v2 = sample(50:60, 5))
df2 <- data.frame(v1 = 6:10, v2 = sample(50:60, 5))
df3 <- data.frame(v1 = 11:15, v2 = sample(50:60, 5))

l <- list(df1, df2, df3)

bind_rows(l, .id = "id")

#>    id v1 v2
#> 1   1  1 58
#> 2   1  2 54
#> 3   1  3 56
#> 4   1  4 57
#> 5   1  5 60
#> 6   2  6 57
#> 7   2  7 56
#> 8   2  8 55
#> 9   2  9 60
#> 10  2 10 52
#> 11  3 11 59
#> 12  3 12 51
#> 13  3 13 52
#> 14  3 14 55
#> 15  3 15 53

Created on 2022-07-08 by the reprex package (v2.0.1)

I don't know data.table well, which is why I used dplyr instead.
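For readers who want to stay with data.table, the same idea can be sketched with `rbindlist` and its `idcol` argument. This is only an illustration: the data values and the column name `list_id` are made up here, and the final `setcolorder` call is just one way to move the new column to the third position, as the question describes.

```r
library(data.table)

# Three small data frames with the same two columns (illustrative values)
df1 <- data.frame(v1 = 1:2, v2 = c(235, 248))
df2 <- data.frame(v1 = 3:4, v2 = c(78, 91))
df3 <- data.frame(v1 = 5:6, v2 = c(12, 40))

l <- list(df1, df2, df3)

# idcol adds a column with the position of each source data frame in the list
# ("list_id" is a hypothetical name chosen for this sketch)
combined <- rbindlist(l, idcol = "list_id")

# The id column is placed first by default; move it to the end
setcolorder(combined, c("v1", "v2", "list_id"))

combined
```

If the list is named, `idcol` uses the element names instead of their integer positions.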

shafee
  • That's almost what I am looking to do, except when I run this, it only gives me the placement of the unit within its respective dataframe. I don't want that; I want just the number of the dataframe it came from, like your example shows. Is there some code I'm missing in `bind_rows`? – flâneur Jul 07 '22 at 19:06
  • Could you show your code, @flâneur? The `id` column here is 1 if the data comes from dataframe 1, 2 if the data comes from dataframe 2, and so on. The `id` column is your desired third column. – shafee Jul 07 '22 at 19:12
  • I cannot easily share code at this time, but I'm wondering if the problem is that the original individual files do have a variable called "id" (giving the number within the individual dataframe). If I then use the `bind_rows` function like you suggested, is it mistakenly taking that variable and not the number of the dataframe (like I want)? If so, is there a way around that? – flâneur Jul 07 '22 at 19:20
  • No, that should not be the case. If your original dataframes have an `id` variable, it would be overwritten by the new one made by `bind_rows`. – shafee Jul 07 '22 at 19:24
  • That seems to be the issue. When I use `bind_rows`, the list actually needs to be bound twice, I guess, which is why I needed to use `rbindlist` afterwards. The first `bind_rows` only collapses the large list once, but it remains a large list until the `rbindlist` below takes care of it. That is why the id gives individual values, not the number of the dataframe. – flâneur Jul 07 '22 at 19:36
  • Yes, this works now. I just needed to use `.id` (or, in the `rbindlist` function's case, `idcol`) after the initial `bind_rows` to have it report the dataframe number. Apologies for the confusion and thank you for all the help! – flâneur Jul 07 '22 at 19:39
  • @flâneur just use a different name, e.g. `idcol = 'group'` – Onyambu Jul 07 '22 at 20:20
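
The situation described in the comments can be sketched as follows. The asker never shared their data, so the nested structure below is an assumption: each top-level element is taken to be a list of data frames, and every data frame already has its own `id` column. The column name `group` follows the suggestion in the last comment.

```r
library(data.table)

# Hypothetical nested input: each top-level element is itself a list of
# data frames, and every data frame already carries an `id` column
nested <- list(
  list(data.frame(id = 1:2, v2 = c(235, 248)), data.frame(id = 3L, v2 = 60)),
  list(data.frame(id = 1:2, v2 = c(78, 91)))
)

# Step 1: collapse each inner list into a single table
inner <- lapply(nested, rbindlist)

# Step 2: collapse the outer list and record the outer position in `group`,
# a name that does not clash with the existing `id` column
combined <- rbindlist(inner, idcol = "group")

combined
```

Using a distinct name for the new column keeps the original within-dataframe `id` values alongside the number of the list element each row came from.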