3

I am hoping to determine an efficient way to convert a single data frame into a list of data frames. Below is my reproducible MWE:

set.seed(1)
ABAge = runif(100)
ABPoints = rnorm(100)
ACAge = runif(100)
ACPoints = rnorm(100)
BCAge = runif(100)
BCPoints = rnorm(100)

A_B <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = ABAge, Points = ABPoints)
A_C <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = ACAge, Points = ACPoints)
B_C <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = BCAge, Points = BCPoints)
A_B$ID <- as.character(A_B$ID)
A_C$ID <- as.character(A_C$ID)
B_C$ID <- as.character(B_C$ID)

listFormat <- list("A_B" = A_B, "A_C" = A_C, "B_C" = B_C)

dfFormat <- data.frame(ID = as.character(paste0("ID", 1:100)), A_B.Age = ABAge, A_B.Points = ABPoints, A_C.Age = ACAge, A_C.Points = ACPoints, B_C.Age = BCAge, B_C.Points = BCPoints)
dfFormat$ID = as.character(dfFormat$ID)

This results in a data frame format (dfFormat) that looks like this:

'data.frame':   100 obs. of  7 variables:
 $ ID        : chr  "ID1" "ID2" "ID3" "ID4" ...
 $ A_B.Age   : num  0.266 0.372 0.573 0.908 0.202 ...
 $ A_B.Points: num  0.398 -0.612 0.341 -1.129 1.433 ...
 $ A_C.Age   : num  0.6737 0.0949 0.4926 0.4616 0.3752 ...
 $ A_C.Points: num  0.409 1.689 1.587 -0.331 -2.285 ...
 $ B_C.Age   : num  0.814 0.929 0.147 0.75 0.976 ...
 $ B_C.Points: num  1.474 0.677 0.38 -0.193 1.578 ...

and a list of data frames listFormat that looks like this:

List of 3
 $ A_B:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
  ..$ Points: num [1:100] 0.398 -0.612 0.341 -1.129 1.433 ...
 $ A_C:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.6737 0.0949 0.4926 0.4616 0.3752 ...
  ..$ Points: num [1:100] 0.409 1.689 1.587 -0.331 -2.285 ...
 $ B_C:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.814 0.929 0.147 0.75 0.976 ...
  ..$ Points: num [1:100] 1.474 0.677 0.38 -0.193 1.578 ...

I am hoping to come up with an automated way to convert the dfFormat to listFormat. As can be seen in the above objects there are two main conditions:

  1. The column ID is always the first column in dfFormat and it is always the first column in each sublist of listFormat.

  2. The number of sublists is equal to the number of unique column names in dfFormat before the underscore ('_'). In this case, that is three prefixes (ex. "A_B", "A_C", and "B_C"). These prefixes are also the names of the three sublists.

  3. Within each sublist, it contains the number of columns that had the associated prefix ("A_B"). For each sublist, this was two ("Age" and "Points"). These suffixes are the names of the columns.

I asked the reverse question here (i.e. how to go from listFormat to dfFormat) and got some helpful answers that I am learning from. I need to have code to reverse both directions and it seems the reverse direction may need new types of code. I put my attempt below to show how I am stuck!

conUnd <- which(sapply(colnames(dfFormat), function(x) grepl("_", x)))
listName <- sapply(colnames(dfFormat[,conUnd]), function(x) strsplit(x, "[.]")[[1]][1])
uListName <- unique(sapply(colnames(dfFormat[,conUnd]), function(x) strsplit(x, "[.]")[[1]][1]))
listCol <- sapply(colnames(dfFormat[,conUnd]), function(x) strsplit(x, "[.]")[[1]][2])

listFormat = list()
for (i in 1:length(uListName)){
   [Gets messy here trying to define column names based on string variables]
}

Any advice would be greatly appreciated. I know my code is not efficient.

Parfait
  • 104,375
  • 17
  • 94
  • 125
lavenderGem
  • 117
  • 8

3 Answers3

4

You can use split.default in base R -

output <- lapply(split.default(dfFormat[-1], sub("\\..*", "",names(dfFormat[-1]))), 
          function(x) cbind(dfFormat[1], setNames(x, sub(".*\\.", "", names(x)))))
str(output)

#List of 3
# $ A_B:'data.frame':   100 obs. of  3 variables:
#  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
#  ..$ Age   : num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
#  ..$ Points: num [1:100] 0.398 -0.612 0.341 -1.129 1.433 ...
# $ A_C:'data.frame':   100 obs. of  3 variables:
#  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
#  ..$ Age   : num [1:100] 0.6737 0.0949 0.4926 0.4616 0.3752 ...
#  ..$ Points: num [1:100] 0.409 1.689 1.587 -0.331 -2.285 ...
# $ B_C:'data.frame':   100 obs. of  3 variables:
#  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
#  ..$ Age   : num [1:100] 0.814 0.929 0.147 0.75 0.976 ...
#  ..$ Points: num [1:100] 1.474 0.677 0.38 -0.193 1.578 ...
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
2

A solution using the tidyverse. It involves converting the data frame to long format, split the columns, spread it, ans the split the data frame based on the group name. In the last line, as.data.frame(stringsAsFactors = FALSE) is not necessary as tibble is also a data frame. I added it so that I can show you the output is the same as your expected list.

library(tidyverse)

listFormat_output <- dfFormat %>%
  pivot_longer(cols = -ID, names_to = "Type") %>%
  separate(Type, into = c("Group", "Parameter"), sep = "\\.") %>%
  pivot_wider(names_from = Parameter) %>%
  group_split(Group) %>%
  setNames(nm = map_chr(., ~unique(.x$Group))) %>%
  map(~.x %>% select(-Group) %>% as.data.frame(stringsAsFactors = FALSE))

# Check if the output is the same as the expected list
identical(listFormat, listFormat_output)
# [1] TRUE
www
  • 38,575
  • 12
  • 48
  • 84
-2

Using mget, ls, and regular expression seems to get what you want.

DATA:

Edit: Your dataframe dfFormat, which has this structure:

str(dfFormat)
'data.frame':   100 obs. of  7 variables:
 $ ID        : chr  "ID1" "ID2" "ID3" "ID4" ...
 $ A_B.Age   : num  0.266 0.372 0.573 0.908 0.202 ...
 $ A_B.Points: num  0.398 -0.612 0.341 -1.129 1.433 ...
 $ A_C.Age   : num  0.6737 0.0949 0.4926 0.4616 0.3752 ...
 $ A_C.Points: num  0.409 1.689 1.587 -0.331 -2.285 ...
 $ B_C.Age   : num  0.814 0.929 0.147 0.75 0.976 ...
 $ B_C.Points: num  1.474 0.677 0.38 -0.193 1.578 ...

SOLUTION:

The solution picks up on the column names in dfFormat, matching the columns by which you want to convert the dataframe into a list of dataframes using regex:

listFormat <-  mget(ls(pattern = "^A_B|^A_C|^B_C"))

RESULT:

str(listFormat)
List of 3
 $ A_B:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
  ..$ Points: num [1:100] 0.398 -0.612 0.341 -1.129 1.433 ...
 $ A_C:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.6737 0.0949 0.4926 0.4616 0.3752 ...
  ..$ Points: num [1:100] 0.409 1.689 1.587 -0.331 -2.285 ...
 $ B_C:'data.frame':    100 obs. of  3 variables:
  ..$ ID    : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
  ..$ Age   : num [1:100] 0.814 0.929 0.147 0.75 0.976 ...
  ..$ Points: num [1:100] 1.474 0.677 0.38 -0.193 1.578 ...
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
  • 3
    Could the person who downvoted please leave a hint at why they did? – Chris Ruehlemann Feb 16 '20 at 15:56
  • I am not the downvoter, but I think you probably misunderstood the question so you got a downvote. The question is how to convert `dfFormat` to `listFormat`, not assemble a list from `A_B`, `A_C`, and `B_C`. We have to assume that A_B`, `A_C`, and `B_C` are not in the working space. – www Feb 16 '20 at 16:20
  • 2
    The OP states "I am hoping to come up with an automated way to convert the `dfFormat` to `listFormat`"--that's exactly what my answer does. – Chris Ruehlemann Feb 16 '20 at 16:45
  • As I stated, we have no idea if `A_B`, `A_C`, and `B_C` are in OP's working space. We have to assume they are not there and the solution needs to start with `dfFormat`. This is my interpretation. You can wait for OP's clarification if he or she wants to. – www Feb 16 '20 at 16:48
  • 1
    I agree with www. I didn't downvote. But the OP was after a way to convert a "single" data frame into a list of data frames. You've provided a nice way to convert "multiple" data frames into a list of data frames. You need to start with `dfFormat`. Read the title of his question again. – Edward Feb 16 '20 at 17:17
  • 3
    Thanks for your comments, both @Edward and @www. My solution does start from a single dataframe, namely OP's data frame `dfFormat`. If that dataframe is in the working space, the solution provided works. – Chris Ruehlemann Feb 16 '20 at 17:21
  • @ChrisRuehlemann I respectfully disagree. Your solution started with `A_B`, `A_C`, and `B_C`, not `dfFormat`. If you showed how you created `A_B`, `A_C`, and `B_C` from `dfFormat`, then I agree with you that you started with `dfFormat`. Please notice that in the real world OP may not have `A_B`, `A_C`, and `B_C`, only `dfFormat`. – www Feb 16 '20 at 17:26
  • 3
    @www The columns names of `dfFormat` contain the strings `A_B`, `A_C`, and `B_C`; using regex they can be matched. – Chris Ruehlemann Feb 16 '20 at 17:32
  • @ChrisRuehlemann I tested your solution. It seems like if we remove `A_B`, `A_C`, and `B_C` first by `rm(A_B)`, `rm(A_C)`, and `rm(B_C)`, your solution will get an empty list. This means your solution is matching the data frames `A_B`, `A_C`, and `B_C`, not the column names in `dfFormat`. – www Feb 16 '20 at 17:37
  • 3
    @www Sorry, disagree. If I do `rm(A_B); rm(A_C); rm(B_C); rm(dfFormat); rm(listFormat)` and start from scratch it all works fine. – Chris Ruehlemann Feb 16 '20 at 17:44
  • @Chris. Start from `dfFormat` only. Nothing else in workspace. Does your solution work? – Edward Feb 16 '20 at 17:49
  • I don't understand how your`ls` can match the column names in `dfFormat`. If you can show that, I am willing to give you an upvote. – www Feb 16 '20 at 18:00
  • I also did not downvote any of these answers. They have all been helpful; I am learning lots of new techniques. I should have stated clearly that, in my mind, I was hoping to find a solution that did not start with `A_B`, `A_C`, and 'B_C` in the working space. But this solution can certainly still be helpful. – lavenderGem Feb 17 '20 at 03:59
  • @ everybody involved in this lengthy discussion: I feel I owe an apology. I *did* overlook the fact that I should have removed the dataframes `A_B`, `A_C`, and `B_C` **after** defining them; thus the df's *were* of course in the working space. So what my solution provides is, as @Edward noted, a way to convert "multiple" data frames into a list of data frames. – Chris Ruehlemann Feb 17 '20 at 14:03