Split data frame based on column name pattern and then rebind into 1 data frame

Question

Looking for help on automating a task that I need to do across multiple data sets of multiple types.

Here's some example data:

#create example data
x1_a<-rnorm(3)
x1_b<-rnorm(3)
x2_a<-rnorm(3)
x2_b<-rnorm(3)
x3_a<-rnorm(3)
x3_b<-rnorm(3)

df<-data.frame(x1_a, x1_b, x2_a, x2_b, x3_a, x3_b)

> df
       x1_a       x1_b       x2_a       x2_b       x3_a       x3_b
1 0.2912428  1.2737063 -0.9997475  1.0400489 -0.5323241 -0.2229865
2 2.0147965 -1.0180336  0.1080972  1.1411229  0.4791280  0.6230867
3 0.4189106 -0.6987785 -0.3890773 -0.3292366 -1.6186305  1.3913514

Essentially, this data is in wide format and I need it in long. But I'm not sure how to automate this transformation.

In this data, the numeric value in the column indicates the observation the column belongs to (e.g., x1_a goes with x1_b). The character values indicate what the columns actually are. So, columns matching in characters (e.g., x1_a matches x2_a, they're both "x_a" just from different observations) are the same thing, just repeated observations of that thing.

The long format data would have just the character string variable (here, "x_a" or "x_b"), and would contain the repeated observations of that thing. I'm agnostic as to whether the original numeric value is preserved.

What I'm looking for is an automated way to do the following code across variable length datasets and varying column names (that all match the basic pattern of "x1_a".

#laborious/embarassing, not good way to accomplish goal
df1<-df[,c("x1_a", "x1_b")]
names(df1)<-gsub("[[:digit:]]+", "", names(df1))
df2<-df[,c("x2_a", "x2_b")]
names(df2)<-gsub("[[:digit:]]+", "", names(df1))
df3<-df[,c("x2_a", "x2_b")]
names(df3)<-gsub("[[:digit:]]+", "", names(df1))

#desired data output
df<-do.call(rbind, list(df1, df2, df3))
> df
         x_a        x_b
1  0.2912428  1.2737063
2  2.0147965 -1.0180336
3  0.4189106 -0.6987785
4 -0.9997475  1.0400489
5  0.1080972  1.1411229
6 -0.3890773 -0.3292366
7 -0.9997475  1.0400489
8  0.1080972  1.1411229
9 -0.3890773 -0.3292366

I'm thinking that something automated like this might work: 1) using "grep" to pull out the column names that match in numeric value (e.g., x1_a goes with x1_b), 2) repeating this for as many unique numeric values/groups there are, 3) storing each of those "split" by column name variables in a list, 4) batch renaming the columns in those list elements, and 5) binding them back into a data frame.

Links to previous posts if I've overlooked something are welcome. Thank you in advance for the help!

score 1 · Accepted Answer · answered Jan 01 '17 at 10:22

1

In addition to the answers provided in the link here is a dplyr/tidyr solution:

library(dplyr)
library(tidyr)
gather(df, key, x) %>% 
  mutate(key = factor(grepl("_b", key), labels=c("x_a", "x_b"))) %>%
  group_by(key) %>% 
  mutate(id=row_number()) %>% 
  spread(key, x)

answered Jan 01 '17 at 10:22

adibender

7,288
3
37
41

Works great! Thank you for the demonstration of tidyr, not a package I had been familiar with. I referenced the link to the other post above, had some great suggestions. I ended going with the cbind solution in that post just because the made more intuitive sense to me. Thanks for showing another way to do it and pointing me in the right direction! – griffmer Jan 02 '17 at 02:03

Split data frame based on column name pattern and then rebind into 1 data frame

1 Answers1

Linked