2

I have a data frame with multiple columns and multiple rows, which looks like:

                V1       V2        V3         V4      V5       V6
  1             1         2         3         13      14       15
  2             4         5         6         16      NA       NA 
  3             7         8         9         19      20       21 
  4             10        11        12        22      23       24

And I want to reshape it into:

                V1       V2        V3       
  1             1         2         3         
  2             4         5         6         
  3             7         8         9         
  4             10        11        12       
  5             13        14        15
  6             16        NA        NA 
  7             19        20        21 
  8             22        23        24

In the original data.frame, keep every 3 columns as a group, such that (V1, V2, V3) is group1, (V4, V5, V6) is group2, etc. Then move group2 - without the order of values changing - to the end of group1, and move group3 to the end of group2.

I tried:

  as.data.frame(matrix(unlist(mydata, use.names=FALSE), ncol=3, byrow=TRUE))

but have the value order problem.

How can I get my desired data structure?

A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
Duerna
  • 33
  • 1
  • 8

5 Answers5

4

You've already noticed that unlist gives you the values column-wise:

unlist(df[1:3], use.names = FALSE)
##  [1]  1  4  7 10  2  5  8 11  3  6  9 12

To get the values row-wise, you can use the c(t(...)) idiom:

c(t(df[1:3]))
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12

That would allow you to solve the problem in base R using:

as.data.frame(matrix(c(t(df[1:3]), t(df[4:6])), ncol = 3, byrow = TRUE))
##   V1 V2 V3
## 1  1  2  3
## 2  4  5  6
## 3  7  8  9
## 4 10 11 12
## 5 13 14 15
## 6 16 NA NA
## 7 19 20 21
## 8 22 23 24

Generalized as a function, you can try something like:

splitter <- function(indf, ncols) {
  if (ncol(indf) %% ncols != 0) stop("Not the right number of columns to split")
  inds <- split(sequence(ncol(indf)), c(0, sequence(ncol(indf)-1) %/% ncols))
  temp <- unlist(lapply(inds, function(x) c(t(indf[x]))), use.names = FALSE)
  as.data.frame(matrix(temp, ncol = ncols, byrow = TRUE))
}
splitter(df, 3)

A more flexible "data.table" approach would be something like the following:

library(data.table)
rbindlist(split.default(as.data.table(df), 
                        c(0, sequence(ncol(df)-1) %/% 3)), 
          use.names = FALSE)
##    V1 V2 V3
## 1:  1  2  3
## 2:  4  5  6
## 3:  7  8  9
## 4: 10 11 12
## 5: 13 14 15
## 6: 16 NA NA
## 7: 19 20 21
## 8: 22 23 24
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • 1
    Thank you very much! I have tried the last two code with my real-world large data frame, it works amazing! – Duerna Dec 21 '17 at 05:02
  • @Duerna, Glad to help! `c` and `matrix` and so on should be very fast, and don't require any packages. `data.table` gives you the advantage of working with a large dataset.... – A5C1D2H2I1M1N2O1R2T1 Dec 21 '17 at 05:07
2

I was surprised that no one mentions split.default, which also works on data with more columns:

x <- split.default(df, ceiling(seq_along(df) / 3 ))
do.call(rbind, lapply(x, setNames, names(x[[1]])))

#     V1 V2 V3
# 1.1  1  2  3
# 1.2  4  5  6
# 1.3  7  8  9
# 1.4 10 11 12
# 2.1 13 14 15
# 2.2 16 NA NA
# 2.3 19 20 21
# 2.4 22 23 24

Add make.row.names = FALSE to get rid of the odd row names:

do.call(rbind, c(lapply(x, setNames, names(x[[1]])), list(make.row.names = FALSE)))
#   V1 V2 V3
# 1  1  2  3
# 2  4  5  6
# 3  7  8  9
# 4 10 11 12
# 5 13 14 15
# 6 16 NA NA
# 7 19 20 21
# 8 22 23 24
mt1022
  • 16,834
  • 5
  • 48
  • 71
1

You can solve this problem using data.table:-

df <- data.frame(V1 = c(1, 4, 7, 10), V2 = c(2, 5, 8, 11), V3 = c(3, 6, 9, 12), V4 = c(13, 16, 19, 22), V5 = c(14, NA, 20, 23), V6 = c(15, NA, 21, 24))


library(data.table)
setDT(df)
df1 <- df[, c("V4", "V5", "V6")]
setnames(df1, "V4", "V1")
setnames(df1, "V5", "V2")
setnames(df1, "V6", "V3")
df <- df[, c("V1", "V2", "V3")]
df <- rbind(df, df1)

Output will be:-

   V1 V2 V3
1:  1  2  3
2:  4  5  6
3:  7  8  9
4: 10 11 12
5: 13 14 15
6: 16 NA NA
7: 19 20 21
8: 22 23 24
sm925
  • 2,648
  • 1
  • 16
  • 28
1

A solution using and .

library(dplyr)
library(tidyr)

dt2 <- dt %>%
  gather(Column, Value) %>%
  extract(Column, into = c("Group", "Index"), regex = "([A-Z+])([\\d].*$)",
          convert = TRUE) %>%
  mutate(Index = Index %% 3) %>%
  mutate(Index = ifelse(Index == 0, 3, Index)) %>%
  unite(Column, c("Group", "Index"), sep = "") %>%
  group_by(Column) %>%
  mutate(ID = 1:n()) %>%
  spread(Column, Value) %>%
  select(-ID)
dt2
# # A tibble: 8 x 3
#      V1    V2    V3
# * <int> <int> <int>
# 1     1     2     3
# 2     4     5     6
# 3     7     8     9
# 4    10    11    12
# 5    13    14    15
# 6    16    NA    NA
# 7    19    20    21
# 8    22    23    24

DATA

dt <- read.table(text = "              V1       V2        V3         V4      V5       V6
  1             1         2         3         13      14       15
                 2             4         5         6         16      NA       NA 
                 3             7         8         9         19      20       21 
                 4             10        11        12        22      23       24",
                 header = TRUE)

Update

Here is an example showing the code will also work on larger data frame.

library(dplyr)
library(tidyr)

# Create example data frame
dt <- as_data_frame(matrix(1:60, ncol = 12, byrow = TRUE))

dt2 <- dt %>%
  gather(Column, Value) %>%
  extract(Column, into = c("Group", "Index"), regex = "([A-Z+])([\\d].*$)",
          convert = TRUE) %>%
  mutate(Index = Index %% 3) %>%
  mutate(Index = ifelse(Index == 0, 3, Index)) %>%
  unite(Column, c("Group", "Index"), sep = "") %>%
  group_by(Column) %>%
  mutate(ID = 1:n()) %>%
  spread(Column, Value) %>%
  select(-ID)
dt2
# # A tibble: 20 x 3
#      V1    V2    V3
# * <int> <int> <int>
#  1     1     2     3
#  2    13    14    15
#  3    25    26    27
#  4    37    38    39
#  5    49    50    51
#  6     4     5     6
#  7    16    17    18
#  8    28    29    30
#  9    40    41    42
# 10    52    53    54
# 11     7     8     9
# 12    19    20    21
# 13    31    32    33
# 14    43    44    45
# 15    55    56    57
# 16    10    11    12
# 17    22    23    24
# 18    34    35    36
# 19    46    47    48
# 20    58    59    60
www
  • 38,575
  • 12
  • 48
  • 84
  • Thank you very much! It's works! – Duerna Dec 21 '17 at 02:07
  • @Duerna Please see my updated post. I change the regex to be `([A-Z+])([\\d].*$)`. The reason is I believe the original code can only handle small data frames (column number smaller than 10), which is because the regex will create one digit index even if the original column names have more than one digits (`V11` to be `V1`, `V12` to be `V1`, ...). The new regex can capture that. I also added a data frame with 12 columns to show the code works. – www Dec 21 '17 at 02:14
  • thank you, I have run both code you provided with my original data, and all the two codes could get a result showed "A tibble: 44,400 x 7", it seems all code could handle large data frames. As for why the "dt2" show like " x 7" , not " x 3", I think the reason is my original data header, all the columns header contain same letter : " factor ,x, y" , like " factor1,x1,y1,factor2,x2,y2,factor3,x3,y3,..." . Anyway, thank you very much for you help. – Duerna Dec 21 '17 at 02:49
  • @Duerna `([A-Z+])` is to capture upper case only, so it will not work on your real-world data frame. I did not know your header is more complex than your example data frame. I think for your real-world data, we can use `([A-Za-z].*)` to capture any letters. – www Dec 21 '17 at 03:05
  • Thank you. I will pay attention to describe my data frame. And I tried with ([A-Za-z].*), I got " A tibble: 2,368 x 73", the code captured 73 out of 75 from my real-world data header. The 1st and 2nd code you kindly provided works much better with my real-word data. – Duerna Dec 21 '17 at 03:27
0

Here's a general solution for any number of columns, using dplyr.

Test data data:

# A tibble: 5 x 9
     V1    V2    V3    V4    V5    V6    V7    V8    V9
  <int> <int> <int> <int> <int> <int> <int> <int> <int>
1     1     2     3     4     5     6     7     8     9
2    10    11    12    13    14    15    16    17    18
3    19    20    21    22    23    24    25    26    27
4    28    29    30    31    32    33    34    35    36
5    37    38    39    40    41    42    43    44    45

The code:

for (i in seq(1, ncol(data), by = 3)) {
  if (i == 1) {
    out <- select(data, 1:3)
  } else {
    out <- select(data, i:(i+2)) %>% setNames(names(out)) %>% bind_rows(out, .)
  }
}

The output out:

# A tibble: 15 x 3
      V1    V2    V3
   <int> <int> <int>
 1     1     2     3
 2    10    11    12
 3    19    20    21
 4    28    29    30
 5    37    38    39
 6     4     5     6
 7    13    14    15
 8    22    23    24
 9    31    32    33
10    40    41    42
11     7     8     9
12    16    17    18
13    25    26    27
14    34    35    36
15    43    44    45
Stuart Allen
  • 1,537
  • 1
  • 10
  • 19