0

-------------------NEW POST: I've posted incorrect example of my data in past (leaving it below). In reality my data has repetitive "Modules" under same column and previous solution doesn't work for my problem.

My example data (current dataset):

Year <- c("2013", "2020", "2015", "2012")
Grade <- c(28, 39, 76, 54)
Code <- c("A", "B", "C", "A")
Module1 <- c("English", "English", "Science", "English")
Results1 <- c(45, 58, 34, 54)
Module2 <- c("History", "History", "History", "Art")
Results2 <- c(12, 67, 98, 45)
Module3 <- c("Art", "Geography", "Math", "Geography")
Results3 <- c(89, 84, 45, 67)
Module14 <- c("Math", "Math", "Geography", "Art")
Results14 <- c(89, 24, 95, 67)
Module15 <-c("Science", "Art", "Art", "Science")
Results15 <-c(87, 24, 25, 67)

daf <- data.frame(Id, Year, Grade, Code, Module1, Results1, Module2, Results2, Module3, Results3, Module14, Results14, Module15, Results15)

My target - dataset I need to achieve:

Year <- c("2013", "2020", "2015", "2012")
Grade <- c(28, 39, 76, 54)
Code <- c("A", "B", "C", "A")
English <- c(45, 58,NA,54)
Math <- c(89, 24,45, NA)
Science <- c(87, NA, 34, 67)
Geography <- c(NA, 84, 95,67)
Art <- c(89,24,25,45)

wished_df <- data.frame(Id, Year, Grade, Code, English, Math, Science,Geography, Art)

Thanks again for any help!

-------------------------------- OLD POST: I am trying to reshape my current data to new format.

Module1 <- c("English", "Math", "Science", "Geography")
Results1 <- c(45, 58, 34, 54)
Module2 <- c("Math", "History", "English", "Art")
Results2 <- c(12, 67, 98, 45)
Module3 <- c("History", "Art", "English", "Geography")
Results3 <- c(89, 84, 45, 67)

daf <- data.frame(Module1, Results1, Module2, Results2, Module3, Results3)

What I need is module names set as ‘variable names’, and module results set as ‘values for variable names’, looking like:

English1 <- c(45, 98, 45)
Math1 <- c(58, 12, NA)
Science1 <- c(34, NA, NA)
Geography1 <- c(54,NA, 67)
Art1 <- c(NA, 45, 84)

wished_df <- data.frame(English1, Math1, Science1,Geography1, Art1)

Thank you for any ideas.

boxi
  • 3
  • 3
  • Use `tidyr::pivot_wider` – AnilGoyal Feb 20 '21 at 17:21
  • 1
    Does this answer your question? [How to reshape data from long to wide format](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) – AnilGoyal Feb 20 '21 at 17:24
  • 1
    @Anil, That question doesn't come near the complexity of this one. This one has multiple column groups and requires converting to long form and then back to a different wide form. – G. Grothendieck Feb 20 '21 at 23:36

2 Answers2

3

1) reshape Using the data in the Note at the end, split the input column names into two groups (Module columns and Results columns) giving varying. Using that reshape to long form where varying= defines which columns in the input correspond to a single column in the long form. v.names= specifies the names to use for each of the two columns produced from the varying columns. reshape will give a data frame with columns time, Module, Result and id columns. We don't need the id column so drop it using [-4].

Then reshape that back to the new wide form. idvar= specifies the source of the output rows and timevar= specifies the source of the output columns. Everything else is the body of the result. reshape will generate a time column which we don't need so remove it using [-1]. At the end we remove the junk part of each column name.

No packages are used.

varying <- split(names(daf), sub("\\d+$", "", names(daf)))
long <- reshape(daf, dir = "long", varying = varying, v.names = names(varying))[-4]
wide <- reshape(long, dir = "wide", idvar = "time", timevar = "Module")[-1]
names(wide) <- sub(".*[.]", "", names(wide))  

giving:

> wide
    English Math Science Geography History Art
1.1      45   58      34        54      NA  NA
1.2      98   12      NA        NA      67  45
1.3      45   NA      NA        67      89  84

2) pivot_ Using the data in the Note at the end, specify that all columns are to be used and using .names specify that the column names in long form are taken from the first portion of the column names of the input where the names of the input are split according to the names_pattern= regular expression. Then pivot to a new wide form where the column names are taken from the Module column and the values in the body of the result are taken from the Results column. The index column will define the rows and can be omitted afterwards.

library(dplyr)
library(tidyr)

daf %>%
  pivot_longer(everything(), names_to = c(".value", "index"), 
    names_pattern = "(\\D+)(\\d+)") %>%
  pivot_wider(names_from = Module, values_from = Results) %>%
  select(-index)

giving:

# A tibble: 3 x 6
  English  Math History   Art Science Geography
    <dbl> <dbl>   <dbl> <dbl>   <dbl>     <dbl>
1      45    58      NA    NA      34        54
2      98    12      67    45      NA        NA
3      45    NA      89    84      NA        67

3) unlist/tapply UUsing the data in the Note at the end, another base solution can be fashioned by separately unlisting the Module and Results columns to get the long form and using tapply to convert to wide form. No packages are used

is_mod <- grepl("Module", names(daf))
long <- data.frame(Module = unlist(daf[is_mod]), Results = unlist(daf[!is_mod]))
tab <- tapply(long$Results, list(sub("\\d+$", "", rownames(long)), long$Module), sum)
as.data.frame.matrix(tab)

giving:

        Art English Geography History Math Science
Module1  NA      45        54      NA   58      34
Module2  45      98        NA      67   12      NA
Module3  84      45        67      89   NA      NA

Note

Module1 <- c("English", "Math", "Science", "Geography")
Results1 <- c(45, 58, 34, 54)
Module2 <- c("Math", "History", "English", "Art")
Results2 <- c(12, 67, 98, 45)
Module3 <- c("History", "Art", "English", "Geography")
Results3 <- c(89, 84, 45, 67)
daf <- data.frame(Module1, Results1, Module2, Results2, Module3, Results3)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Have changed all three to use \\d+ to handle any number of digits. Try them now. – G. Grothendieck Feb 20 '21 at 23:19
  • Thank you so much G. Grothendieck, all of your suggestions are working on my test data. However, I have problem applying it to my real df. Could you please advise how to cover my Modules and Results that have 2 numbers in title (example: Module10, Results10...). I am thinking about this line: varying <- split(names(daf), sub(".$", "", names(daf))) It does what you say, but I need to change something in sub(".$", "", part so my Module10, Module11 and Module12... also fall into the new Module group. – boxi Feb 20 '21 at 23:24
0

A data.table version:

library(data.table)
library(magrittr)
dt <- as.data.table(daf)
dt %>%
  melt.data.table(measure.vars = patterns("^Module", "^Result")) %>%
  dcast.data.table(variable ~ ..., value.var = "value2")

giving:

Key: <variable>
   variable   Art English Geography History  Math Science
     <fctr> <num>   <num>     <num>   <num> <num>   <num>
1:        1    NA      45        54      NA    58      34
2:        2    45      98        NA      67    12      NA
3:        3    84      45        67      89    NA      NA
Rain Song
  • 91
  • 6
  • Put `%>% .[, -1]` into the end of the code if you don't want keep the first column, `variable` – Rain Song Feb 21 '21 at 00:21
  • Thank you Rain Song, your solutions works perfect as well. However, I am possibly missing to do something when applying your and G. Grothendieck's solutions. They work perfect on my test data, and they work on my large dataset but the new (wide) df is holding only 12 rows instead of mine 956 rows from initial df. – boxi Feb 22 '21 at 22:04
  • That is weird. Maybe you try this: ```dt %>% melt.data.table(measure.vars = patterns("^Module", "^Result")) %>% dcast.data.table(variable ~ value1, value.var = "value2")``` . Otherwise, I probably need your raw data to see what the problem is. – Rain Song Feb 23 '21 at 14:39