0

I have this dataset, that is formatted in a weird way.

It kind of looks like this:

          1971 1971 1971 1972 1972 1972
          var1 var2 var3 var1 var2 var3
person1     37    2    1   65    5    3   
person2     65    2    1  123    3    1   
person3     23    3    1   13    6    2   

I'm wondering if there's any easy way to turn this data into something that looks like:

         year  var1 var2 var3
person1  1971    37    2    1   
person1  1972    65    5    3   
person2  1971    65    2    1 
person2  1972   123    3    1
person3  1971    23    3    1
person3  1972    13    6    2

I think what I'm looking to do is reverse reshape data or something. I have looked into melt, cast and transpose, but I couldnt get any useable results.

Atom Vayalinkal
  • 2,642
  • 7
  • 29
  • 37

2 Answers2

0

Here is my attempt to figure out what you need. Modify it to your will if something is not precise. I used 3 libraries, but don't worry. In R these are very often come together and are good to know for future anyway. I could have written the code with basic R but that would have meant much longer code

input.csv

,1971,1971,1971,1972,1972,1972
,var1,var2,var3,var1,var2,var3
person1,37,2,1,65,5,3
person2,65,2,1,123,3,1
person3,23,3,1,13,6,2

Code to modify representation

library(reshape2)
library(tidyr)
library(dplyr)

input = read.table("input.csv", sep=",", na.strings="", header=T)[-1,]
converted_input = input %>%
  tidyr::gather(year, value, -X) %>%
  dplyr::mutate(
    var=paste0("var", as.numeric(gsub("^X.*", "0", gsub(".*\\.([0-9])$", "\\1", year)))+1),
    year=gsub("X([^.]+).*", "\\1", year)) %>%
  reshape2::dcast(X + year ~ var, value.var="value") %>%
  dplyr::rename(person=X)

print(converted_input)

Final result

 person year var1 var2 var3
person1 1971   37    2    1
person1 1972   65    5    3
person2 1971   65    2    1
person2 1972  123    3    1
person3 1971   23    3    1
person3 1972   13    6    2
Sergej Andrejev
  • 9,091
  • 11
  • 71
  • 108
0

This is a format difficult to work with because the column header information in the original file is from two rows. My approach would be read the all the data in except the first row, and then read the first row later for data manipulation.

Step 1: Read the data except the first row

dat <- read.table(text = "         1971 1971 1971 1972 1972 1972
          var1 var2 var3 var1 var2 var3
                  person1     37    2    1   65    5    3   
                  person2     65    2    1  123    3    1   
                  person3     23    3    1   13    6    2",
                  header = TRUE, stringsAsFactors = FALSE, skip = 1)
dat
#         var1 var2 var3 var1.1 var2.1 var3.1
# person1   37    2    1     65      5      3
# person2   65    2    1    123      3      1
# person3   23    3    1     13      6      2

Step 2: Read the first row as a vector

Year <- read.table(text = "         1971 1971 1971 1972 1972 1972
          var1 var2 var3 var1 var2 var3
                   person1     37    2    1   65    5    3   
                   person2     65    2    1  123    3    1   
                   person3     23    3    1   13    6    2",
                   header = FALSE, stringsAsFactors = FALSE, nrows = 1)

# Convert Year to a vector
Year <- unlist(Year)
Year
#   V1   V2   V3   V4   V5   V6 
# 1971 1971 1971 1972 1972 1972

Step 3: Combine the header information

library(tidyverse)

# Rename the dat dat data frame
dat2 <- dat %>%
  setNames(paste(names(.), Year, sep = "_")) %>%
  setNames(sub("\\.\\d+", "", names(.)))
dat2
#         var1_1971 var2_1971 var3_1971 var1_1972 var2_1972 var3_1972
# person1        37         2         1        65         5         3
# person2        65         2         1       123         3         1
# person3        23         3         1        13         6         2

Now dat2 is a format that we can work with.

Step 4: Reshape the data frame

# Reshape the dtaa frame
dat3 <- dat2 %>%
  rownames_to_column(var = "Person") %>%
  gather(Variable, Value, -Person) %>%
  separate(Variable, into = c("Var", "Year")) %>%
  spread(Var, Value)
dat3
#    Person Year var1 var2 var3
# 1 person1 1971   37    2    1
# 2 person1 1972   65    5    3
# 3 person2 1971   65    2    1
# 4 person2 1972  123    3    1
# 5 person3 1971   23    3    1
# 6 person3 1972   13    6    2

dat3 is the final output.

www
  • 38,575
  • 12
  • 48
  • 84