6

I've been provided with a couple of very awkwardly formatted data in excel which I need to reshape so it fits to run a survival analysis in R.

I uploaded an extract of the data to Google drive: https://drive.google.com/open?id=1ret3bCDCYPDALQ16YBloaeopfl2-qVbp
The original data frame has about 2100 observations and 950 variables

Here is the basic data frame:

my.data<-data.frame(
  ID=c( "", "","C8477","C5273","C5566"),
  LR=c("2012Y","State:FL",5,6,8),
  LR=c("2012Y","State:AZ",5,8,10),
  LR=c("2011Y","State:FL",7,2,1)
)

my.data

#     ID       LR     LR.1     LR.2
# 1          2012Y    2012Y    2011Y
# 2       State:FL State:AZ State:FL
# 3 C8477        5        5        7
# 4 C5273        6        8        2
# 5 C5566        8       10        1

All the columns have the same name "LR". I don't know if this will be a problem later...

The Year is given in row 1 and the according state the observation happened in row 2.

As output I need to have some panel data that I work with in later survival analysis.

my.data<-data.frame(
  ID=c("C8477","C5273","C5566"),
  Year=c("2012","2012","2011"), 
  State=c("FL","AZ","FL"),LR=c(5,8,1)
) 

my.data

#     ID Year State LR
# 1 C8477 2012    FL  5
# 2 C5273 2012    AZ  8
# 3 C5566 2011    FL  1

I played around with the reshape function and seq functions, but non of that will help me move in the right direction, as the data frame is so oddly arranged.

Joris C.
  • 5,721
  • 3
  • 12
  • 27
Juan
  • 171
  • 1
  • 12
  • I have removed `120046` value from your example. – Bulat Oct 13 '19 at 21:20
  • This could be of help, lo https://stackoverflow.com/questions/23233606/how-to-best-reshape-a-data-set-in-r-that-has-a-two-row-header – Bulat Oct 13 '19 at 21:26

3 Answers3

2

Here is the reshape2 and tidyr version of achieving this:

library(tidyr)
library(reshape2)

my.data <- data.frame(
  ID=c( "", "","C8477","C5273","C5566"),
  LR=c("2012Y","State:FL",5,6,8),
  LR=c("2012Y","State:AZ",5,8,10),
  LR=c("2011Y","State:FL",7,2,1)
)

# Combine first two rows as column names
colnames(my.data) <- paste(unlist(my.data[2, ]), unlist(my.data[1, ]), sep = "|")
# Remove first two rows from data
my.data <- my.data[-c(1:2), ] # negative index removes rows

# Melt data
my.data.long <- melt(
  my.data, 
  id.vars = 1L, # would be better to have explicit col name
  value.name = "LR" 
)
colnames(my.data.long) <- c("ID", "state_year", "LR")

# Split state_year column into two columns:
my.data.long <- separate(
  my.data.long, 
  state_year, 
  into = c("State", "Year"), 
  sep = "\\|" # note this is a regex
)

Idea was borrowed here.

Bulat
  • 6,869
  • 1
  • 29
  • 52
2

This is a tidyverse approach:

my.data <- data.frame(
  ID=c( "", "","C8477","C5273","C5566"),
  LR=c("2012Y","State:FL",5,6,8),
  LR=c("2012Y","State:AZ",5,8,10),
  LR=c("2011Y","State:FL",7,2,1)
)

my code:

library(tidyverse)
year <- as.matrix(my.data[1, -1])
year <- str_split(year, "Y", simplify = T)[,1]
state <-as.matrix(my.data[2, -1])
both<-paste(state, year, sep = "_")
mydata1<-my.data[-c(1, 2), ]
colnames(mydata1) <-c("ID", both)
long <-pivot_longer(mydata1, 
             cols = starts_with("state"),
             names_to = "State_year",
             values_to = "LR")
long %>%
  transmute(
    ID, LR, 
    state = str_split(State_year, "_", simplify = T)[, 1],
    state = str_split(state, ":", simplify = T)[, 2], 
    year = str_split(State_year, "_", simplify = T)[, 2]
)

We get:

  ID    LR    state year 
1 C8477 5     FL    2012 
2 C8477 5     AZ    2012 
3 C8477 7     FL    2011 
4 C5273 6     FL    2012 
5 C5273 8     AZ    2012 
6 C5273 2     FL    2011 
7 C5566 8     FL    2012 
8 C5566 10    AZ    2012 
9 C5566 1     FL    2011  
Zhiqiang Wang
  • 6,206
  • 2
  • 13
  • 27
  • Thanks! Unfortunately, the function "pivot_longer" is not found in my version of R. No idea why. But the rest helps plenty. – Juan Oct 15 '19 at 16:00
  • Check https://www.tidyverse.org/articles/2019/09/tidyr-1-0-0/: “New pivot_longer() and pivot_wider() provide improved tools for reshaping, superceding spread() and gather(). The new functions are substantially more powerful, ...” – Zhiqiang Wang Oct 15 '19 at 18:08
1

A possible base-R approach using reshape:

## 1) extract the State and Year rows in a separate data.frame
dat.meta <- data.frame(
    Year = unlist(dat[1, -1]),
    State = sub("State:", "", dat[2, -1]),
    LR.id = colnames(dat)[-1]
)

## 2) reshape the data without State and Year rows into long format
dat.long <- reshape(
    data = dat[-c(1, 2), ], 
    varying = 2:4,
    direction = "long",
    v.names = "LR",
    timevar = "LR.id",
    times = colnames(dat)[-1]
)

dat.long$id <- NULL  ## remove id-column   

## 3) merge the reshaped data with the extracted State and Year data
merge(dat.long, dat.meta, by = "LR.id")

#>   LR.id    ID LR  Year State
#> 1    LR C8477  5 2012Y    FL
#> 2    LR C5273  6 2012Y    FL
#> 3    LR C5566  8 2012Y    FL
#> 4  LR.1 C8477  5 2012Y    AZ
#> 5  LR.1 C5273  8 2012Y    AZ
#> 6  LR.1 C5566 10 2012Y    AZ
#> 7  LR.2 C8477  7 2011Y    FL
#> 8  LR.2 C5273  2 2011Y    FL
#> 9  LR.2 C5566  1 2011Y    FL

Data

dat <- data.frame(
    ID=c( "", "","C8477","C5273","C5566"),
    LR=c("2012Y","State:FL",5,6,8),
    LR=c("2012Y","State:AZ",5,8,10),
    LR=c("2011Y","State:FL",7,2,1),
    stringsAsFactors = FALSE
)
Joris C.
  • 5,721
  • 3
  • 12
  • 27