1

I have a situation where I need to reformat a dataframe where each row contains an ID value, three years (stored in columns), and three predictions (stored in columns). If the year or prediction is not used then it contains a dummy value (-9999).

Here is an example dataframe:

# Example initial dataframe
sample = data.frame(
  id = c(1,2,3),
  year1 = c(2005, 2005, 2005),
  year2 = c(2009, 2010, 2010),
  year3 = c(2010, -9999, -9999),
  pred1 = c(0, 0, 0),
  pred2 = c(0, 1, 0),
  pred3 = c(1, -9999, -9999)
)

My goal is to efficiently reformat the data frame into one where each row contains the ID, the current year, and the prediction (1/0):

The following code demonstrates the desired output dataframe:

# Code produces desired output format
desired_format = data.frame(
  id = rep(1:3, each=6),
  year = rep(seq(2005, 2010, 1), 3),
  pred = c(c(0,0,0,0,0,1), c(0,1,1,1,1,1), c(0,0,0,0,0,0))
)

Here is a screenshot of the desired format from my R console.

Here is a screenshot of the desired format from my R console.

I've come with with a rather elaborate way of doing this with for-loops. However, I've been unable to come up with dplyr or other data management library -based solution to make reformatting more efficient and more maintainable for anyone looking any my code in the future.

Syfer
  • 4,262
  • 3
  • 20
  • 37
John K.
  • 113
  • 1
  • 7
  • `sample %>% gather(year_var, year, num_range('year', 1:3)) %>% gather(pred_var, pred, num_range('pred', 1:3)) %>% filter(pred != -9999, year != -9999) %>% select(-pred_var, -year_var)`? It's unclear where the values for the intermediary years are coming from – alistaire Mar 13 '18 at 01:03
  • The challenge here is not just the reshaping between the wide and long formats but the need to sequence between the reformatted rows to interpolate the intermediary rows. I'll try and use some of the information form your post to get closer to my desired output! – John K. Mar 13 '18 at 19:39
  • So, I don't see where I can post a solution but this code produces the desired answer: wide_format = sample %>% gather(variable, value, -id) %>% mutate(group = readr::parse_number(variable)) %>% mutate(variable = gsub("\\d","",x = variable)) %>% spread(variable, value) %>% filter(pred != -9999, year != -9999) %>% select(-group) final_format = wide_format %>% group_by(id) %>% expand(year = full_seq(year, 1)) %>% left_join(., wide_format) %>% fill(pred, .direction = "up") – John K. Mar 20 '18 at 15:35

0 Answers0