0

I'm having issue trying to convert rows into columns and then getting only the latest record (using timestamp). Here is my data set:

enter image description here

Below is the code to generate this data set:

df <- data.frame(id = c("123||wa", "123||wa", "123||wa", "223||sa", "223||sa", "223||sa", "123||wa"),
               questions = c("dish", "car", "house", "dish", "house", "car", "dish"),
               answers = c("pasta", "bmw", "yes", "pizza", "yes", "audi","ravioli" ), 
               timestamp = c("03JUL2014:15:38:11", "07JAN2015:15:22:54", "24MAR2018:12:24:16", "24MAR2018:12:24:16",
               "04AUG2014:12:40:30", "03JUL2014:15:38:11", "05FEB2018:17:23:16"))

The desired output is:

enter image description here

code that generated the output:

output <- data.frame(id = c("123||wa", "223||sa"), dish = c("ravioli", "pizza"), 
                 car = c("bmw", "audi"), house = c("yes", "yes"))

NOTE: As you can see in the original data set, there were multiple rows for the id field. More importantly, there were two rows for id '123||wa' regarding their favourite dish but only their latest answer is wanted in the final output.

Any help would be greatly appreciated. Thanks

markus
  • 25,843
  • 5
  • 39
  • 58
SAJ
  • 368
  • 1
  • 2
  • 14
  • Possible duplicate of [How to reshape data from long to wide format?](https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format) – markus Sep 03 '18 at 13:16

2 Answers2

2

You can do with libraries tidyr and dplyr: first summarize by taking last answer and then transform data.frame:

output <-   df%>%
arrange(id, timestamp) %>%
group_by(id, questions)%>%
summarise(last=last(answers))%>%
spread(questions, last)
Nar
  • 648
  • 4
  • 8
2

Most likely the date_time column should be first converted to the correct type (here using ymd_hms from lubridate and strptime), since the extracted value should correspond to the latest record by date_time. After that several functions from dplyr come in handy

library(lubridate)
library(dplyr)
df %>%
  mutate(timestamp = ymd_hms(strptime(timestamp, "%d%b%Y:%H:%M:%S"))) %>%
  group_by(id, questions) %>%
  arrange(timestamp) %>%
  summarise(last = last(answers)) %>%
  spread(questions, last)

#output
# A tibble: 2 x 4
# Groups: id [2]
  id      car   dish    house
* <fct>   <fct> <fct>   <fct>
1 123||wa bmw   ravioli yes  
2 223||sa audi  pizza   yes  

The ymd_hms(strptime(... part can be replaced with:

mutate(timestamp = parse_date_time(timestamp,  orders = "%d%b%Y:%H:%M:%S"))

see

?strptime

on how to construct the date_time format

missuse
  • 19,056
  • 3
  • 25
  • 47