How can I tidy a very messy long format data set using tidyverse or base-R functions?

Question

I have a messy and confusing long-format data set. I started using R only recently and have not mastered it yet, so I need some guidance.

My participants went through different phases in an experiment. In phase a, they rated images. In phase b, they saw some images with different affects. In phase c, they rated the images they saw in phase b. Responses, affect information, and the images that participants rated are stored in separate columns. My aim is to analyze responses by image affect (no affect, positive, negative), and I want to know which image each response belongs to.

The problem is that when a phase is over, the last value entered is copied onto the following rows (these copies should be omitted), and some columns contain NAs where there is no earlier value for the program to copy.

A simplified version of this dataset looks like this:


      > df
    id phase phase.a.response phase.c.response phase.a.pic
1   1     a                1               NA       x.jpg
2   1     a                2               NA       y.jpg
3   1     a                3               NA       z.jpg
4   1     a               10               NA       d.jpg
5   1     b               10               NA       d.jpg
6   1     b               10               NA       d.jpg
7   1     b               10               NA       d.jpg
8   1     b               10               NA       d.jpg
9   1     c               10                5       d.jpg
10  1     c               10                4       d.jpg
11  1     c               10                2       d.jpg
12  1     c               10                1       d.jpg
      phase.b.pic pic.affect phase.c.pic
1         <NA>       <NA>        <NA>
2         <NA>       <NA>        <NA>
3         <NA>       <NA>        <NA>
4         <NA>       <NA>        <NA>
5        m.jpg   positive        <NA>
6        n.jpg   negative        <NA>
7        p.jpg   positive        <NA>
8        r.jpg   negative        <NA>
9        r.jpg   negative       n.jpg
10       r.jpg   negative       p.jpg
11       r.jpg   negative       r.jpg
12       r.jpg   negative       m.jpg

 data$response[data$phase=="a"]<-data$phase.a.response
 data$response[data$phase=="b"]<-data$phase.b.response

I tried to create a new variable like the one above, but it did not work because of the NAs (or because my code does not make sense).

Ideally, I want to be able to subset my data by phase, with my responses in one column, the phase in one column (which I already have in the data), the corresponding images in one column, and the corresponding image affects in another column (phase a should have no affect).

`data_a <- subset(data, phase == "a")`? – bob1, May 07 '19 at 15:44

Wimpel · Answer 1 · 2019-05-07T16:06:14.993

A desired output would most certainly help...

Here's a first go using data.table

sample data

library(data.table)
DT <- fread( "id  phase  phase.a.response  phase.c.response  phase.a.pic      phase.b.pic  pic.affect  phase.c.pic
1     a                1               NA       x.jpg1         <NA>       <NA>        <NA>
1     a                2               NA       y.jpg2         <NA>       <NA>        <NA>
1     a                3               NA       z.jpg3         <NA>       <NA>        <NA>
1     a               10               NA       d.jpg4         <NA>       <NA>        <NA>
1     b               10               NA       d.jpg5        m.jpg   positive        <NA>
1     b               10               NA       d.jpg6        n.jpg   negative        <NA>
1     b               10               NA       d.jpg7        p.jpg   positive        <NA>
1     b               10               NA       d.jpg8        r.jpg   negative        <NA>
1     c               10                5       d.jpg9        r.jpg   negative       n.jpg
1     c               10                4       d.jpg10       r.jpg   negative       p.jpg
1     c               10                2       d.jpg11       r.jpg   negative       r.jpg
1     c               10                1       d.jpg12       r.jpg   negative       m.jpg
", na.strings = c("NA", "<NA>"))

code

# add row ids so the two melted tables can be joined back together
DT[, row := seq_along(id) ]

# melt the response columns; phase2 holds the phase letter taken from the
# original column name, and only rows where it equals the real phase are kept
ans.response <- melt(DT, id.vars = c("row", "id","phase"), 
     measure.vars = patterns(response = "\\.response$"),
     variable.factor = FALSE,
     variable.name = "phase2",
     value.name = "response")[, phase2 := gsub("^phase\\.(.)\\.response", "\\1", phase2)][phase == phase2,][, phase2 := NULL]

# melt the pic columns the same way (note the escaped dot in "\\.pic$")
ans.pic <- melt(DT, id.vars = c("row", "id","phase"), 
                     measure.vars = patterns(pic = "\\.pic$"),
                     variable.factor = FALSE,
                     variable.name = "phase2",
                     value.name = "pic")[, phase2 := gsub("^phase\\.(.)\\.pic", "\\1", phase2)][phase == phase2,][, phase2 := NULL]

# join: every row of ans.pic is kept, so phase b rows get an NA response
ans.response[ans.pic, on = .(row,id,phase)]

output

#     row id phase response    pic
#  1:   1  1     a        1 x.jpg1
#  2:   2  1     a        2 y.jpg2
#  3:   3  1     a        3 z.jpg3
#  4:   4  1     a       10 d.jpg4
#  5:   5  1     b       NA  m.jpg
#  6:   6  1     b       NA  n.jpg
#  7:   7  1     b       NA  p.jpg
#  8:   8  1     b       NA  r.jpg
#  9:   9  1     c        5  n.jpg
# 10:  10  1     c        4  p.jpg
# 11:  11  1     c        2  r.jpg
# 12:  12  1     c        1  m.jpg
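
Since the question asks for tidyverse or base R, here is a rough base-R sketch of the same idea that also fills in the affect column. It rebuilds the mock data from the question, copies each phase's own columns into `response` and `pic` by logical indexing, and looks up the affect of each phase c image from the phase b rows of the same participant. The names `tidy`, `is_a`, `is_b`, `is_c` and `key_b`, and the `"no-affect"` label for phase a, are my own choices and not from the question. The lookup also assumes that each image is shown only once per participant in phase b.

```r
# Mock data copied from the question (one participant, three phases).
df <- data.frame(
  id = 1,
  phase = rep(c("a", "b", "c"), each = 4),
  phase.a.response = c(1, 2, 3, 10, rep(10, 8)),
  phase.c.response = c(rep(NA, 8), 5, 4, 2, 1),
  phase.a.pic = c("x.jpg", "y.jpg", "z.jpg", rep("d.jpg", 9)),
  phase.b.pic = c(rep(NA, 4), "m.jpg", "n.jpg", "p.jpg", rep("r.jpg", 5)),
  pic.affect = c(rep(NA, 4), "positive", "negative", "positive",
                 rep("negative", 5)),
  phase.c.pic = c(rep(NA, 8), "n.jpg", "p.jpg", "r.jpg", "m.jpg"),
  stringsAsFactors = FALSE
)

# Logical index for each phase.
is_a <- df$phase == "a"
is_b <- df$phase == "b"
is_c <- df$phase == "c"

tidy <- df[, c("id", "phase")]

# Take each value only from the rows of its own phase, so the
# carried-forward copies are never used. Both sides of each assignment
# are indexed, which keeps their lengths equal.
tidy$response <- NA_real_
tidy$response[is_a] <- df$phase.a.response[is_a]
tidy$response[is_c] <- df$phase.c.response[is_c]

tidy$pic <- NA_character_
tidy$pic[is_a] <- df$phase.a.pic[is_a]
tidy$pic[is_b] <- df$phase.b.pic[is_b]
tidy$pic[is_c] <- df$phase.c.pic[is_c]

# Affect: fixed label for phase a (assumed label), the recorded affect in
# phase b, and for phase c a lookup of the same image in the phase b rows
# of the same participant.
key_b <- paste(df$id, df$phase.b.pic)[is_b]
key_c <- paste(df$id, df$phase.c.pic)[is_c]

tidy$affect <- NA_character_
tidy$affect[is_a] <- "no-affect"
tidy$affect[is_b] <- df$pic.affect[is_b]
tidy$affect[is_c] <- df$pic.affect[is_b][match(key_c, key_b)]

tidy
```

After this, `subset(tidy, phase == "c")` should give the phase c ratings together with their image and affect.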

Thank you for your response! When I run this code I get variables in my environment, but all the observations that are supposed to be copied are gone (I see 0 obs. of 4 variables for "ans.pic" and "ans.response" in the environment). – gundedun, May 08 '19 at 12:44
The mock data I put here resembles my original dataset. I have more columns and more measures, but these are the ones that I am interested in for now. The code runs without a problem, but the RStudio environment says "No data available in table", and I am not sure how to make this issue reproducible. That being said, I am trying to see if I made a mistake in R while changing my variable names. May I ask what "phase2" stands for in the code you suggested? – gundedun, May 08 '19 at 14:19
What I meant was the first answer here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example --- `phase2` is a dummy variable I join on, and delete later on. – Wimpel, May 08 '19 at 17:16
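
To illustrate what `phase2` holds, here is a small sketch using only base R's `gsub`. It applies the same substitution as the answer to the response column names from the mock data. It also shows one possible (unconfirmed) reason for an empty result: the filter `phase == phase2` keeps a row only when the `phase` value is exactly the letter extracted from the column name. The labels in `real_phase` are hypothetical and not taken from the question.

```r
# Column names as they appear in the mock data.
cols <- c("phase.a.response", "phase.c.response")

# The same substitution the answer applies to phase2:
# keep only the phase letter from each column name.
phase2 <- gsub("^phase\\.(.)\\.response", "\\1", cols)
phase2

# Rows survive the filter only where phase equals phase2. If the real data
# codes its phases differently (hypothetical labels below), nothing matches
# and the melted tables end up with 0 rows.
real_phase <- c("Phase A", "Phase C")  # hypothetical, not from the question
any(real_phase == phase2)
```

Comparing `unique(data$phase)` with the letters in the real column names would be a quick way to check whether this applies.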
