Removing trials with >50% NA values from dataframe in long format

Question

I have recorded pupil size in response to emotional vs. neutral sounds, which were one of two colours. I am preparing the data for growth curve analysis, for which I need to remove trials with excess blinking and then interpolate the remaining pupil values so that the final version has no NA values.

Right now, I have a dataframe with one ID variable ("sound"), one measure ("pupilsize") and 3 variables ("time", "valence" and "colour").

The "time" variable starts from 0 for each sound (each sound represents 1 trial) and increases in increments of 100 (ms). "valence" and "colour" have one value each for every sound.

I would like to eliminate all rows of each trial that has >50% NA values in the measure "pupilsize".

So far, I have attempted to use reshape2 to convert the file into wide format like so:

library(reshape2)
widedata <- dcast(data, time ~ sound + valence + colour, value.var = "pupilsize")

This generates columns whose names combine the sound, valence and colour (e.g. if sound = x.wav, valence = 1 and colour = 2, the column header is x.wav_1_2).

I've then successfully removed the columns with >50% NA values by calculating the % of NA values per sound and removing these from the dataframe.

I would like to convert this modified wide-format file back to the long-format using melt. However, I'm struggling to find a way of taking apart the column headers and turning them back into "sound", "valence" and "colour".

My question is therefore: Is there a way of splitting a header in wide format into its components (e.g. turning x.wav_1_2 into x.wav, 1 and 2)? If not, is there a way I could remove trials with >50% NA from the long format without reshaping?

Thank you for any help on this!

Edit (data examples):

The original long format (which is how I would like the data to look at the end):

    time    valence pupilsize colour sound
1   0          1    45.43       2   1300s.wav
2   100        1    43.22       2   1300s.wav
3   200        1    41.42       2   1300s.wav
4   300        1    40.09       2   1300s.wav
.
.
.
51  5000       1    43.02       2   1300.wav
52  0          2    55.5        1   5461.wav 
53  100        2    52.4        1   5461.wav

The wide format when I run dcast on the above data with time as the id.var and colour, valence and sound as the variables (pupilsize is the measure):

    time    1300s.wav_1_2   5461s.wav_2_1   ....
1   0          45.43            43.02   
2   100        43.43            55.5    
3   200        41.42            52.4    
4   300        40.09            50.2    
.
.
.
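On the header-splitting part of the question, a minimal base R sketch is given here. It assumes that valence and colour never contain an underscore, so the last two underscore-separated pieces can be peeled off the end of the header; the `headers` vector, the `pattern` string and the `parts` data frame are hypothetical names used only for illustration.

```r
# Hypothetical wide-format headers of the form sound_valence_colour
headers <- c("1300s.wav_1_2", "5461s.wav_2_1")

# Three capture groups: the greedy first group keeps any underscores that
# belong to the sound name; the last two groups may not contain underscores.
pattern <- "^(.*)_([^_]+)_([^_]+)$"

parts <- data.frame(
  sound   = sub(pattern, "\\1", headers),
  valence = as.integer(sub(pattern, "\\2", headers)),
  colour  = as.integer(sub(pattern, "\\3", headers)),
  stringsAsFactors = FALSE
)

print(parts)
```

After melting the wide data back to long format, the same three `sub()` calls could be applied to the melted `variable` column (converted with `as.character()`) to rebuild the original columns.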

Hi, welcome to SO. This looks like a good question but could be made more clear by providing sample data. It is not quite clear what the starting data set looks like, and it is not quite clear what you want the final product to look like. — C8H10N4O2, Jun 29 '15 at 17:34
Going to wide format first could work, but is probably not necessary. As @C8H10N4O2 says, hard to say without sample data. [See here for reproducibility tips](http://stackoverflow.com/q/5963269/903061). — Gregor Thomas, Jun 29 '15 at 18:12
Thank you, I've added a data preview of the long and wide versions of the data. — Isabel Hutchison, Jun 29 '15 at 18:52

score 2 · Answer 1 · answered Jun 29 '15 at 18:17

Here's a guess:

library(dplyr)

group_by(your_data, sound) %>%
    mutate(prop_na = sum(is.na(pupilsize)) / n()) %>%
    filter(prop_na <= 0.5) %>%
    select(-prop_na)

From your description, it doesn't sound like valence or colour variables matter, so this process ignores them, grouping by the sound ID, calculating an NA proportion at the group level, and eliminating groups with more than 50% NAs. It ends by removing the temporary column.
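To see the filter in action, here is a small self-contained sketch on made-up data. The `toy` data frame and its values are invented for illustration and are not the asker's data; `mean(is.na(x))` is used as an equivalent way of writing `sum(is.na(x)) / n()`, and a final `ungroup()` is added so the result is a plain, ungrouped table.

```r
library(dplyr)

# Toy data (made up for illustration): two trials, four samples each
toy <- data.frame(
  sound     = rep(c("a.wav", "b.wav"), each = 4),
  time      = rep(seq(0, 300, by = 100), times = 2),
  pupilsize = c(45.4, NA, 41.4, 40.1,   # a.wav: 25% NA, should be kept
                NA, NA, NA, 50.2),      # b.wav: 75% NA, should be dropped
  stringsAsFactors = FALSE
)

kept <- toy %>%
  group_by(sound) %>%
  mutate(prop_na = mean(is.na(pupilsize))) %>%  # proportion of NA per trial
  filter(prop_na <= 0.5) %>%                    # keep trials with at most 50% NA
  select(-prop_na) %>%                          # drop the temporary column
  ungroup()

print(kept)
```

Only the rows of `a.wav` remain, including its single NA row, which is then available for interpolation.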

answered Jun 29 '15 at 18:17

Gregor Thomas

Thank you for your advice! I've tried the code above, but get the following error messages:

    > > group_by(dataAtoD, sound)
    Error: unexpected '>' in ">"
    > + mutate(prop_na = sum(is.na(pupil_corr))/ n())
    Error in is.data.frame(.data) : argument ".data" is missing, with no default
    > + filter(prop_na <- 0.5)
    Error in UseMethod("filter_") : no applicable method for 'filter_' applied to an object of class "c('double', 'numeric')"
    > + select(-prop_na)
    Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "c('double', 'numeric')"

– Isabel Hutchison Jun 29 '15 at 18:48
I expect this is because "sound" is a string value. Would I have to temporarily replace the sound names with numbers to make this work? – Isabel Hutchison Jun 29 '15 at 18:51
@Gregor You could simplify to `group_by(your_data, sound) %>% filter(sum(is.na(pupilsize)) / n() <= 0.5)` – Steven Beaupré Jun 29 '15 at 19:00
@IsabelHutchison no, sound being a string has nothing to do with that--works for numeric or factor or string (or Date or POSIX or...). Make sure you didn't miss a parenthesis. Also try running the first line, the first two lines, the first three lines, etc., to see where the problem is. – Gregor Thomas Jun 29 '15 at 19:06
@StevenBeaupré True, but I like the transparency of the code in my answer---easy to run part of it and "see" how it works. – Gregor Thomas Jun 29 '15 at 19:07
@Gregor I tried running it line by line (I can't find any parentheses errors). The grouping command just reproduces the data normally, with no error message. The mutate line results in the following error message: `Error in is.data.frame(.data) : argument ".data" is missing, with no default`. I tried replacing n() with 51 (the number of rows per sound-trial) and replacing pupilsize with data$pupilsize, but I get the same error message each time. – Isabel Hutchison Jun 29 '15 at 19:59
I've found this related question but am unsure how I could apply it: http://stackoverflow.com/questions/28953269/why-do-i-sometimes-have-to-enclose-in-data-frame-for-a-named-argument-in – Isabel Hutchison Jun 29 '15 at 20:03
Are your `magrittr` and `dplyr` packages up-to-date? Maybe `update.packages()`... – Gregor Thomas Jun 29 '15 at 20:04
@Gregor thanks for the tip. I've just updated both packages and now I'm getting the error message: `Error in eval(expr, envir, enclos) : object 'pupilsize' not found`. I have the newest version of R (3.2.1) and am running this in RStudio. How is it not recognizing a variable within the dataframe? Do I need to mention it again in each line? – Isabel Hutchison Jun 29 '15 at 20:31
No, you don't. Maybe you also loaded `plyr` and loaded it after you loaded `dplyr`? If so you could specify `dplyr::mutate()` instead of `mutate()`. – Gregor Thomas Jun 29 '15 at 21:00
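If a masking conflict with `plyr` is the cause, one way to rule it out is to call every verb with an explicit `dplyr::` prefix and to run the steps one at a time instead of piping. The sketch below assumes a reasonably recent dplyr and uses a made-up data frame `dat` in place of the asker's data; the intermediate names `grouped`, `with_prop` and `kept` are hypothetical. Note that the comparison in the filter step is `<=`, not the assignment operator `<-`.

```r
# Made-up stand-in for the real data: trial x.wav has 1 of 3 values missing,
# trial y.wav has 2 of 3 values missing.
dat <- data.frame(
  sound     = rep(c("x.wav", "y.wav"), each = 3),
  pupilsize = c(45.4, NA, 41.4, NA, NA, 50.2),
  stringsAsFactors = FALSE
)

# Each step names the dplyr function explicitly, so a later library(plyr)
# cannot change which mutate() or n() is used.
grouped   <- dplyr::group_by(dat, sound)
with_prop <- dplyr::mutate(grouped, prop_na = sum(is.na(pupilsize)) / dplyr::n())
kept      <- dplyr::filter(with_prop, prop_na <= 0.5)
kept      <- dplyr::ungroup(dplyr::select(kept, -prop_na))

print(kept)
```

Inspecting `with_prop` before filtering shows the per-trial NA proportion, which makes it easy to check which step misbehaves.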