I have recorded pupil size in response to emotional vs. neutral sounds, each of which was one of two colours. I am preparing the data for growth curve analysis, for which I need to remove trials with excessive blinking and then interpolate the remaining pupil values so that the final version has no NA values.
Right now, I have a dataframe with one ID variable ("sound"), one measure ("pupilsize") and 3 variables ("time", "valence" and "colour").
The "time" variable starts from 0 for each sound (each sound represents 1 trial) and increases in increments of 100 (ms). "valence" and "colour" have one value each for every sound.
I would like to eliminate all rows of every trial that has >50% NA values in the measure "pupilsize".
So far, I have attempted to use reshape2 to convert the file into wide format like so:
widedata <- dcast(data, time ~ sound + valence + colour, value.var = "pupilsize")
This generates columns whose names combine the sound, valence and colour (e.g. if sound = x.wav, valence = 1 and colour = 2, the column header is x.wav_1_2).
I've then successfully removed the columns with >50% NA values by calculating the % of NA values per sound and dropping those columns from the dataframe.
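A minimal sketch of what that column-dropping step might look like in base R. The toy table, the file names a.wav and b.wav, and the object names `trial_cols`, `na_prop` and `widedata_clean` are made up for illustration; this is not necessarily the exact code used.

```r
# Toy wide table (hypothetical values), one column per trial.
widedata <- data.frame(
  time        = c(0, 100, 200, 300),
  `a.wav_1_2` = c(45.4, NA, 41.4, 40.1),  # 25% NA, should be kept
  `b.wav_2_1` = c(NA, NA, NA, 50.2),      # 75% NA, should be dropped
  check.names = FALSE
)

# Proportion of NA per trial column (every column except time).
trial_cols <- setdiff(names(widedata), "time")
na_prop <- colMeans(is.na(widedata[trial_cols]))

# Keep time plus the trial columns with at most 50% NA.
keep <- names(na_prop)[na_prop <= 0.5]
widedata_clean <- widedata[c("time", keep)]
```

Using `na_prop <= 0.5` keeps a trial with exactly 50% NA, which matches the ">50%" removal rule.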
I would like to convert this modified wide-format file back to long format using melt. However, I'm struggling to find a way of taking apart the column headers and turning them back into "sound", "valence" and "colour".
My question is therefore:
Is there a way of splitting a header in wide format into its components (e.g. turning x.wav_1_2 into x.wav, 1 and 2)?
If not, is there a way I could remove trials with >50% NA from the long format without reshaping?
Thank you for any help on this!
Edit (data examples):
The original long format (which is how I would like the data to look at the end)
   time valence pupilsize colour     sound
1     0       1     45.43      2 1300s.wav
2   100       1     43.22      2 1300s.wav
3   200       1     41.42      2 1300s.wav
4   300       1     40.09      2 1300s.wav
.
.
.
51 5000       1     43.02      2  1300.wav
52    0       2      55.5      1  5461.wav
53  100       2      52.4      1  5461.wav
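For a long table shaped like the one above, trials with >50% NA could probably be dropped without any reshaping. The following is a sketch in base R using `ave()` to compute a per-trial NA proportion. The toy values, the file names a.wav and b.wav, and the names `na_prop` and `data_clean` are assumptions for illustration.

```r
# Toy long table (hypothetical values): two trials, four samples each.
data <- data.frame(
  time      = rep(c(0, 100, 200, 300), times = 2),
  valence   = rep(c(1, 2), each = 4),
  pupilsize = c(45.43, NA, 41.42, 40.09,  # a.wav: 25% NA, should be kept
                NA, NA, NA, 50.2),        # b.wav: 75% NA, should be dropped
  colour    = rep(c(2, 1), each = 4),
  sound     = rep(c("a.wav", "b.wav"), each = 4),
  stringsAsFactors = FALSE
)

# For every row, the proportion of NA pupilsize values within its trial.
na_prop <- ave(as.numeric(is.na(data$pupilsize)), data$sound, FUN = mean)

# Keep only the rows of trials with at most 50% NA.
data_clean <- data[na_prop <= 0.5, ]
```

This assumes each value of "sound" identifies exactly one trial, as described above; if a sound could repeat, the grouping would need an additional trial identifier.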
The wide format when I run dcast on the above data with time as an id.var and colour, valence and sound as the variables (pupilsize is the measure):
  time 1300s.wav_1_2 5461s.wav_2_1 ....
1    0         45.43         43.02
2  100         43.43          55.5
3  200         41.42          52.4
4  300         40.09          50.2
.
.
.
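With headers like those in the wide table above, one possible way to take them apart is a regular expression that splits on the last two underscores. The sketch below uses base R `stack()` as a stand-in for melt so that it runs without extra packages; the same `sub()` calls should apply to the `variable` column that melt produces. The toy values and the names `long`, `header` and `pattern` are assumptions for illustration.

```r
# Toy wide table (hypothetical values); headers are sound_valence_colour.
widedata <- data.frame(
  time            = c(0, 100),
  `1300s.wav_1_2` = c(45.43, 43.22),
  `5461s.wav_2_1` = c(55.5, 52.4),
  check.names     = FALSE
)

# Base-R stand-in for melt(): stack the trial columns into long form.
trial_cols <- setdiff(names(widedata), "time")
long <- stack(widedata[trial_cols])
names(long) <- c("pupilsize", "header")
long$time   <- rep(widedata$time, times = length(trial_cols))
long$header <- as.character(long$header)

# Split on the last two underscores, so that any underscore
# inside a sound name stays part of the sound name.
pattern <- "^(.*)_([^_]+)_([^_]+)$"
long$sound   <- sub(pattern, "\\1", long$header)
long$valence <- as.numeric(sub(pattern, "\\2", long$header))
long$colour  <- as.numeric(sub(pattern, "\\3", long$header))
long$header  <- NULL
```

The pattern assumes that valence and colour themselves never contain an underscore, which holds for the numeric codes shown here.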