I'm at the last stage of cleaning/organizing data and would appreciate suggestions for this step. I'm new to R and don't understand fully how dataframes or other data types work. (I'm trying to learn but have a project due so need a quick solution). I've imported the data from a CSV file.
I want to group instances with the same (date, ID1, ID2, ID3). I want the average of all stats in the output and also a new column with the number of instances grouped.
Note: ID3 contains . I'd like to rename these to "na" before grouping I've tried solutions
tdata$ID3[is.na(tdata$ID3)] <- "NA"
tdata[["ID3"]][is.na(tdata[["ID3"]])] <- "NA"
But get Error:
In `[<-.factor`(`*tmp*`, is.na(tdata[["ID3"]]), value = c(3L, 3L, :
invalid factor level, NA generated
The data is:
date ID1 ID2 ID3 stat1 stat2 stat.3
1 12-03-07 abc123 wxy456 pqr123 10 20 30
2 12-03-07 abc123 wxy456 pqr123 20 40 60
3 10-04-07 bcd456 wxy456 hgf356 10 20 40
4 12-03-07 abc123 wxy456 pqr123 30 60 90
5 5-09-07 spa234 int345 <NA> 40 50 70
Desired Output
date ID1, ID2, ID3, n, stat1, stat2, stat 3
12-03-07 abc123, wxy456, pqr457, 3, 20, 40, 60
10-04-07 bcd456, wxy456, hgf356, 1, 10, 20, 40
05-09-07 spa234, int345, big234, 1 , 40, 50, 70
I tried this solution: How to merge multiple data.frames and sum and average columns at the same time in R
But I was not successful merging the columns which have to be grouped and tested for similarity.
DF <- merge(tdata$date, tdata$ID1, tdata$ID2, tdata$ID3, by = "Name", all = T)
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
Finally, to generate the n column. Perhaps insert a rows of 1s and use the sum of the column while summarizing?