1

I'm at the last stage of cleaning/organizing data and would appreciate suggestions for this step. I'm new to R and don't understand fully how dataframes or other data types work. (I'm trying to learn but have a project due so need a quick solution). I've imported the data from a CSV file.

I want to group instances with the same (date, ID1, ID2, ID3). I want the average of all stats in the output and also a new column with the number of instances grouped.

Note: ID3 contains . I'd like to rename these to "na" before grouping I've tried solutions

tdata$ID3[is.na(tdata$ID3)] <- "NA"   
tdata[["ID3"]][is.na(tdata[["ID3"]])] <- "NA"  

But get Error:

In `[<-.factor`(`*tmp*`, is.na(tdata[["ID3"]]), value = c(3L, 3L,  :
  invalid factor level, NA generated

The data is:

      date     ID1     ID2     ID3 stat1 stat2 stat.3
1 12-03-07  abc123  wxy456  pqr123    10    20     30  
2 12-03-07  abc123  wxy456  pqr123    20    40     60  
3 10-04-07  bcd456  wxy456  hgf356    10    20     40  
4 12-03-07  abc123  wxy456  pqr123    30    60     90  
5  5-09-07  spa234  int345   <NA>     40    50     70  

Desired Output

date      ID1,     ID2,   ID3,   n, stat1, stat2, stat 3  
12-03-07 abc123, wxy456, pqr457, 3,   20,   40, 60  
10-04-07 bcd456, wxy456, hgf356, 1,   10,   20, 40  
05-09-07 spa234, int345, big234, 1 ,  40,   50, 70  

I tried this solution: How to merge multiple data.frames and sum and average columns at the same time in R

But I was not successful merging the columns which have to be grouped and tested for similarity.

DF <- merge(tdata$date, tdata$ID1, tdata$ID2, tdata$ID3, by = "Name", all = T)



Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns 

Finally, to generate the n column. Perhaps insert a rows of 1s and use the sum of the column while summarizing?

Community
  • 1
  • 1
MyLeftS0ck
  • 33
  • 8

2 Answers2

3

We can do this with dplyr. After grouping by the 'ID' columns, add 'date' and 'n' also in the grouping variables, and get the mean of 'stat' columns

library(dplyr)
df1 %>%
   group_by(ID1, ID2, ID3) %>%
   group_by(date = first(date), n =n(), add=TRUE)  %>% 
   summarise_at(vars(matches("stat")), mean)

NOTE: Regarding change the 'NA' to 'big234', we can convert the 'ID3' to character class and change it before doing the above operation

df1$ID3 <- as.character(df1$ID3)
df1$ID3[is.na(df1$ID3)] <- "big234"
akrun
  • 874,273
  • 37
  • 540
  • 662
1

While I find the dplyr solution proposed by akrun very intuitive to use, there is also a nice data.table solution:

Similarly as akrun, I assume that the NA value has been converted to "big234" to get the desired result.

library(data.table)
# convert data.frame to data.table
data <- data.table(df1)

# return the desired output
data[, c(.N, lapply(.SD, mean)),
     by = list(date, ID1,ID2, ID3)]
ira
  • 2,542
  • 2
  • 22
  • 36