Removing Duplicate rows and Calculate the Average in a Dataframe in R

Question

I provide a small example of a large data frame I am working >1,000 columns & >200 rows. I would like to make per each repetitive row per column a single row instead of two rows, and at the same time take the average of each value related to the Column # 1 (The columns for average are Col 6-8 and so on).

Col1    Col2    Col3    Col4    Col5    Col6        Col7        Col8        Col9        Col10
A-001   7       40      1       J       3.985645    4.231623    2.36987     9.36545852  8.369663
A-001   7       40      1       J       1.458798    7.652123    1.236985    4.236987    1.22556633
B-002   8       50      0       K       5.00212     8.369562    7.4569852   5.36692     4.6632121
B-002   8       50      0       K       9.02336     1.2120145   3.0014588   8.214569    5.2223698
C-003   10      60      1       L       1.451203    5.321455    8.25963     2.03369878  4.3336988
C-003   10      60      1       L       1.65653     2.369898    8.2136999   7.21458777  5.3366
D-004   3       70      0       M       5.323211    1.147852    7.20014     5.36989     2.36555
D-004   3       70      0       M       4.36969     5.231478    4.23698     3.645478    9.214563
E-005   4       80      1       N       8.123256    9.2356478   5.3696      4.698889    7.366695
E-005   4       80      1       N       7.9632145   0.004555    1.24789     7.3696969   1.23655

Expected Output:

col 1   col 2   col 3   col 4   col 5     col 6
A-001       7      40       1       J   2.49282
B-002       8      50       0       K   7.01274
C-003      10      60       1       L   1.55387
D-004       3      70       0       M   4.84645
E-005       4      80       1       N   8.04324

My apologize in advance for not making it in the proper format Anticipated thanks for any help provided in advance

dd <- structure(list(col1 = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 
4L, 5L, 5L), .Label = c("A-001", "B-002", "C-003", "D-004", "E-005"
), class = "factor"), col2 = structure(c(1L, 1L, 2L, 2L, 3L, 
3L, 4L, 4L, 5L, 5L), .Label = c("7", "8", "10", "3", "4"), class = "factor"), 
    col3 = c(40L, 40L, 50L, 50L, 60L, 60L, 70L, 70L, 80L, 80L
    ), col4 = c(1, 1, 0, 0, 1, 1, 0, 0, 1, 1), col5 = c(JL, JL, KL, KL, LL, LL, ML, ML, NL, 
    NL), col6 = c(3.985645, 1.458798, 5.00212, 9.02336, 1.451203, 1.65653, 5.323211,  
    4.36969, 8.123256, 7.9632145), col7 = c(4.231623, 7.652123, 8.369562, 1.2120145,  
    5.321455, 2.369898, 1.147852, 5.231478, 9.2356478, 0.004555), col8 = c(2.36987,  
    1.236985, 7.4569852, 3.0014588, 8.25963, 8.2136999, 7.20014, 4.23698, 5.3696,  
    1.24789), col9 = c(9.36545852, 4.236987, 5.36692, 8.214569, 2.03369878, 7.21458777,  
    5.36989, 3.645478, 4.698889, 7.3696969), col10 = c(8.369663, 1.22556633, 4.6632121,  
    5.2223698, 4.3336988, 5.3366, 2.36555, 9.214563, 7.366695, 1.23655)), .Names =  
    c("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10"),  
    class = "data.frame", row.names = c(NA, -10L))

you should google "R find duplicate rows" I quickly saw a lot, including this http://stackoverflow.com/questions/13967063/remove-duplicate-rows-in-r — Michael Bellhouse, Jul 04 '15 at 00:59
Hint: http://stackoverflow.com/search?q=[r]+remove+duplicate+rows — , Jul 04 '15 at 01:03
`aggregate(. ~ col1 + col2, data = dd, FUN = mean)` might get slow for large data frame, however — rawr, Jul 04 '15 at 01:08
What is supposed to happen with the columns you are not averaging (9:10)? — Rorschach, Jul 04 '15 at 01:20
The output I provided was to give an example, but is the same approach as the column #6 — , Jul 04 '15 at 01:22
Is the same expected output for all the columns after col #6 — , Jul 04 '15 at 01:23

Rorschach · Accepted Answer · 2015-07-04T15:01:30.020

3

This uses dplyr which will be fast for large data. The first part pulls out unique columns that aren't being aggregated, the second part does the aggregation (means by col1 group). Then the two parts are bound together (bind_cols). I renamed your columns to remove spaces, if you leave spaces you need to use backticks with names (ie `col 1`)

library(dplyr)
cols <- 6:10  # columns to average
bind_cols(
    dat %>% distinct(col1) %>% .[,-cols],  # pull out columns we aren't aggregating
    dat[,c(1, cols)] %>% group_by(col1) %>%
        summarise_each(funs(mean)) %>% .[,-1]  # aggregate other columns
)

#    col1 col2 col3 col4 col5      col6       col7      col8      col9      col10
# 1 A-001    7   40    1    J   2.49282    5.94187   1.80343   6.80122    4.79762
# 2 B-002    8   50    0    K   7.01274    4.79079   5.22922   6.79075    4.94279
# 3 C-003   10   60    1    L   1.55387    3.84568   8.23667   4.62414    4.83515
# 4 D-004    3   70    0    M   4.84645    3.18967   5.71856   4.50768    5.79006
# 5 E-005    4   80    1    N   8.04324    4.62010   3.30875   6.03429    4.30162

Edit

That was way overly complicated, thanks to @StevenBeaupré the simple version is (the *_each functions can drop columns of course!)

dat %>% group_by(col1) %>% mutate_each(funs(mean), -(1:5)) %>% distinct

edited Jul 04 '15 at 15:01

answered Jul 04 '15 at 01:14

Rorschach

31,301
5
78
129

Thank you your approach helped me,really appreciated, I will try this approach with the original dataset and see how it works. – Jul 04 '15 at 01:48
When I run the code, the average of the samples work very well but the other columns and the Ids are duplicated. – Jul 04 '15 at 02:10
Never mind was my own mistake, the code is fine, now I fixed the mistake. Thank you. – Jul 04 '15 at 02:30
2

This seems overly convoluted. Why not simply: `dd %>% group_by(col1) %>% mutate_each(funs(mean), -(1:5)) %>% distinct` – Steven Beaupré Jul 04 '15 at 14:44
1

@StevenBeaupré o man, that's so much nicer! mind if I update it to that? – Rorschach Jul 04 '15 at 14:56

score 2 · Answer 2 · edited Aug 03 '15 at 17:03

Using data.table

library(data.table)#v1.9.5+
cols <- 6:10
setDT(dd)[, lapply(.SD, mean), by=c(names(dd)[1:5]), .SDcols=cols]
#    col1 col2 col3 col4 col5      col6       col7      col8      col9
#1: A-001    7   40    1    J   2.49282    5.94187   1.80343   6.80122
#2: B-002    8   50    0    K   7.01274    4.79079   5.22922   6.79075
#3: C-003   10   60    1    L   1.55387    3.84568   8.23667   4.62414
#4: D-004    3   70    0    M   4.84645    3.18967   5.71856   4.50768
#5: E-005    4   80    1    N   8.04324    4.62010   3.30875   6.03429
#     col10
#1: 4.79762
#2: 4.94279
#3: 4.83515
#4: 5.79006
#5: 4.30162

NOTE: In the example provided, 2nd to 5th column values are same for each group of 'col1'. So, we can use 1:5 columns are the grouping variable.

If the lengths of unique values in col2:col5 are >1 for each 'col1' variable and want to keep the first row of col2:col5 for each col1 group

DT1 <- setDT(dd)[, lapply(.SD, mean), by = col1, .SDcols=cols]
setkey(unique(dd[,-cols, with=FALSE], by='col1'),col1)[DT1]
#    col1 col2 col3 col4 col5      col6       col7      col8      col9
#1: A-001    7   40    1    J   2.49282    5.94187   1.80343   6.80122
#2: B-002    8   50    0    K   7.01274    4.79079   5.22922   6.79075
#3: C-003   10   60    1    L   1.55387    3.84568   8.23667   4.62414
#4: D-004    3   70    0    M   4.84645    3.18967   5.71856   4.50768
#5: E-005    4   80    1    N   8.04324    4.62010   3.30875   6.03429
#     col10
#1: 4.79762
#2: 4.94279
#3: 4.83515
#4: 5.79006
#5: 4.30162

Removing Duplicate rows and Calculate the Average in a Dataframe in R

2 Answers2

Edit

Linked