3

I provide a small example of a large data frame I am working >1,000 columns & >200 rows. I would like to make per each repetitive row per column a single row instead of two rows, and at the same time take the average of each value related to the Column # 1 (The columns for average are Col 6-8 and so on).


Col1    Col2    Col3    Col4    Col5    Col6        Col7        Col8        Col9        Col10
A-001   7       40      1       J       3.985645    4.231623    2.36987     9.36545852  8.369663
A-001   7       40      1       J       1.458798    7.652123    1.236985    4.236987    1.22556633
B-002   8       50      0       K       5.00212     8.369562    7.4569852   5.36692     4.6632121
B-002   8       50      0       K       9.02336     1.2120145   3.0014588   8.214569    5.2223698
C-003   10      60      1       L       1.451203    5.321455    8.25963     2.03369878  4.3336988
C-003   10      60      1       L       1.65653     2.369898    8.2136999   7.21458777  5.3366
D-004   3       70      0       M       5.323211    1.147852    7.20014     5.36989     2.36555
D-004   3       70      0       M       4.36969     5.231478    4.23698     3.645478    9.214563
E-005   4       80      1       N       8.123256    9.2356478   5.3696      4.698889    7.366695
E-005   4       80      1       N       7.9632145   0.004555    1.24789     7.3696969   1.23655

Expected Output:

col 1   col 2   col 3   col 4   col 5     col 6
A-001       7      40       1       J   2.49282
B-002       8      50       0       K   7.01274
C-003      10      60       1       L   1.55387
D-004       3      70       0       M   4.84645
E-005       4      80       1       N   8.04324

My apologize in advance for not making it in the proper format Anticipated thanks for any help provided in advance

dd <- structure(list(col1 = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 
4L, 5L, 5L), .Label = c("A-001", "B-002", "C-003", "D-004", "E-005"
), class = "factor"), col2 = structure(c(1L, 1L, 2L, 2L, 3L, 
3L, 4L, 4L, 5L, 5L), .Label = c("7", "8", "10", "3", "4"), class = "factor"), 
    col3 = c(40L, 40L, 50L, 50L, 60L, 60L, 70L, 70L, 80L, 80L
    ), col4 = c(1, 1, 0, 0, 1, 1, 0, 0, 1, 1), col5 = c(JL, JL, KL, KL, LL, LL, ML, ML, NL, 
    NL), col6 = c(3.985645, 1.458798, 5.00212, 9.02336, 1.451203, 1.65653, 5.323211,  
    4.36969, 8.123256, 7.9632145), col7 = c(4.231623, 7.652123, 8.369562, 1.2120145,  
    5.321455, 2.369898, 1.147852, 5.231478, 9.2356478, 0.004555), col8 = c(2.36987,  
    1.236985, 7.4569852, 3.0014588, 8.25963, 8.2136999, 7.20014, 4.23698, 5.3696,  
    1.24789), col9 = c(9.36545852, 4.236987, 5.36692, 8.214569, 2.03369878, 7.21458777,  
    5.36989, 3.645478, 4.698889, 7.3696969), col10 = c(8.369663, 1.22556633, 4.6632121,  
    5.2223698, 4.3336988, 5.3366, 2.36555, 9.214563, 7.366695, 1.23655)), .Names =  
    c("col1", "col2", "col3", "col4", "col5", "col6", "col7", "col8", "col9", "col10"),  
    class = "data.frame", row.names = c(NA, -10L))
JNat
  • 1,456
  • 4
  • 34
  • 37

2 Answers2

3

This uses dplyr which will be fast for large data. The first part pulls out unique columns that aren't being aggregated, the second part does the aggregation (means by col1 group). Then the two parts are bound together (bind_cols). I renamed your columns to remove spaces, if you leave spaces you need to use backticks with names (ie `col 1`)

library(dplyr)
cols <- 6:10  # columns to average
bind_cols(
    dat %>% distinct(col1) %>% .[,-cols],  # pull out columns we aren't aggregating
    dat[,c(1, cols)] %>% group_by(col1) %>%
        summarise_each(funs(mean)) %>% .[,-1]  # aggregate other columns
)

#    col1 col2 col3 col4 col5      col6       col7      col8      col9      col10
# 1 A-001    7   40    1    J   2.49282    5.94187   1.80343   6.80122    4.79762
# 2 B-002    8   50    0    K   7.01274    4.79079   5.22922   6.79075    4.94279
# 3 C-003   10   60    1    L   1.55387    3.84568   8.23667   4.62414    4.83515
# 4 D-004    3   70    0    M   4.84645    3.18967   5.71856   4.50768    5.79006
# 5 E-005    4   80    1    N   8.04324    4.62010   3.30875   6.03429    4.30162

Edit

That was way overly complicated, thanks to @StevenBeaupré the simple version is (the *_each functions can drop columns of course!)

dat %>% group_by(col1) %>% mutate_each(funs(mean), -(1:5)) %>% distinct
Rorschach
  • 31,301
  • 5
  • 78
  • 129
  • Thank you your approach helped me,really appreciated, I will try this approach with the original dataset and see how it works. –  Jul 04 '15 at 01:48
  • When I run the code, the average of the samples work very well but the other columns and the Ids are duplicated. –  Jul 04 '15 at 02:10
  • Never mind was my own mistake, the code is fine, now I fixed the mistake. Thank you. –  Jul 04 '15 at 02:30
  • 2
    This seems overly convoluted. Why not simply: `dd %>% group_by(col1) %>% mutate_each(funs(mean), -(1:5)) %>% distinct` – Steven Beaupré Jul 04 '15 at 14:44
  • 1
    @StevenBeaupré o man, that's so much nicer! mind if I update it to that? – Rorschach Jul 04 '15 at 14:56
2

Using data.table

library(data.table)#v1.9.5+
cols <- 6:10
setDT(dd)[, lapply(.SD, mean), by=c(names(dd)[1:5]), .SDcols=cols]
#    col1 col2 col3 col4 col5      col6       col7      col8      col9
#1: A-001    7   40    1    J   2.49282    5.94187   1.80343   6.80122
#2: B-002    8   50    0    K   7.01274    4.79079   5.22922   6.79075
#3: C-003   10   60    1    L   1.55387    3.84568   8.23667   4.62414
#4: D-004    3   70    0    M   4.84645    3.18967   5.71856   4.50768
#5: E-005    4   80    1    N   8.04324    4.62010   3.30875   6.03429
#     col10
#1: 4.79762
#2: 4.94279
#3: 4.83515
#4: 5.79006
#5: 4.30162

NOTE: In the example provided, 2nd to 5th column values are same for each group of 'col1'. So, we can use 1:5 columns are the grouping variable.

If the lengths of unique values in col2:col5 are >1 for each 'col1' variable and want to keep the first row of col2:col5 for each col1 group

DT1 <- setDT(dd)[, lapply(.SD, mean), by = col1, .SDcols=cols]
setkey(unique(dd[,-cols, with=FALSE], by='col1'),col1)[DT1]
#    col1 col2 col3 col4 col5      col6       col7      col8      col9
#1: A-001    7   40    1    J   2.49282    5.94187   1.80343   6.80122
#2: B-002    8   50    0    K   7.01274    4.79079   5.22922   6.79075
#3: C-003   10   60    1    L   1.55387    3.84568   8.23667   4.62414
#4: D-004    3   70    0    M   4.84645    3.18967   5.71856   4.50768
#5: E-005    4   80    1    N   8.04324    4.62010   3.30875   6.03429
#     col10
#1: 4.79762
#2: 4.94279
#3: 4.83515
#4: 5.79006
#5: 4.30162
JNat
  • 1,456
  • 4
  • 34
  • 37
akrun
  • 874,273
  • 37
  • 540
  • 662