5

I have a dataframe which contains some duplicates. I want to sum rows of two columns where there is a duplicate and then delete the unwanted row.

Here is an example of the data,

Year    ID  Lats     Longs      N   n   c_id
2015    200 30.5417 -20.5254    150 30  4142
2015    200 30.5417 -20.5254    90  50  4142

I want to sum columns N and n into one row. the rest of the information i.e. Lats , Longs , ID and Year is to remain the same e.g.,

Year    ID  Lats    Long        N   n   c_id
2015    200 30.5417 -20.5254    240 80  4142
Sam
  • 7,252
  • 16
  • 46
  • 65
jonestats
  • 349
  • 1
  • 3
  • 10
  • 1
    duplicates with the same c_id like 4142 in the above case. c_id is what uniquely identifies each record. – jonestats Jan 04 '13 at 07:35

1 Answers1

5

Solution using data.table:

require(data.table)
df <- structure(list(year = c(2015, 2015), ID = c(200, 200), Lats = c(30.5417, 
            30.5417), Longs = c(-20.5254, -20.5254), N = c(150, 90), n = c(30, 
            50), c_id = c(4142, 4142)), .Names = c("year", "ID", "Lats", 
            "Longs", "N", "n", "c_id"), row.names = c(NA, -2L), 
            class = "data.frame")
dt <- data.table(df)
dt[, lapply(.SD, sum), by="c_id,year,ID,Lats,Longs"]

   c_id year  ID    Lats    Longs   N  n
1: 4142 2015 200 30.5417 -20.5254  240 80

Solution using plyr:

require(plyr)
ddply(df, .(c_id, year, ID, Lats, Longs), function(x) c(N=sum(x$N), n=sum(x$n)))

  c_id year  ID    Lats    Longs   N  n
1 4142 2015 200 30.5417 -20.5254 240 80
Wojciech Sobala
  • 7,431
  • 2
  • 21
  • 27
Arun
  • 116,683
  • 26
  • 284
  • 387
  • Am trying to understand this code. i have 9222 records, can this work with this number of records? – jonestats Jan 04 '13 at 07:40
  • 1
    +1 Btw, instead of `by=c("c_id", "year", "ID", "Lats", "Longs")` there's a shortcut: `by="c_id,year,ID,Lats,Longs"`. Comma separated names are just for column names only, though, no expressions. – Matt Dowle Jan 04 '13 at 09:11
  • 1
    Great. Also why wrap the result with `as.data.frame`? Was that just to make it identical to the result from plyr? If so, think it's ok to return a data.table, which is a data.frame too, otherwise newbies might get the impression they really do have to convert the result back to data.frame, which would indeed be a pain. – Matt Dowle Jan 04 '13 at 10:41
  • Hi, am trying the data.table option and i get this error Error in eval(expr, envir, enclos) : object ' YeStart' not found – jonestats Jan 04 '13 at 12:13
  • IDS[, lapply(.SD, sum), by="cid, YeStart, Cluster_ID, Lat, Long"] – jonestats Jan 04 '13 at 12:17
  • > names(IDS) [1] "YeStart" "Cluster_ID" "Lat" "Long" "Ex" "pfpr" "cid" > It works when i encolose " ", but shows results of the first and last five. – jonestats Jan 04 '13 at 12:28
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/22180/discussion-between-jonestats-and-arun) – jonestats Jan 04 '13 at 12:35
  • @jonestats Don't put spaces when comma separating. It's looking for " YeStart" (with a space at the start). data.table allows, maintains and likes spaces and other special characters in column names. – Matt Dowle Jan 04 '13 at 16:31
  • @Matthew, the problem was that he was working on the `data.frame`. He did not convert it to a `data.table`. Its resolved now. Pardon me. I forgot to write back here that its resolved. – Arun Jan 04 '13 at 16:40