1

Does anybody know how to sum rows of selected columns under a particular condition?

For instance, I have five columns whose rows are sorted by year from 2000 to 2008. I need to sum only those rows that are in "year<2006" and add a new total column (with NA's since other years weren't involved).

I suppose group_by won't work because I do not need to sum by group

My data is

A <- c(1,2,3,4,5,6,7,8,9,10)
B <- c(1,2,3,4,5,6,7,8,9,10)
Year <- c(2000, 2001, 2000, 2001, 2003, 2004, 2005, 2006, 2007, 2008)

dta <- as.data.frame(A,B, Year)

I would love to obtain something like

TotalColumn Year
 2          2000
 4          2001      
 6          2000
 8          2001
 10         2003
 12         2004
 14         2005
 NA         2006
 NA         2007
 NA         2008
Laura
  • 306
  • 3
  • 12
  • 2
    It's easier to help if you provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) that makes it clear what your input is and what your desired output is. That way possible solutions can be tested. – MrFlick Mar 14 '17 at 19:05
  • And an ideal reproducible example in R involves creating a sample dataframe with the command `your_table <- data.frame(...)` – Rodrigo Mar 14 '17 at 19:11
  • 1
    `dta <- (A,B, Year)` is not proper R code and will give an error. – Frank Mar 14 '17 at 19:52
  • thank you. now it should work – Laura Mar 14 '17 at 20:04

3 Answers3

3

ifelse may be a good choice here:

 A <- c(1,2,3,4,5,6,7,8,9,10)
 B <- c(1,2,3,4,5,6,7,8,9,10)
 Year <- c(2000, 2001, 2000, 2001, 2003, 2004, 2005, 2006, 2007, 2008)

 dta <- as.data.frame(cbind(rep(NA, each = length(A)), Year))
 colnames(dta) <- c("TotalColumn", "Year")
 dta$TotalColumn <- ifelse(dta$Year < 2006, A + B, NA)

Output of dta

   TotalColumn Year
1            2 2000
2            4 2001
3            6 2000
4            8 2001
5           10 2003
6           12 2004
7           14 2005
8           NA 2006
9           NA 2007
10          NA 2008
Phantom Photon
  • 768
  • 2
  • 10
  • 20
  • thank you a lot. however, I need to sum only those rows that are <2006 – Laura Mar 14 '17 at 19:36
  • Unfortunately, it says `Warning message: In table(dta$TotalColumn) : data set ‘df$TotalColumn’ not found` – Laura Mar 14 '17 at 20:48
  • It is defined as `dta`, so you will need to call `dta$TotalColumn`, `table(dta$TotalColumn)` gives me no warning messages with the base R package (R version 3.3.2). – Phantom Photon Mar 14 '17 at 20:53
  • It gives me NAs – Laura Mar 14 '17 at 20:56
  • If you applied the above code exactly there should be no NA's in the `table`. Check the `str` of the data frame with `str(dta)`, I'm getting: `> str(dta) 'data.frame': 10 obs. of 2 variables: $ TotalColumn: num 2 4 6 8 10 12 14 NA NA NA $ Year : num 2000 2001 2000 2001 2003 ...` – Phantom Photon Mar 14 '17 at 21:03
  • The data you're adding should be numeric or `num`, if you try to add `factor` data to one another you're going to get NA's. When you read in your data, try `read.csv("...", stringsAsFactors = FALSE)` as an argument, or just make sure: `A <- as.numeric(A)` etc. – Phantom Photon Mar 14 '17 at 21:06
  • I think you have factors in your data frame and you need to convert them to numeric before adding. Try: `dta$TotalColumn <- ifelse(dta$Year < 2006, as.numeric(A) + as.numeric(B), NA)` – Phantom Photon Mar 14 '17 at 21:15
2

Using data.table (updated based on Frank's comment)

library(data.table)
A <- c(1,2,3,4,5,6,7,8,9,10)
B <- c(1,2,3,4,5,6,7,8,9,10)
Year <- c(2000, 2001, 2000, 2001, 2003, 2004, 2005, 2006, 2007, 2008)
dta <- data.table(A, B, Year)
dta[Year < 2006, TotalColumn := A+B][, .(TotalColumn, Year)]

Or you can use Frank's suggestion to edit dta in place by replacing the last line with

dta[Year < 2006, TotalColumn := A+B][, c("A", "B") := NULL]

Result:

    TotalColumn Year
 1:           2 2000
 2:           4 2001
 3:           6 2000
 4:           8 2001
 5:          10 2003
 6:          12 2004
 7:          14 2005
 8:          NA 2006
 9:          NA 2007
10:          NA 2008
Woody
  • 48
  • 6
  • Fyi, `data.table(A, B, Year)` will auto assign names, just like `data.frame()` does. Also, you could drop A and B instead of filtering them out, like `[, c("A","B") := NULL ]` – Frank Mar 14 '17 at 19:53
  • It says "Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":=")." Have you had that before? – Laura Mar 14 '17 at 20:18
  • I also have to add that new variable TotalColumn to an existing data frame. Therefore, I cannot use data.table. – Laura Mar 14 '17 at 20:36
  • @Laura use `setDT(dta)` before you try this solution. and "* I also have to add that new variable TotalColumn to an existing data frame. Therefore, I cannot use data.table.*" <- that sentence doesn't make any sense whatsoever. – David Arenburg Mar 14 '17 at 21:00
  • sorry, I thought it was something to do with data.table. I tried to use `setDT()` but not it gives me NAs. – Laura Mar 14 '17 at 21:08
0

Try using the by_row function in purrr

A <- c(1,2,3,4,5,6,7,8,9,10)
B <- c(1,2,3,4,5,6,7,8,9,10)
Year <- c(2000, 2001, 2000, 2001, 2003, 2004, 2005, 2006, 2007, 2008)
dta <- data.frame(A,B, Year)


Total_col <- dta %>% 
  filter(Year < 2006) %>% 
  select(A,B) %>% 
  purrr::by_row(sum, .collate = "cols", .to = "Total_Col") 

yr_total_Col <- dta %>% filter(Year < 2006) %>% select(Year)
Total_col <- cbind(Total_col,yr_total_Col)

This should give you.

dta.x <- full_join(dta,Total_col) %>% select(Year,Total_Col)

#       Year Total_Col
#   1  2000         2
#   2  2001         4
#   3  2000         6
#   4  2001         8
#   5  2003        10
#   6  2004        12
#   7  2005        14
#   8  2006        NA
#   9  2007        NA
#   10 2008        NA
msubbaiah
  • 350
  • 2
  • 14
  • It says "replacement has 21739 rows, data has 977". In my original data, there are 977 rows. – Laura Mar 14 '17 at 20:40
  • I think you might want to add the year column to the Total_col, so that you can have a common column when merging everything back together. I used full_join from dpylr. See the edits above. – msubbaiah Mar 14 '17 at 20:47
  • The thing is that I already have a data frame with all columns. I have to add that new column to already existing data set. – Laura Mar 14 '17 at 20:53
  • Yep, it would be easier to add the new column to the existing dataset if you have a ID column in both. So my suggestion would be to put the new column in a separate data frame that has two columns (new column data, year identifier). Then when you merge with the existing data frame, the join function will merge along the year column (which should be the same in both) – msubbaiah Mar 14 '17 at 20:55
  • Sadly, this did not work out. Is there any other way to do it? – Laura Mar 14 '17 at 22:11