Summing every 4 columns together (convert quarterly data to annual on a large dataset)

Question

I have a (large) dataframe of the form:

Variable    Country         2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 2008-Q3 2008-Q4
Var1        AR:Argentina    69.8    67.3    65  63.6    60.4    56.6    54.4    57.3
Var2        AR:Argentina    191.298 196.785 196.918 207.487 209.596 219.171 216.852 213.124
Var3        AR:Argentina    594.67  606.157 620.783 652.59  662.784 663.191 676.188 735.065
Var4        AR:Argentina    49.6    47.5    45.2    44.4    41.7    38.7    36.8    39.3
Var5        AR:Argentina    135.971 138.885 137.005 144.903 144.757 149.803 146.492 146.102
Var6        AR:Argentina    422.679 427.808 431.909 455.75  457.752 453.288 456.791 503.906
Var8        AR:Argentina    9.657   10.755  11.856  13.267  14.47   16.523  16.727  16.235

Essentially, every row has 4 columns of data for each year, sorted by quarter. I want to turn this into annual data.

One way to do this is to simply sum up every 4 columns (so 2008-Q1:2008-Q4 would be summed, for example). Another way, I suppose, would be to filter columns that share a common year in (2007-** or something) and then run RowSums() on them but this sounds like it would be much more inefficient.

I'm hoping to get back a data frame that looks like:

Variable Country 2007 2008
Var1 AR:Argentina SUMXX SUMXX

Or even better:

Country Year Var1 Var2
AR:Argentina 2007 SUMXX SUMYY
AR:Argentina 2008 SUMXX SUMYY.

The second format would be much preferred but the first format is fine as well. The main thing is that I need to be able to select the data for a variable, for a country, for all years -relatively easily. If I can select for all countries all years for any given variable - even better (second format).

Is there any easy way to accomplish this, aside from running a nested loop etc?

You could try `split.default`, see this one: https://stackoverflow.com/a/24024683/3926543 — mt1022, Jun 30 '17 at 04:34
For this type of dataframe, you can use `sapply(seq(2,ncol(df),by=4),function(x) sum(df[,(x:(x+3))]))` — user2100721, Jun 30 '17 at 04:35

Ajay Ohri · Answer 1 · 2017-06-30T04:47:30.563

2

How about transposing data t(df) to convert columns into rows and them use moving average

Calculating moving average

library(TTR)
sma(AR:Argentina,4)

for faster moving averages you may want to see Rccp options at link above

another strategy is

a=ncol(df)

then summing

df[1:4]
df[5:8]

till

df[a-3,a]

y=NULL
for (i in 1:a-3){
y[i]=sum(df[i:i+3]
} 

Pseudocode

edited Jun 30 '17 at 04:47

answered Jun 30 '17 at 04:36

Ajay Ohri

3,382
3
30
60

P.R · Accepted Answer · 2017-06-30T18:11:54.447

I don't really understand what 'AR:Argentina' means but I was able to get the column sums you want. I added an id column to identify the rows.

library(tidyr)
library(dplyr)
df <- data%>%gather(year_quarter,value,-X1,-id)%>%
                            mutate(year = sapply(strsplit(year_quarter,"-"),`[`, 1))%>%
                            group_by(id,year)%>%
                            summarise(yearly_total = sum(value))

The result looks like this:

> df
 Source: local data frame [12 x 3]
 Groups: id [?]

       id  year yearly_total
      <int> <chr>        <dbl>
 1      1  2007      792.488
 2      1  2008      858.743
 3      2  2007     2474.200
 4      2  2008     2737.228
 5      3  2007      186.700
 6      3  2008      156.500
 7      4  2007      556.764
 8      4  2008      587.154
 9      5  2007     1738.146
 10     5  2008     1871.737
 11     6  2007       45.535
 12     6  2008       63.955

id=1 refers to row=1 of your data frame and 792.488 is the colsum of the four quarters of 2007 and so on.

This is the edited version:

library(tidyr)
library(dplyr)
df <- data%>%gather(year_quarter,value,-Country,-Variable)%>%
                   mutate(year = sapply(strsplit(year_quarter,"-"),`[`, 1))%>%
                   group_by(Variable,year)%>%
                   summarise(yearly_total = sum(value))%>%
                   spread(year,yearly_total)


df2 <- data%>%select(Variable,Country)%>%
                  left_join(df,"Variable")

The output is:

> df2
  # A tibble: 6 × 4
   Variable      Country   `2007`   `2008`
     <chr>        <chr>    <dbl>    <dbl>
1     var1 AR:Argentina  792.488  858.743
2     var2 AR:Argentina 2474.200 2737.228
3     var3 AR:Argentina  186.700  156.500
4     var4 AR:Argentina  556.764  587.154
5     var5 AR:Argentina 1738.146 1871.737
6     var6 AR:Argentina   45.535   63.955

I was able to get it in the first format. The left_join is required to get the Country column.

Also, this is what the data looks like when I import it in R:

> data
# A tibble: 6 × 10
     Variable      Country `2007-Q1` `2007-Q2` `2007-Q3` `2007-Q4` `2008-Q1` `2008-Q2` `2008-Q3`
      <chr>        <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1     var1 AR:Argentina   191.298   196.785   196.918   207.487   209.596   219.171   216.852
 2     var2 AR:Argentina   594.670   606.157   620.783   652.590   662.784   663.191   676.188
 3     var3 AR:Argentina    49.600    47.500    45.200    44.400    41.700    38.700    36.800
 4     var4 AR:Argentina   135.971   138.885   137.005   144.903   144.757   149.803   146.492

ok so each row represents a different variable (I have reflected my sample data to reflect this). Is there any way to maintain that format? Also I have more than one country - hundreds of them actually. Is there anyway to adapt this solution to return a data frame of the same format? So basically it would return something like this: `Variable Country 2007 2008` `Var1 AR:Argentina SUMXX SUMXX` Or perhaps even: `Country Year Var1 Var2` `AR:Argentina 2007 SUMXX SUMYY` `AR:Argentina 2008 SUMXX SUMYY`. — Atom Vayalinkal, Jun 30 '17 at 15:46
Couldn't represent it properly in the above comment but please check my edit to the question. Thank you so much for your help — Atom Vayalinkal, Jun 30 '17 at 15:50
When I attempt this solution, I keep getting: `Error in overscope_eval_next(overscope, expr) : object 'X1' not found` — Atom Vayalinkal, Jun 30 '17 at 16:16
@jackson5: X1 is the column header when I import the data frame you originally posted. Since the 'AR: Argentina' column did not have a header R automatically chose X1 as the header. — P.R, Jun 30 '17 at 17:29

Summing every 4 columns together (convert quarterly data to annual on a large dataset)

2 Answers2