2

I have a (large) dataframe of the form:

Variable    Country         2007-Q1 2007-Q2 2007-Q3 2007-Q4 2008-Q1 2008-Q2 2008-Q3 2008-Q4
Var1        AR:Argentina    69.8    67.3    65  63.6    60.4    56.6    54.4    57.3
Var2        AR:Argentina    191.298 196.785 196.918 207.487 209.596 219.171 216.852 213.124
Var3        AR:Argentina    594.67  606.157 620.783 652.59  662.784 663.191 676.188 735.065
Var4        AR:Argentina    49.6    47.5    45.2    44.4    41.7    38.7    36.8    39.3
Var5        AR:Argentina    135.971 138.885 137.005 144.903 144.757 149.803 146.492 146.102
Var6        AR:Argentina    422.679 427.808 431.909 455.75  457.752 453.288 456.791 503.906
Var8        AR:Argentina    9.657   10.755  11.856  13.267  14.47   16.523  16.727  16.235

Essentially, every row has 4 columns of data for each year, sorted by quarter. I want to turn this into annual data.

One way to do this is to simply sum up every 4 columns (so 2008-Q1:2008-Q4 would be summed, for example). Another way, I suppose, would be to filter columns that share a common year in (2007-** or something) and then run RowSums() on them but this sounds like it would be much more inefficient.

I'm hoping to get back a data frame that looks like:

Variable Country 2007 2008
Var1 AR:Argentina SUMXX SUMXX

Or even better:

Country Year Var1 Var2
AR:Argentina 2007 SUMXX SUMYY
AR:Argentina 2008 SUMXX SUMYY.

The second format would be much preferred but the first format is fine as well. The main thing is that I need to be able to select the data for a variable, for a country, for all years -relatively easily. If I can select for all countries all years for any given variable - even better (second format).

Is there any easy way to accomplish this, aside from running a nested loop etc?

Atom Vayalinkal
  • 2,642
  • 7
  • 29
  • 37

2 Answers2

2

How about transposing data t(df) to convert columns into rows and them use moving average

Calculating moving average

library(TTR)
sma(AR:Argentina,4)

for faster moving averages you may want to see Rccp options at link above

another strategy is

a=ncol(df)

then summing

df[1:4]
df[5:8]

till

df[a-3,a]

y=NULL
for (i in 1:a-3){
y[i]=sum(df[i:i+3]
} 

Pseudocode
Ajay Ohri
  • 3,382
  • 3
  • 30
  • 60
2

I don't really understand what 'AR:Argentina' means but I was able to get the column sums you want. I added an id column to identify the rows.

library(tidyr)
library(dplyr)
df <- data%>%gather(year_quarter,value,-X1,-id)%>%
                            mutate(year = sapply(strsplit(year_quarter,"-"),`[`, 1))%>%
                            group_by(id,year)%>%
                            summarise(yearly_total = sum(value))

The result looks like this:

> df
 Source: local data frame [12 x 3]
 Groups: id [?]

       id  year yearly_total
      <int> <chr>        <dbl>
 1      1  2007      792.488
 2      1  2008      858.743
 3      2  2007     2474.200
 4      2  2008     2737.228
 5      3  2007      186.700
 6      3  2008      156.500
 7      4  2007      556.764
 8      4  2008      587.154
 9      5  2007     1738.146
 10     5  2008     1871.737
 11     6  2007       45.535
 12     6  2008       63.955

id=1 refers to row=1 of your data frame and 792.488 is the colsum of the four quarters of 2007 and so on.

This is the edited version:

library(tidyr)
library(dplyr)
df <- data%>%gather(year_quarter,value,-Country,-Variable)%>%
                   mutate(year = sapply(strsplit(year_quarter,"-"),`[`, 1))%>%
                   group_by(Variable,year)%>%
                   summarise(yearly_total = sum(value))%>%
                   spread(year,yearly_total)


df2 <- data%>%select(Variable,Country)%>%
                  left_join(df,"Variable")

The output is:

> df2
  # A tibble: 6 × 4
   Variable      Country   `2007`   `2008`
     <chr>        <chr>    <dbl>    <dbl>
1     var1 AR:Argentina  792.488  858.743
2     var2 AR:Argentina 2474.200 2737.228
3     var3 AR:Argentina  186.700  156.500
4     var4 AR:Argentina  556.764  587.154
5     var5 AR:Argentina 1738.146 1871.737
6     var6 AR:Argentina   45.535   63.955

I was able to get it in the first format. The left_join is required to get the Country column.

Also, this is what the data looks like when I import it in R:

> data
# A tibble: 6 × 10
     Variable      Country `2007-Q1` `2007-Q2` `2007-Q3` `2007-Q4` `2008-Q1` `2008-Q2` `2008-Q3`
      <chr>        <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1     var1 AR:Argentina   191.298   196.785   196.918   207.487   209.596   219.171   216.852
 2     var2 AR:Argentina   594.670   606.157   620.783   652.590   662.784   663.191   676.188
 3     var3 AR:Argentina    49.600    47.500    45.200    44.400    41.700    38.700    36.800
 4     var4 AR:Argentina   135.971   138.885   137.005   144.903   144.757   149.803   146.492
P.R
  • 300
  • 1
  • 7
  • ok so each row represents a different variable (I have reflected my sample data to reflect this). Is there any way to maintain that format? Also I have more than one country - hundreds of them actually. Is there anyway to adapt this solution to return a data frame of the same format? So basically it would return something like this: `Variable Country 2007 2008` `Var1 AR:Argentina SUMXX SUMXX` Or perhaps even: `Country Year Var1 Var2` `AR:Argentina 2007 SUMXX SUMYY` `AR:Argentina 2008 SUMXX SUMYY`. – Atom Vayalinkal Jun 30 '17 at 15:46
  • Couldn't represent it properly in the above comment but please check my edit to the question. Thank you so much for your help – Atom Vayalinkal Jun 30 '17 at 15:50
  • When I attempt this solution, I keep getting: `Error in overscope_eval_next(overscope, expr) : object 'X1' not found` – Atom Vayalinkal Jun 30 '17 at 16:16
  • @jackson5: X1 is the column header when I import the data frame you originally posted. Since the 'AR: Argentina' column did not have a header R automatically chose X1 as the header. – P.R Jun 30 '17 at 17:29