Excel or R: Preparing time series from multiple sources?

Question

Lately I often had to handle time series data from multiple .csv sources in the same analysis. Let's assume for simplicity that all series are regular quarterly series (no missing values in between). Typically the original .csv data contains a date column plus 1-3 variables. Unfortunately the series are not of equal length across .csv files.

I started to organize my dataset in R and ended up with a big mess containing lots of window()commands. Plus I had to concatenate NAs and original series before turning them into ts()objects because I found concatenating (multivariate) ts()objects so counter-intuitive. Note that the reason why I added NAs is that I wanted to all series to be of the same length. Of course I could have trimmed the longer ones, but then I´d eventually loose observations when not using shorter series.

I thought about writing a function that reads .csv files and uses it's date column to create ts()objects and maybe with another function merge all the single series to create a multivariate series containing NAs when data is missing. I found myself switching data types all the time, reading through the ts and zoo manuals – i just could not believe it was that complex.

I really thought this problem is really common and thought about the preparations in excel.. I mean I really hate excel, but this time I wonder what more experienced useRs do? R or Excel?

EDIT: added some exemplary data (need to aggregate daily data) file1:

27.05.11;5965.95
26.05.11;5947.06
25.05.11;5942.82
24.05.11;5939.98

file2 (without date col, but i know start and frequency)

Germany;Switzerland;USA;OECDEurope
69,90974;61,8241;55,60966;64,96157
67,0394;62,18966;56,47361;64,15152
70,56651;63,6347;56,87237;65,43568

file3:

1984-04-01,33.3238396624473
1984-07-01,63.579833082501
1984-10-01,35.8375401560349

I admit exemplary data does help to illustrate the question, but it`s rather a best practice type of question adressing more experienced users than myself. How do you prepare your data for multivariate ts analysis ?

I mean I do understand that a regular series does not need date column if start and frequency are known, but part of the question rather is: is there a standard way of create `ts` or `zoo` objects from csv with date columns? — Matt Bannert, Jun 05 '11 at 10:12
You may have to give some examples of your data files, as `read.zoo` should handle some of these issues. — Henry, Jun 05 '11 at 11:02

Joshua Ulrich · Accepted Answer · 2011-06-06T23:00:43.927

I do this in R all the time. You may find it easier to do in Excel but if your data change, you have to do the same process again. Using R makes it much easier to update and reproduce your results.

Dealing with monthly or quarterly frequencies are made significantly easier with zoo's yearmon and yearqtr index classes, respectively. Once you have your data in zoo objects with yearqtr indexes, all you have to do is merge all the objects.

Here's your sample data:

Lines1 <-
"27.05.11;5965.95
26.05.11;5947.06
25.05.11;5942.82
24.05.11;5939.98"
f1 <- read.csv2(con <- textConnection(Lines1), header=FALSE)
close(con)

Lines2 <-
"Germany;Switzerland;USA;OECDEurope
69,90974;61,8241;55,60966;64,96157
67,0394;62,18966;56,47361;64,15152
70,56651;63,6347;56,87237;65,43568"
f2 <- read.csv2(con <- textConnection(Lines2), header=TRUE)
close(con)

Lines3 <-
"1984-04-01,33.3238396624473
1984-07-01,63.579833082501
1984-10-01,35.8375401560349"
f3 <- read.csv(con <- textConnection(Lines3), header=FALSE)
close(con)

The example below assumes the starting date for the first file is 1984Q2 and the starting date for the second file is 1984Q4. You can see that merge.zoo takes care of aligning all the dates for you. After everything is aligned in your zoo object, you can use the as.ts method to create a mts object.

z1 <- zoo(f1[,-1], as.Date(f1[,1], "%d.%m.%y"))
z2 <- zoo(f2, as.yearqtr("1984Q4")+(seq_len(NROW(f1))-1)/4)
z3 <- zoo(f3[,-1], as.yearqtr(as.Date(f3[,1])))

library(xts)
# Use xts::apply.quarterly to aggregate series with higher periodicity.
# Here I just take the last obs but you could use another function (e.g. mean).
z1 <- apply.quarterly(z1, last)
index(z1) <- as.yearqtr(index(z1))  # convert the index to yearqtr

(Z <- merge(z1,z2,z3))
#         z1      Germany  Switzerland USA      OECDEurope z3
# 1984 Q2 <NA>    <NA>     <NA>        <NA>     <NA>       33.32383
# 1984 Q3 <NA>    <NA>     <NA>        <NA>     <NA>       63.57983
# 1984 Q4 <NA>    69.90974 61.8241     55.60966 64.96157   35.83754
# 1985 Q1 <NA>    67.0394  62.18966    56.47361 64.15152   <NA>
# 1985 Q2 <NA>    70.56651 63.6347     56.87237 65.43568   <NA>
# 1985 Q3 <NA>    69.90974 61.8241     55.60966 64.96157   <NA>
# 2011 Q2 5965.95 <NA>     <NA>        <NA>     <NA>       <NA>

# Note that ts will create an object with a observation for every period,
# even if all the columns are missing.
TS <- as.ts(Z)

+1 I´d take R over Excel 10 out of 10 times, but somehow I got lost here and lost confidence. Feelin good again! thx. — Matt Bannert, Jun 06 '11 at 02:57
In what format do you save the resulting mts typically in order to not loose attributes? — Matt Bannert, Jun 06 '11 at 08:42
Note that Joshua's code does not aggregate the first series. The first column is a (daily) date column, too – even if it's an ugly euro format.. — Matt Bannert, Jun 06 '11 at 10:06
Whoops, I thought all the series were quarterly. I didn't know you wanted/needed to aggregate some. See my edit for how to do that. — Joshua Ulrich, Jun 06 '11 at 22:54

score 7 · Answer 2 · answered Jun 05 '11 at 14:36

My strategy for problems of this type is:

Read each data source into a standard data.frame
Clean each data.frame, i.e. get data into desired format, process missing values, etc.
Merge or join into standard data.frame
Perform any aggregate data cleanup, e.g. adding blank lines, removing duplicates, etc.
Only then pass data to the next step, such as conversion to ts object, plot it, etc.

Using your example data:

v1 <- "27.05.11;5965.95
26.05.11;5947.06
25.05.11;5942.82
24.05.11;5939.98"

v2 <- "Germany;Switzerland;USA;OECDEurope
69,90974;61,8241;55,60966;64,96157
67,0394;62,18966;56,47361;64,15152
70,56651;63,6347;56,87237;65,43568"


v3 <- "1984-04-01,33.3238396624473
1984-07-01,63.579833082501
1984-10-01,35.8375401560349"

# Read and clean data
dat1 <- read.table(textConnection(v1), header=FALSE, sep=";", dec=".")
names(dat1) <- c("date", "V1")
dat1$date <- as.Date(dat1$date, format="%d.%m.%y")
dat1

dat2 <- read.table(textConnection(v2), header=TRUE, sep=";", dec=",")
dat2$date <- seq(as.Date("2011/1/1"), by="3 months", length.out=3)
dat2

dat3 <- read.table(textConnection(v3), header=FALSE, sep=",", dec=".")
names(dat3) <- c("date", "V2")
dat3$date <- as.Date(dat3$date)
dat3

# Merge separate data.frames.
# I use join() in package plyr, you may wish to use merge(), rbind.fill, etc
library(plyr)
join(join(dat1, dat2, type="full"), dat3, type="full")

The results:

         date      V1  Germany Switzerland      USA OECDEurope       V2
1  2011-05-27 5965.95       NA          NA       NA         NA       NA
2  2011-05-26 5947.06       NA          NA       NA         NA       NA
3  2011-05-25 5942.82       NA          NA       NA         NA       NA
4  2011-05-24 5939.98       NA          NA       NA         NA       NA
5  2011-01-01      NA 69.90974    61.82410 55.60966   64.96157       NA
6  2011-04-01      NA 67.03940    62.18966 56.47361   64.15152       NA
7  2011-07-01      NA 70.56651    63.63470 56.87237   65.43568       NA
8  1984-04-01      NA       NA          NA       NA         NA 33.32384
9  1984-07-01      NA       NA          NA       NA         NA 63.57983
10 1984-10-01      NA       NA          NA       NA         NA 35.83754

Excel or R: Preparing time series from multiple sources?

2 Answers2

Linked

Related