Restructuring Market Data in R

Question

I am very new to R and to stackoverflow. My data is being read into R as a csv file. I have figured out how to restructure currency 1 by itself within R, however, I am working with 900+ columns of data and need a way of looping the R script to apply what I have done to columns 1 through 7 to the other 900 columns.

Currently my data looks like this:

Currency 1                                                     Blank    Currency 2
Date        Contract    Last    Open   High    Low   Volume    Column   Date        Contract    Last    Open   High    Low   Volume
10/10/2012   Dec        100     101     105    99    20000     
10/11/2012   Dec        101     102     106    98    20100     
10/12/2012   Jan        102     103     107    97    20120

As you can see the data is sent to me horizontally. With a blank column in between each currency and I need the data stacked on top of each other.

I would like the data to look like this:

Date        Contract    Last    Open   High    Low   Volume    Market
10/10/2012   Dec        100     101     105    99    20000     Currency 1
10/11/2012   Dec        101     102     106    98    20100     Currency 1
10/12/2012   Jan        102     103     107    97    20120     Currency 1
10/10/2012   Dec        50      52      49     99    20530     Currency 2
10/11/2012   Dec        53      56      43     98    24300     Currency 2
10/12/2012   Jan        56      52      48     97    22320     Currency 2

Welcome to Stack Overflow. Part of the standard 'welcome pack' is to suggest new users read [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) very useful post on how to ask a good question that others will be quick to answer. In particular, read the parts about using `dput` to attach data and (if you can't use your real data) the parts on generating 'fake' data. In your case you have several hundred columns, but if you can give us a subset e.g. 2 sets of columns and a dozen rows, that would make life easier for us. You'll get an answer. Good luck. — SlowLearner, Jan 26 '13 at 01:27

score 2 · Accepted Answer · answered Jan 26 '13 at 05:29

If I understand correctly, and if your source data really are nicely formatted you might be able to do something like the following. Here, I'm linking to a csv with three sets of currencies which replicates what I think your source data look like.

First, read the file in using read.csv but skip the first row. Use check.names = FALSE so that duplicate column names are allowed.

temp <- read.csv("http://ideone.com/plain/t3cGcA", 
                 header = TRUE, skip = 1,
                 check.names = FALSE)
temp
#         Date Contract Last Open High Low Volume          Date
# 1 10/10/2012      Dec  100  101  105  99  20000 NA 10/10/2012
# 2 10/11/2012      Dec  101  102  106  98  20100 NA 10/11/2012
# 3 10/12/2012      Jan  102  103  107  97  20120 NA 10/12/2012
#   Contract Last Open High Low Volume
# 1      Dec   50   52   49  99  20530
# 2      Dec   53   56   43  98  24300
# 3      Jan   56   52   48  97  22320
#   structure(c("NA", "NA", "NA"), class = "AsIs")       Date Contract
# 1                                             NA 10/10/2012      Dec
# 2                                             NA 10/11/2012      Dec
# 3                                             NA 10/12/2012      Jan
#   Last Open High Low Volume
# 1  500  501  605  99  20000
# 2  600  502  606  98  20100
# 3  700  503  607  97  20120

Second---and here is one assumption of the tidiness of your dataset---use seq to create a vector of where your blank columns are. From this, if our assumption of tidiness is correct, you can use simple math to determine the start (vector value minus 7) and end indexes (vector value minus 1) of each currency.

myblankcols <- seq(1, ncol(temp), by=8) + 7
myblankcols
# [1]  8 16 24

Using the simple math mentioned above, create a list of the subsets of each currency, and add names to the list. You can get the names by re-reading just the first line of the file as a csv and dropping all the NA values.

tempL <- lapply(seq_along(myblankcols), 
                function(x) temp[(myblankcols[x] - 7):(myblankcols[x] - 1)])
NamesTempL <- read.csv("http://ideone.com/plain/t3cGcA", 
                       header = FALSE, nrows = 1)
names(tempL) <- NamesTempL[!is.na(NamesTempL)]
tempL
# $`Currency 1`
#         Date Contract Last Open High Low Volume
# 1 10/10/2012      Dec  100  101  105  99  20000
# 2 10/11/2012      Dec  101  102  106  98  20100
# 3 10/12/2012      Jan  102  103  107  97  20120
# 
# $`Currency 2`
#         Date Contract Last Open High Low Volume
# 1 10/10/2012      Dec   50   52   49  99  20530
# 2 10/11/2012      Dec   53   56   43  98  24300
# 3 10/12/2012      Jan   56   52   48  97  22320
# 
# $`Currency 3`
#         Date Contract Last Open High Low Volume
# 1 10/10/2012      Dec  500  501  605  99  20000
# 2 10/11/2012      Dec  600  502  606  98  20100
# 3 10/12/2012      Jan  700  503  607  97  20120

I am usually tempted to stop at this point, because I find lists convenient for many purposes. But, it's equally easy to convert it to a single data.frame. This is also one of the reasons to make sure you use check.names = FALSE in the first step: if all the columns have the same name, then there will be no problem rbinding them together.

do.call(rbind, tempL)
#                    Date Contract Last Open High Low Volume
# Currency 1.1 10/10/2012      Dec  100  101  105  99  20000
# Currency 1.2 10/11/2012      Dec  101  102  106  98  20100
# Currency 1.3 10/12/2012      Jan  102  103  107  97  20120
# Currency 2.1 10/10/2012      Dec   50   52   49  99  20530
# Currency 2.2 10/11/2012      Dec   53   56   43  98  24300
# Currency 2.3 10/12/2012      Jan   56   52   48  97  22320
# Currency 3.1 10/10/2012      Dec  500  501  605  99  20000
# Currency 3.2 10/11/2012      Dec  600  502  606  98  20100
# Currency 3.3 10/12/2012      Jan  700  503  607  97  20120

I'll definitely stop here, but from here, you probably want to convert your "Date" column to actual columns, and perhaps convert the row names ("Currency 1.1", "Currency 1.2", and so on) to a column in your data.frame.

Thank you very much for you help Ananda! This worked great. I was going about this the completely wrong way. — Tim, Jan 28 '13 at 21:11

Restructuring Market Data in R

1 Answers1