0

Is there an efficient way to unwrap rows in a r data frame to columns? This is an re-occurring problem I run into, when my data I get from an SQL script should be unwrapped into several columns. For example for timeseries forecasting where I don't have an rnn and I instead aim to use standard neural nets. Instead of having recurring nets I plan to flatten the data beforehand so that the net receives the row for t-1, t-2, t-3 etc. See my fancy paint-job below

enter image description here

Basically, for each row I want to concatenate n previous rows on the right, where n depends on how many previous times steps we want to use to predict values in the current row.

Mostly I'm looking for a smart and efficient way of doing it, preferably with existing libraries/functions in r. I can program it in several languages, but aim to find an r solution. I've done this in Java before (which was quite fast) and in r (which took ~forever=1 hour).

Currently I have a bit over 3000 rows and 10 columns. If for example I want to use 15 previous time steps we get 10+15*10 columns. Whether 15 is a good choice I don't know, and therefore I need to be able to quickly test e.g. n= 5, 10, 15, 20, 25, 50.

Edit

To be honest, I'm a beginner when it comes to r, therefore I ask for help instead of yet again coding my custom function for this.

dput gives:

structure(list(Date = structure(c(10L, 9L, 8L, 7L, 6L, 5L, 4L, 
3L, 2L, 1L), .Label = c("6/10/2016", "6/13/2016", "6/14/2016", 
"6/15/2016", "6/16/2016", "6/17/2016", "6/20/2016", "6/21/2016", 
"6/22/2016", "6/23/2016"), class = "factor"), Bid = c(5.04, 4.97, 
4.96, 4.93, 4.84, 5.09, 5.05, 4.96, 5.08, 5), Ask = c(5.04, 4.97, 
4.96, 4.94, 4.84, 5.09, 5.06, 4.97, 5.08, 5.01), Opening.price = c(4.98, 
4.97, 4.95, 4.94, 4.92, 5.01, 5.01, 5.01, 4.95, 5.05), High.price = c(5.07, 
4.98, 4.97, 4.99, 4.93, 5.14, 5.06, 5.1, 5.13, 5.09), Low.price = c(4.94, 
4.91, 4.89, 4.92, 4.81, 5.01, 4.94, 4.94, 4.89, 4.97), Closing.price = c(5.04, 
4.97, 4.95, 4.94, 4.86, 5.08, 5.05, 4.94, 5.06, 4.98), Average.price = c(5.02, 
4.96, 4.94, 4.94, 4.87, 5.08, 5.01, 5, 5.01, 5.01), Total.volume = c(18997216L, 
17969939L, 21430529L, 20725035L, 66884495L, 32994371L, 24600829L, 
24439514L, 26540825L, 24756699L), Turnover = c(95382241.29, 89106913.2, 
105823382.96, 102379207.58, 325592595.95, 167697936.93, 123243137.11, 
122189815.88, 133063486.77, 124080799.95), Trades = c(9220L, 
9317L, 10075L, 10230L, 16446L, 13544L, 11888L, 10923L, 11981L, 
9696L)), .Names = c("Date", "Bid", "Ask", "Opening.price", "High.price", 
"Low.price", "Closing.price", "Average.price", "Total.volume", 
"Turnover", "Trades"), class = "data.frame", row.names = c(NA, 
-10L))

The result when n = 2 (append 2 previous timesteps on the right):

structure(list(Date = structure(c(8L, 7L, 6L, 5L, 4L, 3L, 2L, 
1L), .Label = c("6/14/2016", "6/15/2016", "6/16/2016", "6/17/2016", 
"6/20/2016", "6/21/2016", "6/22/2016", "6/23/2016"), class = "factor"), 
Bid = c(5.04, 4.97, 4.96, 4.93, 4.84, 5.09, 5.05, 4.96), 
Ask = c(5.04, 4.97, 4.96, 4.94, 4.84, 5.09, 5.06, 4.97), 
Opening.price = c(4.98, 4.97, 4.95, 4.94, 4.92, 5.01, 5.01, 
5.01), High.price = c(5.07, 4.98, 4.97, 4.99, 4.93, 5.14, 
5.06, 5.1), Low.price = c(4.94, 4.91, 4.89, 4.92, 4.81, 5.01, 
4.94, 4.94), Closing.price = c(5.04, 4.97, 4.95, 4.94, 4.86, 
5.08, 5.05, 4.94), Average.price = c(5.02, 4.96, 4.94, 4.94, 
4.87, 5.08, 5.01, 5), Total.volume = c(18997216L, 17969939L, 
21430529L, 20725035L, 66884495L, 32994371L, 24600829L, 24439514L
), Turnover = c(95382241.29, 89106913.2, 105823382.96, 102379207.58, 
325592595.95, 167697936.93, 123243137.11, 122189815.88), 
Trades = c(9220L, 9317L, 10075L, 10230L, 16446L, 13544L, 
11888L, 10923L), X1_Bid = c(4.97, 4.96, 4.93, 4.84, 5.09, 
5.05, 4.96, 5.08), X1_Ask = c(4.97, 4.96, 4.94, 4.84, 5.09, 
5.06, 4.97, 5.08), X1_Opening.price = c(4.97, 4.95, 4.94, 
4.92, 5.01, 5.01, 5.01, 4.95), X1_High.price = c(4.98, 4.97, 
4.99, 4.93, 5.14, 5.06, 5.1, 5.13), X1_Low.price = c(4.91, 
4.89, 4.92, 4.81, 5.01, 4.94, 4.94, 4.89), X1_Closing.price = c(4.97, 
4.95, 4.94, 4.86, 5.08, 5.05, 4.94, 5.06), X1_Average.price = c(4.96, 
4.94, 4.94, 4.87, 5.08, 5.01, 5, 5.01), X1_Total.volume = c(17969939L, 
21430529L, 20725035L, 66884495L, 32994371L, 24600829L, 24439514L, 
26540825L), X1_Turnover = c(89106913.2, 105823382.96, 102379207.58, 
325592595.95, 167697936.93, 123243137.11, 122189815.88, 133063486.77
), X1_Trades = c(9317L, 10075L, 10230L, 16446L, 13544L, 11888L, 
10923L, 11981L), X2_Bid = c(4.96, 4.93, 4.84, 5.09, 5.05, 
4.96, 5.08, 5), X2_Ask = c(4.96, 4.94, 4.84, 5.09, 5.06, 
4.97, 5.08, 5.01), X2_Opening.price = c(4.95, 4.94, 4.92, 
5.01, 5.01, 5.01, 4.95, 5.05), X2_High.price = c(4.97, 4.99, 
4.93, 5.14, 5.06, 5.1, 5.13, 5.09), X2_Low.price = c(4.89, 
4.92, 4.81, 5.01, 4.94, 4.94, 4.89, 4.97), X2_Closing.price = c(4.95, 
4.94, 4.86, 5.08, 5.05, 4.94, 5.06, 4.98), X2_Average.price = c(4.94, 
4.94, 4.87, 5.08, 5.01, 5, 5.01, 5.01), X2_Total.volume = c(21430529L, 
20725035L, 66884495L, 32994371L, 24600829L, 24439514L, 26540825L, 
24756699L), X2_Turnover = c(105823382.96, 102379207.58, 325592595.95, 
167697936.93, 123243137.11, 122189815.88, 133063486.77, 124080799.95
), X2_Trades = c(10075L, 10230L, 16446L, 13544L, 11888L, 
10923L, 11981L, 9696L)), .Names = c("Date", "Bid", "Ask", 
"Opening.price", "High.price", "Low.price", "Closing.price", 
"Average.price", "Total.volume", "Turnover", "Trades", "X1_Bid", 
"X1_Ask", "X1_Opening.price", "X1_High.price", "X1_Low.price", 
"X1_Closing.price", "X1_Average.price", "X1_Total.volume",  "X1_Turnover", 
"X1_Trades", "X2_Bid", "X2_Ask", "X2_Opening.price", "X2_High.price", 
"X2_Low.price", "X2_Closing.price", "X2_Average.price", "X2_Total.volume", 
"X2_Turnover", "X2_Trades"), class = "data.frame", row.names = c(NA, 
-8L))

edit 2

Here's somebody who's haven a question about rnns and how they are different from classic nets when the previous time points are input directly. rnn vs classic net Part 1 of the question explains how I want to get x(t-1), x(t-2)... on the right of the data I want to predict, for each row.

Community
  • 1
  • 1
user979899
  • 141
  • 1
  • 2
  • 14
  • 1
    [Reshape data from long to wide format](http://stackoverflow.com/questions/5890584) ? – zx8754 Jun 28 '16 at 16:14
  • Please use `dput` to show a small example instead of images. – akrun Jun 28 '16 at 16:18
  • `rbind(unlist(mtcars[-1,]), unlist(mtcars[-nrow(mtcars),]))`? – alistaire Jun 28 '16 at 16:19
  • I couldn't figure out how to use reshape for unwrapping in such a way that I could specify how many previous rows to concatenate on the right, i.e. how many previous time steps are of interest. The rbind solution did unwrap the data in some way, but it isn't clear how to specify the number of rows to concatenate. In any case, the function should not really "reshape" the data, it is more like duplication of information just to make each row an input for the net. – user979899 Jun 28 '16 at 18:23

1 Answers1

0
# I put your dataframe into x. 
# Then I extract just the numeric columns into x1,
# because operations on matrices tend to be faster 
# than operations on dataframes.
    x1<-as.matrix(x[,-1])
# Now I increase the size to a 3000 row matrix to match your comment
    x2<-x1[sample(1:10,3000,replace=TRUE),]
# record some sizes:
    N<-50; n<-nrow(x2); J<-1:(n-(N-1))
# Create the new numeric matrix:
    z<-x2[J,]; for (i in 1:(N-1)){z<-cbind(z,x2[i+J,])}
#This took an imperceptible length of time (< 1 second) on my MacBook Pro
    dim(z)
    [1] 2951  500
# Note that looping is inelegant at times but here it's just fine.
# appending columns to a matrix is fast because it goes 
# in the order in which matrices are stored.

# You could run your test cases with z[,1:50], z[,1:100] ... 
# to test the right number of past entries
C. Kent
  • 26
  • 3
  • Neat! Thank you! It seems cbind is what I've been looking for. I did not know it could be used like that (still need to figure why it does what it does). Still need to append an index for all the headers make a distinction between them, but that was not in the request. – user979899 Jun 30 '16 at 18:16