Transactions reshaping from long to wide, joining Buy and Sell dataframes

Question

I have buy and sell transactions in long format and I want to transform it to wide format. Look at example:

enter image description here

For every BUY transaction of some ticker must exist SELL transaction of the same ticker which closes position. If SELL transaction don't exists or shares count becomes zero then put NA at sell price.

Explanation:

We bought 100 shares of AIG ticker at price 34.56. Next we must find exit (SELL) transaction for BUY transaction of the same ticker AIG. This transaction exists below with 600 shares. So we close our AIG BUY transaction with 100 shares, decreasing shares of SELL transaction from 600 to 500 and write this transaction in wide format with buy price and sell price.

Next transaction is GOOG. For this ticker we found two SELL transactiosn and write them all in wide format, but 100 shares was unsold, so we put this transaction as "unfinished" with NA in sell price.

If necessary, I can put algorithm in pseudocode later. But I hope, my explanation is clear.

My question is following: It's easy to do it in R with clean and vectorized code? This algorithm is pretty easy to program in imperative-paradigm languages, like C++. But with R I have troubles.

EDIT 1: Added input and output data frames for R:

inputDF1 <- data.frame(Ticker = c("AIG", "GOOG", rep("AIG", 3), rep("GOOG", 2), rep("NEM", 3)), Side = c(rep("BUY", 4), rep("SELL", 3), "BUY", rep("SELL", 2)), Shares = c(100, 400, 200, 400, 600, 200, 100, 100, 50, 50), Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461, 45, 56, 78))
inputDF2 <- data.frame(Ticker = c(rep("AIG", 3), rep("GOOG", 3)), Side = c(rep("BUY", 2), "SELL", "BUY", rep("SELL", 2)), Shares = c(100, 100, 200, 300, 200, 100), Price = c(34, 35, 36, 457, 458, 459))
inputDF3 <- data.frame(Ticker = c(rep("AIG", 3), rep("GOOG", 3)), Side = c(rep("BUY", 2), "SELL", "BUY", rep("SELL", 2)), Shares = c(100, 100, 100, 300, 100, 100), Price = c(34, 35, 36, 457, 458, 459))

outputDF1 <- data.frame(Ticker = c("AIG", rep("GOOG", 3), rep("AIG", 3), rep("NEM", 2)), Side = rep("BUY", 9), Shares = c(100, 200, 100, 100, 200, 300, 100, 50, 50), BuyPrice = c(34.56, 457, 457, 457, 28.56, 24.65, 24.65, 45, 45), SellPrice = c(30.02, 460, 461, NA, 30.02, 30.02, NA, 56, 78))
outputDF2 <- data.frame(Ticker = c(rep("AIG", 2), rep("GOOG", 2)), Side = rep("BUY", 4), Shares = c(100, 100, 200, 100), BuyPrice = c(34, 35, 457, 457), SellPrice = c(36, 36, 458, 459))
outputDF3 <- data.frame(Ticker = c(rep("AIG", 2), rep("GOOG", 3)), Side = rep("BUY", 5), Shares = rep(100, 5), BuyPrice = c(34, 35, rep(457, 3)), SellPrice = c(36, NA, 458, 459, NA))

EDIT 2: Updated example and input/output data for R

This looks familiar. Didn't you just ask this? Yes, you did. If this is different from the linked duplicate, please edit to show how. — Matthew Lundberg, Jun 01 '14 at 14:50
@MatthewLundberg Yes, you're right. This is more complicated version now. — Eldar Agalarov, Jun 01 '14 at 14:52
Oh yes, you're right. You need to match the volumes between the buy and sell sides. I'll retract the duplicate vote, but please edit the question to include data that R can read. — Matthew Lundberg, Jun 01 '14 at 14:54
I am guessing you would want this to be flexible to a situation where the number of shares sold was less than the first lot bought? (e.g. First buy 100 shares of AIG at 34.56, then sell 50 at 37 and sell the other 50 at 39) — Mike.Gahan, Jun 01 '14 at 15:40
@Mike.Gahan Yes, then must be 2 wide rows: [AIG, BUY, 50, 34.56, 37] and [AIG, BUY, 50, 34.56, 39]. — Eldar Agalarov, Jun 01 '14 at 15:45

score 3 · Answer 1 · edited May 23 '17 at 12:16

Original Answer (while the question was still in development and I wasn't paying close enough attention)

Use dcast from reshape2:

> t <- c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG")
> sd <- c(rep("BUY", 4), rep("SELL", 3))
> sh <- c(100, 400, 200, 400, 600, 200, 100)
> pr <- c(34.56, 457, 28.56, 24.65, 30.02, 460, 461)
> df <- data.frame(Ticker = t, Side = sd, Shares = sh, Price = pr)
> 
> library(reshape2)
> df
  Ticker Side Shares  Price
1    AIG  BUY    100  34.56
2   GOOG  BUY    400 457.00
3    AIG  BUY    200  28.56
4    AIG  BUY    400  24.65
5    AIG SELL    600  30.02
6   GOOG SELL    200 460.00
7   GOOG SELL    100 461.00
> dcast(df, Ticker*Shares ~ Side, value.var="Price")
  Ticker Shares    BUY   SELL
1    AIG    100  34.56     NA
2    AIG    200  28.56     NA
3    AIG    400  24.65     NA
4    AIG    600     NA  30.02
5   GOOG    100     NA 461.00
6   GOOG    200     NA 460.00
7   GOOG    400 457.00     NA

New Answer

The key sticking point here is that "vector-based" in R is often tied to "functional" (e.g. the apply() family), but a purely functional approach doesn't quite work here, because you have to update the sell list for every (part of each) buy transaction. I really feel like you could do something magical with aggregate or by and a carefully designed function, but the best readable solution that came to me involves a simple for-loop.

Version with `for`

inputDF <- data.frame(Ticker = c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG"), 
                      Side = c(rep("BUY", 4), rep("SELL", 3)), 
                      Shares = c(100, 400, 200, 400, 600, 200, 100), 
                      Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461))
buys <- subset(inputDF,Side=="BUY")
sells <- subset(inputDF,Side=="SELL")
transactions <- NULL

# go through every buy operation
for(i in 1:nrow(buys)){
    ticker <- buys[i,"Ticker"]
    bp <- buys[i,"Price"]
    shares <- buys[i,"Shares"]

    # keep going as long as we can find sellers
    while(shares > 0 & sum(sells[sells$Ticker == ticker,"Shares"]) > 0){
        sp <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Price"]
        if(sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"] > shares){
            shares.sold <- shares
        }else{
            shares.sold <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"]
        }
        shares <- shares - shares.sold
        sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] <- sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] - shares.sold
        transactions <- rbind(transactions,data.frame("Ticker"=ticker
                                                      ,"Side"="BUY"
                                                      ,"Shares"=shares.sold
                                                      ,"BuyPrice"=bp
                                                   ,"SellPrice"=sp))
    }
    # not enough sellers
    if(shares > 0){ 
        transactions <- rbind(transactions,data.frame("Ticker"=ticker
                                                  ,"Side"="BUY"
                                                  ,"Shares"=shares
                                                  ,"BuyPrice"=bp
                                                  ,"SellPrice"="NA"))

    }

}

print(transactions)

Output:

  Ticker Side Shares BuyPrice SellPrice
1    AIG  BUY    100    34.56     30.02
2   GOOG  BUY    200   457.00       460
3   GOOG  BUY    100   457.00       461
4   GOOG  BUY    100   457.00        NA
5    AIG  BUY    200    28.56     30.02
6    AIG  BUY    300    24.65     30.02
7    AIG  BUY    100    24.65        NA

The updating becomes obvious if we try to use the foreach package to automagically parallelize the loop. It quickly becomes apparent that we have a race condition on the sell data frame.

Version with `apply`

There are a couple of inefficiencies in the code above that could be improved upon. The append operation via rbind() isn't terribly efficient and could probably be optimized a bit, either reducing the number of calls to rbind() or eliminating it all together. You could also pack everything into a function and convert it to a call to apply(), which does tend to be faster even for serial apply() because the looping is done at a more optimized level. (The same is true for CPython -- list comprehensions and str.join() are much faster than for loops because they're "more aware" of the total size of the operation and because they're written in optimized C.) Here's a first attempt -- note that we use do.call(rbind, list(...)) to simplify the list of small data frames we get back from the original call to apply. This isn't terribly efficient (rbindlist from data.table is significantly faster, see here), but it doesn't have any external dependencies. The list you get back from apply() is actually interesting in its own way -- every element is the list of transactions you needed to do complete one entire buy operation. If you added row names to the buys data frame, then you could call up each set of transactions by name.

inputDF <- data.frame(Ticker = c("AIG", "GOOG", "AIG", "AIG", "AIG", "GOOG", "GOOG"), 
                      Side = c(rep("BUY", 4), rep("SELL", 3)), 
                      Shares = c(100, 400, 200, 400, 600, 200, 100), 
                      Price = c(34.56, 457, 28.56, 24.65, 30.02, 460, 461))
buys <- subset(inputDF,Side=="BUY")
sells <- subset(inputDF,Side=="SELL")
transactions <- NULL

# go through every buy operation
buy.operation <- function(x){
    ticker <- x["Ticker"]
    # apply() converts to matix implicity, and all the elements of a matrix have
    # have the same data type, so everything gets converted to characters
    # thus, we need to convert back
    bp <- as.numeric(x["Price"])
    shares <- as.numeric(x["Shares"])

    # keep going as long as we can find sellers
    while(shares > 0 & sum(sells[sells$Ticker == ticker,"Shares"]) > 0){
        sp <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Price"]
        if(sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"] > shares){
            shares.sold <- shares
        }else{
            shares.sold <- sells[sells$Ticker == ticker & sells$Shares > 0,][1,"Shares"]
        }
        shares <- shares - shares.sold
        sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] <- sells[sells$Shares >= shares & sells$Ticker == ticker,][1,"Shares"] - shares.sold
        transactions <- rbind(transactions,data.frame("Ticker"=ticker
                                                      ,"Side"="BUY"
                                                      ,"Shares"=shares.sold
                                                      ,"BuyPrice"=bp
                                                      ,"SellPrice"=sp))
    }
    # not enough sellers
    if(shares > 0){ 
        transactions <- rbind(transactions,data.frame("Ticker"=ticker
                                                      ,"Side"="BUY"
                                                      ,"Shares"=shares
                                                      ,"BuyPrice"=bp
                                                      ,"SellPrice"="NA"))

    }

    transactions
}

transactions <- do.call(rbind, apply(buys,1,buy.operation) )
# get rid of weird row names
row.names(transactions) <- NULL
print(transactions)

Output:

  Ticker Side Shares BuyPrice SellPrice
1    AIG  BUY    100    34.56     30.02
2   GOOG  BUY    200   457.00       460
3   GOOG  BUY    100   457.00       461
4   GOOG  BUY    100   457.00        NA
5    AIG  BUY    200    28.56     30.02
6    AIG  BUY    400    24.65     30.02

Unfortunately, the final incomplete AIG transaction is missing. I haven't figured out quite yet how to fix that.

Split it up into two dcast operations and use a left-merge via `plyr`. Update coming ... — Livius, Jun 01 '14 at 15:36
Seems good, but solution with for works on 1 and 2 tests, but not on 3-rd. I'll try to check why. — Eldar Agalarov, Jun 01 '14 at 20:16
The question has become a bit of a moving target. Besides, the idea is to address the deeper issues concerning why it's non trivial to write this as a vector/parellel operation and not to write code for somebody else. It appears like one of the loop conditions isn't quite right, but then again you really also need to optimize for the buy and sell prices -- you also need to try to find the appropriate extrema to maximize your profit. The code I posted works on a first-come, first-serve basis. — Livius, Jun 01 '14 at 20:17
@EldarAgalarov I like the self-initiative in the modified comment. :-) — Livius, Jun 01 '14 at 20:17

score 2 · Accepted Answer · edited May 23 '17 at 12:16

Hm, I spent way too much time on this question! Here's my attempt (with data.table).

Since you don't mention anything about your real data dimensions, I've not been able to optimise it any further. It'd be nice if you could run this on your real dataset and write back your findings (reg. speed/scaling).

First we've to split the dataset by Side and perform a join. This is the most straightforward approach. I also see that @Mike.Gahan has attempted along this route as well.

require(data.table)
dt1 <- as.data.table(inputDF1)
d1 <- dt1[Side == "BUY"][, N := .N > 1L, by=Ticker]
d2 <- dt1[Side == "SELL"]
setkey(d2, Ticker)
ans = d2[d1, allow.cartesian=TRUE][, Side := NULL]

Note that allow.cartesian does not perform a cartesian join. It's used very loosely here. Read ?data.table for more info or check this post on what it's for. The join, basically, will be really fast and will scale really well. This is not a limiting step.

We now set the column order and names accordingly:

setcolorder(ans, c("Ticker", "Side.1", "Shares.1", "Shares", "Price.1", "Price", "N"))
setnames(ans, c("Ticker", "Side", "Shares", "tmp", "BuyPrice", "SellPrice", "N"))

We interchange Shares and tmp so that Shares reflects the actual output we expect, based on the value of N as follows:

ans[, c("Shares", "tmp") := if (!N[1L]) 
             { val = Shares[1L]; list(tmp, val) }, by = Ticker]

We'll need a couple of parameters to aggregate and get the final result:

ans[, `:=`(N2= rep(c(FALSE, TRUE), c(.N-1L, 1L)), 
           csum = sum(Shares)), by = Ticker][, N2 := !(N2 * (csum != tmp))]

Finally,

ans1 = ans[(N2)][, c("N", "N2", "tmp", "csum") := NULL]
ans2 = ans[!(N2)][, N := N * 1L]
if (nrow(ans2) > 0) {
    ans2 = ans2[,  list("BUY", if (N[1L]) c(Shares+tmp-csum, csum-tmp) 
             else c(Shares, tmp-csum), BuyPrice, c(SellPrice, NA)), by=Ticker]
}
ans  = rbindlist(list(ans1, ans2))

#    Ticker Side Shares BuyPrice SellPrice
# 1:    AIG  BUY    100    34.56     30.02
# 2:   GOOG  BUY    200   457.00    460.00
# 3:    AIG  BUY    200    28.56     30.02
# 4:    NEM  BUY     50    45.00     56.00
# 5:    NEM  BUY     50    45.00     78.00
# 6:   GOOG  BUY    100   457.00    461.00
# 7:   GOOG  BUY    100   457.00        NA
# 8:    AIG  BUY    300    24.65     30.02
# 9:    AIG  BUY    100    24.65        NA

My guess is that this should be plentiful fast. But, it might be possible to optimise this further. I'll leave that to you, should you choose to build up on this answer.

Transactions reshaping from long to wide, joining Buy and Sell dataframes

2 Answers2

Original Answer (while the question was still in development and I wasn't paying close enough attention)

New Answer

Version with for

Version with apply

Version with `for`

Version with `apply`