5

pandas.merge_asof merges two dataframes, does a left join except it matches on the nearest key rather than equal keys.

Example (stolen from the documentation):

>>> quotes
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03

>>> trades
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

>>> pd.merge_asof(trades, quotes,
...                       on='time',
...                       by='ticker')
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

In the above example, pd.merge_asof matches each row of trades with the row of quotes having the same ticker and the nearest time.

I find this operation quite indispensible in my workflow, and I have been racking my brains trying to think how to accomplish this in R. Of course I could just do the operation in python and read the data frame back in R, but part of my motivation is to learn R.

Alex Smart
  • 83
  • 1
  • 6
  • Please check out https://stackoverflow.com/questions/6709151/how-do-i-combine-two-data-frames-based-on-two-columns – programandoconro Oct 24 '19 at 09:36
  • And also [stackoverflow.com/questions/49156597](https://stackoverflow.com/questions/49156597/merge-data-based-on-nearest-date-r) for join on nearest date. – meriops Oct 24 '19 at 09:58

5 Answers5

2

You can use the data.table package to do a rolling join:

trades[quotes, on=.(ticker, time), roll=-Inf, c("bid","ask") := .(bid, ask)]  

output:

                  time ticker  price quantity    bid    ask
1: 2016-05-25 13:30:00   MSFT  51.95       75  51.95  51.96
2: 2016-05-25 13:30:00   MSFT  51.95      155  51.97  51.98
3: 2016-05-25 13:30:00   GOOG 720.77      100 720.50 720.93
4: 2016-05-25 13:30:00   GOOG 720.92      100 720.50 720.93
5: 2016-05-25 13:30:00   AAPL  98.00      100     NA     NA

data:

library(data.table)

quotes <- fread("time ticker     bid     ask
2016-05-25_13:30:00.023   GOOG  720.50  720.93
2016-05-25_13:30:00.023   MSFT   51.95   51.96
2016-05-25_13:30:00.030   MSFT   51.97   51.98
2016-05-25_13:30:00.041   MSFT   51.99   52.00
2016-05-25_13:30:00.048   GOOG  720.50  720.93
2016-05-25_13:30:00.049   AAPL   97.99   98.01
2016-05-25_13:30:00.072   GOOG  720.50  720.88
2016-05-25_13:30:00.075   MSFT   52.01   52.03")

trades <- fread("time ticker   price  quantity
2016-05-25_13:30:00.023   MSFT   51.95        75
2016-05-25_13:30:00.038   MSFT   51.95       155
2016-05-25_13:30:00.048   GOOG  720.77       100
2016-05-25_13:30:00.048   GOOG  720.92       100
2016-05-25_13:30:00.048   AAPL   98.00       100")

quotes[, time := as.POSIXct(time, format="%Y-%m-%d_%H:%M:%OS")]  
trades[, time := as.POSIXct(time, format="%Y-%m-%d_%H:%M:%OS")]   
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
1

Complex joins can be done using SQL (where the test input is shown reproducibly in the Note at the end). One of the advantages of this approach is that it is quite clear what criteria are being used from the SQL statement.

Assuming you want to join on ticker and time difference less than .002

library(sqldf)

sqldf("select t.*, q.bid, q.ask
  from trades t
  left join quotes q on t.ticker = q.ticker and abs(q.time - t.time) < .002")

giving:

                 time ticker  price quantity    bid    ask
1 2016-05-25 13:30:00   MSFT  51.95       75  51.95  51.96
2 2016-05-25 13:30:00   MSFT  51.95      155     NA     NA
3 2016-05-25 13:30:00   GOOG 720.77      100 720.50 720.93
4 2016-05-25 13:30:00   GOOG 720.92      100 720.50 720.93
5 2016-05-25 13:30:00   AAPL  98.00      100  97.99  98.01

or to join on ticker and minimum time difference:

sqldf("select t.*, q.bid, q.ask, min(abs(q.time - t.time))
  from trades t
  left join quotes q on t.ticker = q.ticker
  group by t.rowid")[1:6]

giving:

                 time ticker  price quantity    bid    ask
1 2016-05-25 13:30:00   MSFT  51.95       75  51.95  51.96
2 2016-05-25 13:30:00   MSFT  51.95      155  51.99  52.00
3 2016-05-25 13:30:00   GOOG 720.77      100 720.50 720.93
4 2016-05-25 13:30:00   GOOG 720.92      100 720.50 720.93
5 2016-05-25 13:30:00   AAPL  98.00      100  97.99  98.01

or to join on minimum difference within time difference of 0.002

sqldf("select t.*, q.bid, q.ask, min(abs(q.time - t.time))
  from trades t
  left join quotes q on t.ticker = q.ticker and abs(q.time - t.time) < 0.002
  group by t.rowid")[1:6]

giving:

                 time ticker  price quantity    bid    ask
1 2016-05-25 13:30:00   MSFT  51.95       75  51.95  51.96
2 2016-05-25 13:30:00   MSFT  51.95      155     NA     NA
3 2016-05-25 13:30:00   GOOG 720.77      100 720.50 720.93
4 2016-05-25 13:30:00   GOOG 720.92      100 720.50 720.93
5 2016-05-25 13:30:00   AAPL  98.00      100  97.99  98.01

Note

Lines1 <- "
                     time ticker     bid     ask
0 2016-05-25T13:30:00.023   GOOG  720.50  720.93
1 2016-05-25T13:30:00.023   MSFT   51.95   51.96
2 2016-05-25T13:30:00.030   MSFT   51.97   51.98
3 2016-05-25T13:30:00.041   MSFT   51.99   52.00
4 2016-05-25T13:30:00.048   GOOG  720.50  720.93
5 2016-05-25T13:30:00.049   AAPL   97.99   98.01
6 2016-05-25T13:30:00.072   GOOG  720.50  720.88
7 2016-05-25T13:30:00.075   MSFT   52.01   52.03"
quotes <- read.table(text = Lines1, as.is = TRUE)
quotes <- transform(quotes, time = as.POSIXct(sub("T", " ", time)))

Lines2 <- "
                     time ticker   price  quantity
0 2016-05-25T13:30:00.023   MSFT   51.95        75
1 2016-05-25T13:30:00.038   MSFT   51.95       155
2 2016-05-25T13:30:00.048   GOOG  720.77       100
3 2016-05-25T13:30:00.048   GOOG  720.92       100
4 2016-05-25T13:30:00.048   AAPL   98.00       100"
trades <- read.table(text = Lines2, as.is = TRUE)
trades <- transform(trades, time = as.POSIXct(sub("T", " ", time)))
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
1

You can also use the data.table package to perform a non-equi join:

quotes[trades, on=.(ticker, time<=time), .(time=i.time, ticker, price, quantity,  bid, ask), mult='last']

This gives more control and is easier to tweak upon other matching criteria. And the result is the same.

                      time ticker  price quantity    bid    ask
1: 2016-05-25 13:30:00.023   MSFT  51.95       75  51.95  51.96
2: 2016-05-25 13:30:00.038   MSFT  51.95      155  51.97  51.98
3: 2016-05-25 13:30:00.048   GOOG 720.77      100 720.50 720.93
4: 2016-05-25 13:30:00.048   GOOG 720.92      100 720.50 720.93
5: 2016-05-25 13:30:00.048   AAPL  98.00      100     NA     NA
Martin Gal
  • 16,640
  • 5
  • 21
  • 39
Wei Zhao
  • 11
  • 2
0

The fuzzyjoin package has exactly this functionality (joining based on criteria). For example: How can I match fuzzy match strings from two datasets?

Arthur Yip
  • 5,810
  • 2
  • 31
  • 50
-1

You can use the merge function to join two data frames in R

merge(trades,quotes,by="ticker",all=TRUE)
Ashish
  • 337
  • 3
  • 11