1

I have 2 data frames df2 and DF.

> DF
        date tickers
1 2000-01-01       B
2 2000-01-01    GOOG
3 2000-01-01       V
4 2000-01-01    YHOO
5 2000-01-02     XOM

> df2
        date tickers quantities
1 2000-01-01      BB         11
2 2000-01-01     XOM         23
3 2000-01-01    GOOG         42
4 2000-01-01    YHOO         21
5 2000-01-01       V       2112
6 2000-01-01       B         13
7 2000-01-02     XOM         24
8 2000-01-02      BB        422

i need the values from df2 those are present in DF. That means i require the following output:

3 2000-01-01    GOOG         42
4 2000-01-01    YHOO         21
5 2000-01-01       V       2112
6 2000-01-01       B         13
7 2000-01-02     XOM         24

So I used the following code:

> subset(df2,df2$date %in% DF$date & df2$tickers %in% DF$tickers)
        date tickers quantities
2 2000-01-01     XOM         23
3 2000-01-01    GOOG         42
4 2000-01-01    YHOO         21
5 2000-01-01       V       2112
6 2000-01-01       B         13
7 2000-01-02     XOM         24

But the output contains one extra column.That is because the ticker 'xom' is present in 2 days in df2. so both rows are selected. What modification is needed in my code?

The dput is as follows:

> dput(DF)
structure(list(date = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("2000-01-01", 
"2000-01-02"), class = "factor"), tickers = structure(c(4L, 5L, 
6L, 8L, 7L), .Label = c("A", "AA", "AAPL", "B", "GOOG", "V", 
"XOM", "YHOO", "Z"), class = "factor")), .Names = c("date", "tickers"
), row.names = c(NA, -5L), class = "data.frame")
> dput(df2)
structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L), .Label = c("2000-01-01", "2000-01-02"), class = "factor"), 
    tickers = structure(c(2L, 5L, 3L, 6L, 4L, 1L, 5L, 2L), .Label = c("B", 
    "BB", "GOOG", "V", "XOM", "YHOO"), class = "factor"), quantities = c(11, 
    23, 42, 21, 2112, 13, 24, 422)), .Names = c("date", "tickers", 
"quantities"), row.names = c(NA, -8L), class = "data.frame")
Dinoop Nair
  • 2,663
  • 6
  • 31
  • 51
  • What do you want to do with the duplicate rows? Take just one, sum them, return the values as separate columns...? – Thomas May 06 '13 at 12:14
  • 1
    are you just looking for `merge(DF,df2)` ... ? that gives the same answer as the `sqldf` answer below ... – Ben Bolker May 06 '13 at 12:24
  • i thought merge() is possible only for data frames having same number of columns.Thats why i asked this question. Thank you for your help. – Dinoop Nair May 06 '13 at 12:30

2 Answers2

3

Using sqldf package:

require(sqldf)

sqldf("SELECT d2.date, d2.tickers, d2.quantities FROM df2 d2 
       JOIN DF d1 ON d1.date=d2.date AND d1.tickers=d2.tickers")

##        date tickers quantities
## 1 2000-01-01    GOOG         42
## 2 2000-01-01    YHOO         21
## 3 2000-01-01       V       2112
## 4 2000-01-01       B         13
## 5 2000-01-02     XOM         24
Nishanth
  • 6,932
  • 5
  • 26
  • 38
1

This is not so different from my answer to this post of yours, but requires a little modification:

df2[duplicated(rbind(DF, df2[,1:2]))[-seq_len(nrow(DF))], ]

#         date tickers quantities
# 3 2000-01-01    GOOG         42
# 4 2000-01-01    YHOO         21
# 5 2000-01-01       V       2112
# 6 2000-01-01       B         13
# 7 2000-01-02     XOM         24

Note: This provides the output with the rows in the same order as it were in df2.


Alternatively, as Ben suggests, using merge:

merge(df2, DF, by=c("date", "tickers"))

will give the same result as well (but not necessarily in the same order).

Community
  • 1
  • 1
Arun
  • 116,683
  • 26
  • 284
  • 387