1

I want to merge two datasets in R.

> dt_1
  co_code co_stkdate bse_closing_price bse_returns
1   21042 1990-01-01             47.50      1.0440
2   21042 1990-01-02             48.75      1.0263
3   21042 1990-01-03             42.50      0.8718
4   22765 1990-01-01             37.00      1.0000
5   22765 1990-01-02             37.00      1.0000
6   22765 1990-01-03             37.00      1.0000


> dt_2
  co_code co_stkdate bse_market_cap  bse_pe
1   21042 1990-01-01       338.8650 10.7236
2   21042 1990-01-02       347.7825 11.0058
3   21042 1990-01-03       303.1950  9.5948
4   21042 1990-01-04       303.1950  9.5948
5   22765 1990-01-01       206.7147  6.2263
6   22765 1990-01-02       206.7147  6.2263
7   22765 1990-01-03       206.7147  6.2263
8   22765 1990-01-09       202.5246  6.1001

The final desired output is:

> dt_3
    co_code co_stkdate bse_closing_price bse_returns bse_market_cap  bse_pe
1     21042 1990-01-01             47.50      1.0440       338.8650 10.7236
2     21042 1990-01-02             48.75      1.0263       347.7825 11.0058
3     21042 1990-01-03             42.50      0.8718       303.1950  9.5948
4     21042 1990-01-04                NA          NA       303.1950  9.5948
5     22765 1990-01-01             37.00      1.0000       206.7147  6.2263
6     22765 1990-01-02             37.00      1.0000       206.7147  6.2263
7     22765 1990-01-03             37.00      1.0000       206.7147  6.2263
8     22765 1990-01-09                NA          NA       202.5246  6.1001

I tried merge but it is giving me a lot of rows (24) which is not desired.

> merge(x = dt_1, y = dt_2, by = "co_code")
   co_code co_stkdate.x bse_closing_price bse_returns co_stkdate.y bse_market_cap  bse_pe
1    21042   1990-01-01             47.50      1.0440   1990-01-01       338.8650 10.7236
2    21042   1990-01-01             47.50      1.0440   1990-01-02       347.7825 11.0058
3    21042   1990-01-01             47.50      1.0440   1990-01-03       303.1950  9.5948
4    21042   1990-01-01             47.50      1.0440   1990-01-04       303.1950  9.5948
5    21042   1990-01-02             48.75      1.0263   1990-01-01       338.8650 10.7236
6    21042   1990-01-02             48.75      1.0263   1990-01-02       347.7825 11.0058
7    21042   1990-01-02             48.75      1.0263   1990-01-03       303.1950  9.5948
8    21042   1990-01-02             48.75      1.0263   1990-01-04       303.1950  9.5948
9    21042   1990-01-03             42.50      0.8718   1990-01-01       338.8650 10.7236
10   21042   1990-01-03             42.50      0.8718   1990-01-02       347.7825 11.0058
11   21042   1990-01-03             42.50      0.8718   1990-01-03       303.1950  9.5948
12   21042   1990-01-03             42.50      0.8718   1990-01-04       303.1950  9.5948
13   22765   1990-01-01             37.00      1.0000   1990-01-01       206.7147  6.2263
14   22765   1990-01-01             37.00      1.0000   1990-01-02       206.7147  6.2263
15   22765   1990-01-01             37.00      1.0000   1990-01-03       206.7147  6.2263
16   22765   1990-01-01             37.00      1.0000   1990-01-09       202.5246  6.1001
17   22765   1990-01-02             37.00      1.0000   1990-01-01       206.7147  6.2263
18   22765   1990-01-02             37.00      1.0000   1990-01-02       206.7147  6.2263
19   22765   1990-01-02             37.00      1.0000   1990-01-03       206.7147  6.2263
20   22765   1990-01-02             37.00      1.0000   1990-01-09       202.5246  6.1001
21   22765   1990-01-03             37.00      1.0000   1990-01-01       206.7147  6.2263
22   22765   1990-01-03             37.00      1.0000   1990-01-02       206.7147  6.2263
23   22765   1990-01-03             37.00      1.0000   1990-01-03       206.7147  6.2263
24   22765   1990-01-03             37.00      1.0000   1990-01-09       202.5246  6.1001

I am new to R and merging. I try reading some stack overflow posts on left join, outer join etc but all of them gives me the same 24 rows. How to join (merge) data frames (inner, outer, left, right)?

Nikhil Vidhani
  • 709
  • 5
  • 11
  • What you have is a many-to-many join. Each value of `co_code` joins to every instance of `co_code` in the other dataset. 4x3 = 12 rows of output for each `co_code`. You need to add a `time` column against each row within each `co_code` group, and add `time` to the `by=` variables. – thelatemail Nov 23 '17 at 04:13
  • @thelatemail Pls have a look at the question again, I edited some fields. I need to merge based on a unique combination of co_code and date. Adding a time variable will help if I somehow put the same time value in each dataset corresponding to the same combination of co_code and date. – Nikhil Vidhani Nov 23 '17 at 04:27
  • 5
    If you have the date, it's just a one-to-one merge - `merge(x = dt_1, y = dt_2, by = c("co_code","co_stkdate"), all=TRUE)` – thelatemail Nov 23 '17 at 04:28
  • @thelatemail It seems to work. I knew I was missing something very basic. Let me test this. Thanks :) – Nikhil Vidhani Nov 23 '17 at 04:33

1 Answers1

0

You can try the dplyr library to merge the two data frames.

Depending on whether you want all rows, just the ones in dt_1 and the ones in dt_2 that match dt_1, or the ones that are present in both frames, you might use -- respectively -- a full_join, a left_join/right_join, or an inner_join.

dt_3 <- full_join(dt_1, dt_2)

Since the columns in common have the same name, you don't need to specify a by condition.

You can check the syntax here as well.

Barbara
  • 1,118
  • 2
  • 11
  • 34