0

I have two data tables.

Table 1: 1349445 rows and 21 cols 
Table 2: 3235 rows x 4 cols

Table 1: 

YEAR    STATE_NAME    CROP     .......
1990    Alabama       Cotton
1990    Alabama       Cotton
1990    Alabama       Peanuts 
.
.
.

Table 2: 

STATE    STATEFP     COUNTYFP    STATE_NAME
AK       2           13          Alaska
AK       2           16          Alaska
AK       2           20          Alaska
AK       2           50          Alaska

I want to merge the two tables by "STATE_NAME"

Table 1 <- data.table(Table 1)
Table 2 <- data.table(Table 2)
setkeyv(Table 1, c("STATE_NAME")) 
setkeyv(Table 2, c("STATE_NAME")) 
Hydra_merge <- merge(Table 1, Table 2, all.x = TRUE)

I am getting the below error. Can somebody help me to figure out what I am doing wrong here.

Thanks in advance.

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 141691725 rows; more than 1352680 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
alistaire
  • 42,459
  • 4
  • 77
  • 117
  • 1
    When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jan 25 '18 at 18:04
  • As you are getting a partial Cartesian join (rows unmatched in table 1 x rows unmatched in table 2) it is likely that you have state names in your tables that do not match. Check the spelling of each, and check for additional spaces in some names or at the end of a name, as these may look identical at first glance. – Stewart Ross Jan 25 '18 at 19:19

1 Answers1

0

I am not sure why nobody answered this yet, and probably this will be useless for OP, but this is quite straightforward!

As the error message states, you have plenty of rows in both tables with repeated keys. If you have two tables with, say, 5 and 6 rows, and the keys are unique, their join will have at least 5 and at most 11 rows (depending on whether all.x, all.y or all) is true.

If, instead, in both tables all rows have the same key, joining them will result in a table with 30 kinda meaningless rows.

as in:

table_1:              table_2:
key   val1            key   val2
  k      a              k      1
  k      b              k      2
  k      c              k      3
  k      d              k      4
  k      e              k      5
                        k      6


merge(table_1, table_2)
key    val1    val2
  k       a       1
  k       a       2
  k       a       3
  k       a       4
     ...     ...
  k       c       2
  k       c       3
  k       c       4
  k       c       5
     ...     ...
  k       e       3
  k       e       4
  k       e       5
  k       e       6


data.table noticed and it's trying to help you. Which is also why it states If you are sure you wish to proceed, rerun with allow.cartesian=TRUE and go home with your, likely wrong but who am I to tell, cartesian product of the two tables.

Now, I am very tempted to try and guess the size of your two tables, given that the sum of their nrows is 1.352.680, the resulting mess of a table has 141.691.725 and the states are 50 (but one of the tables skips Alaska), but maybe next time.

Stefano
  • 219
  • 1
  • 11