
I have been working on this for a while and still haven't found an efficient way to handle it.

I have two data frames, and both are very large.

df1 looks like this (one ID can have multiple prices):

ID       Price

1          20
1           9
2          12
3          587
3           59
3           7
4           78
5           12
6          290
6          191
...
1000000    485

df2 looks like this (each ID has only one location):

ID       Location     
1           USA
2           CAN
3           TWN
4           USA
5           CAN
6           AUS
...
1000000     JAP

I want to create a new data frame that looks like this (i.e. add Location to df1 based on ID):

ID       Price     Location

1          20        USA
1           9        USA
2          12        CAN
3          587       TWN
3           59       TWN
3           7        TWN
4           78       USA
5           12       CAN
6          290       AUS
6          191       AUS
...
1000000    485       JAP

I tried `merge`, but R gave me the error "negative length vectors are not allowed". Both data frames are huge: one has over 2M rows and the other over 0.6M rows.

I also tried lapply inside a loop, but that failed too. I cannot figure out how to do this except with two nested loops, which would take far too long.
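
For reference, here is a minimal sketch of the lookup I am after, on tiny made-up versions of the two tables (the real ones are the sizes mentioned above); the merge call below is the one that errors out on the full data:

# Tiny stand-ins for the real tables (the real df1/df2 have millions of rows)
df1 <- data.frame(ID = c(1, 1, 2, 3, 3, 3), Price = c(20, 9, 12, 587, 59, 7))
df2 <- data.frame(ID = c(1, 2, 3), Location = c("USA", "CAN", "TWN"))

# Base-R left join: keep every row of df1 and attach Location by ID.
# This is the call that fails with "negative length vectors are not allowed" at full size.
merge(df1, df2, by = "ID", all.x = TRUE)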

  • For large data perhaps try the data.table options at the duplicate question. A merge/join will be more efficient than a double loop. Make sure IDs are unique in `df2`. – MrFlick Feb 11 '20 at 19:01
  • @MrFlick, IDs are unique in df2 and I tried the method provided below. Unfortunately, I got an error message. :( – mimibao1009 Feb 11 '20 at 22:08

1 Answer


We can do a join with data.table to efficiently create the 'Location' column:

library(data.table)
# Convert df1 to a data.table by reference, then fill 'Location' with an update join on ID
setDT(df1)[df2, Location := Location, on = .(ID)]
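
A minimal self-contained sketch of the same update join on toy data (column names as in the question); the i. prefix, as in i.Location, is an optional way to make explicit that the value comes from df2:

library(data.table)

df1 <- data.table(ID = c(1, 1, 2, 3), Price = c(20, 9, 12, 587))
df2 <- data.table(ID = c(1, 2, 3), Location = c("USA", "CAN", "TWN"))

# Update join: for each df1 row matched on ID, write df2's Location into df1
# by reference, so no copy of the large table is made
df1[df2, Location := i.Location, on = .(ID)]
df1
#    ID Price Location
# 1:  1    20      USA
# 2:  1     9      USA
# 3:  2    12      CAN
# 4:  3   587      TWN
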
akrun
  • Hi, I tried but get an error message as below: Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice. – mimibao1009 Feb 11 '20 at 22:05
  • @mimibao1009 Okay, then you may need a machine with more memory. – akrun Feb 11 '20 at 22:07
  • It finally worked after I renamed the column from Location to location.... @@" Thank you so much!!!! – mimibao1009 Feb 11 '20 at 22:10
  • I don't understand how the column name affects it, but I am glad it finally works. Also, I think you may be right about the memory part, because R ran for a while. – mimibao1009 Feb 11 '20 at 22:12
  • Again, thank you so much. :-D – mimibao1009 Feb 11 '20 at 22:12
  • Just did. Thank you again. – mimibao1009 Feb 12 '20 at 01:19
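
A side note on the vecseq / 2^31 error quoted in the comments above: the message points at duplicate key values in the join table. A quick check along those lines (a sketch, using `df2` as the lookup table) could be:

library(data.table)

# List any IDs that occur more than once in the lookup table; duplicates here
# would make the join multiply rows and can exceed the 2^31-row limit
setDT(df2)[, .N, by = ID][N > 1L]

# Equivalent quick check in base R (0 means every ID is unique)
anyDuplicated(df2$ID)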