
I have two data frames. One contains a large list of paired identifiers:

rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A

And one which contains one of the two identifiers:

V1 V2 V3 V4 V5 V6
1    1_10439_A_AC  0 10439  A    AC
1    1_13417_CGAGA_C  0 13417  C CGAGA
1    1_14907_G_A  0 14907  G     A

What I want is to replace the ID in the second data frame with the corresponding ID from the first, wherever they match (I couldn't think of a succinct way of phrasing that for the title of this question, hence the awkward wording, and it's also why I may have failed to find duplicates). I.e.:

V1 V2 V3 V4 V5 V6
1    1_10439_A_AC  0 10439  A    AC
1    1_13417_CGAGA_C  0 13417  C CGAGA
1    rs6682375  0 14907  G     A

My solution at present is to use a for ... if loop as follows:

for (x in 1:nrow(df2)) {

    if (df2$V2[x] %in% df1$uniq_id) {

        # look up the matching row in df1 rather than reusing x,
        # which indexes df2 and would pull the wrong rsid
        df2$V2[x] <- df1$rsid[match(df2$V2[x], df1$uniq_id)]
    }
}

However, because both files are extremely large, I believe that this is likely a very inefficient way of doing this and am wondering if there is a faster method.

Someone suggested that using the match() function might be quicker, but the R documentation suggests that %in% is the more intuitive interface, and given my inexperience with match() I'm not sure how to apply it differently.
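For reference, my understanding of the vectorized match() approach being suggested is something like the following (a sketch against the small sample frames above, not tested on the full files):

```r
df1 <- data.frame(rsid    = c("rs796086906", "rs6682375"),
                  uniq_id = c("1_13868_G_A", "1_14907_G_A"))
df2 <- data.frame(V2 = c("1_10439_A_AC", "1_14907_G_A"))

# one lookup for all rows at once: position of each V2 in df1$uniq_id (NA if absent)
idx <- match(df2$V2, df1$uniq_id)

# replace only the matched rows, keep the original ID elsewhere
df2$V2 <- ifelse(is.na(idx), df2$V2, df1$rsid[idx])

df2$V2  # "1_10439_A_AC" "rs6682375"
```

This avoids the row-by-row loop entirely, which is where I assume the time is going.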

Any help appreciated.

Sabor117

2 Answers


This is an update-join, in data.table terminology. Assuming the first table is called df and the second is called df2:

library(data.table)
setDT(df)
setDT(df2)

df2[df, on = .(V2 = uniq_id), V2 := rsid]

df2
#    V1              V2 V3    V4 V5    V6
# 1:  1    1_10439_A_AC  0 10439  A    AC
# 2:  1 1_13417_CGAGA_C  0 13417  C CGAGA
# 3:  1       rs6682375  0 14907  G     A

Data used

df <- fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- fread('
V1 V2 V3 V4 V5 V6
1    1_10439_A_AC  0 10439  A    AC
1    1_13417_CGAGA_C  0 13417  C CGAGA
1    1_14907_G_A  0 14907  G     A
')
IceCreamToucan
  • Can I check, what is the purpose of the ```setDT()``` function here? As I understand it would change the data from data frames in data tables, but what is the purpose of that? Particularly as normally when I read in data I actually use: ```fread(x, data.table = FALSE)``` – Sabor117 Jun 28 '19 at 14:37
  • The bracket `[` is a function, and that function has methods, meaning it calls different functions depending on the class of the object on which it's acting. Marking `df` as a `data.table` makes `[` call the `[.data.table` function instead of the usual `[`, i.e. `[.data.frame`. One difference between the methods is that `[.data.table` allows you to do update joins (try running the code without `setDT` and you'll get an error). Actually `setDT` does more than change the behaviour of `[`, but that's the part relevant to this answer. For more info see `?setDT` – IceCreamToucan Jun 28 '19 at 14:41
  • Hmm, okay. So it's about the different behaviour of `[` with data tables as compared with data frames? Also, for the record, in the above example surely you wouldn't need ```setDT()```, because you initially read the files in as data tables and then didn't reassign the result? (Surely for that you'd do ```df = setDT(df)```?) – Sabor117 Jun 28 '19 at 15:30
  • That's why I used it in this answer, yeah. But that's not the only thing setDT does – IceCreamToucan Jun 28 '19 at 15:31
  • Yes, it's true I don't need it to run the code in this answer since I read them in with `fread`. I only did that since I assume you already had them loaded, i.e. you weren't reading them in with `fread`. – IceCreamToucan Jun 28 '19 at 15:41
  • I ran a test and this seems to be perfect. The only real thing will be if it's quicker! I also discovered that ```setDT()``` actually automatically converts the data frame to a data table and doesn't need: ```x = setDT(x)``` Thanks for the solution! – Sabor117 Jun 28 '19 at 15:52
  • Going to write a quick update here that my previous method was left running for ~18 hours with no sign of finishing while this method was done in less than 10. Excellent stuff. – Sabor117 Jun 29 '19 at 15:29
  • Ignore my previous (now deleted) comment about adding a key making the join faster. [Apparently that is not true](https://stackoverflow.com/questions/20039335/what-is-the-purpose-of-setting-a-key-in-data-table/20057411#20057411) – IceCreamToucan Jun 29 '19 at 15:40
  • If I'm reading that answer correctly though, it seems like (1) running `setkey` + doing the join is not faster than (2) just doing the join, but the join step of option 1 is faster than the join step of option 2, so it could be valuable if you want to split the operation into steps which don't have to be executed all at once. Maybe @Arun can comment (not sure if those "@"s work when he hasn't interacted with this question) – IceCreamToucan Jun 29 '19 at 15:50
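To make the by-reference point from the comments concrete, here is a quick sketch (assuming `data.table` is installed):

```r
library(data.table)

df <- data.frame(x = 1:3)
class(df)   # "data.frame"

# setDT converts df to a data.table in place, by reference;
# no reassignment like df <- setDT(df) is needed
setDT(df)
class(df)   # "data.table" "data.frame"
```

Because `df` now inherits from `data.table`, `[` dispatches to `[.data.table`, which is what makes the update-join syntax in the answer work.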

A method using base R. It would also be easy to do this with dplyr and its `left_join` function, if desired.

df <- data.table::fread('
rsid uniq_id
rs796086906 1_13868_G_A
rs546169444 1_14464_T_A
rs6682375 1_14907_G_A
rs6682385 1_14930_G_A
')
df2 <- data.table::fread('
V1 V2 V3 V4 V5 V6
1    1_10439_A_AC  0 10439  A    AC
1    1_13417_CGAGA_C  0 13417  C CGAGA
1    1_14907_G_A  0 14907  G     A
')

df2 <- merge(df2, df, by.x = "V2", by.y = "uniq_id", all.x = TRUE)
df2$V2 <- ifelse(!is.na(df2$rsid), df2$rsid, df2$V2)
df2$rsid <- NULL

df2

#                 V2 V1 V3    V4 V5    V6
# 1:    1_10439_A_AC  1  0 10439  A    AC
# 2: 1_13417_CGAGA_C  1  0 13417  C CGAGA
# 3:       rs6682375  1  0 14907  G     A
Adam Sampson
  • Aha, I see! So you're merging the two files on ```uniq_id```, then simply shifting the ```rsid``` over. Elegant. The only real question in my mind now is whether this method would be any faster than IceCreamToucan's. Also, from my understanding ```dplyr``` is actually very quick as well, do you know if using that might even be better? – Sabor117 Jun 28 '19 at 15:56
  • IceCreamToucan's should be faster. Some people struggle with data.table in the beginning though, so I included a base method. – Adam Sampson Jun 28 '19 at 18:52
  • The logic I used also works in dplyr, but you would do `df2 %>% left_join(df,by = c("V2" = "uniq_id")) %>% mutate(V2 = ifelse(!is.na(rsid),rsid,V2)) %>% select(-rsid)`. This old post shows how much faster a data.table update join is: https://stackoverflow.com/a/51577964/8485287 – Adam Sampson Jun 28 '19 at 18:57