I want to merge two very large data frames into a single combined data frame by matching on an ID column.
Part of my first data frame (input1):
ID BGC_Class Start End BGC_Name Similarity MIBiG
GCA_000006785.2_ASM678v2 Bacteriocin 593677 606065 Streptolysin_S 100% BGC0000566
GCA_000169475.1_ASM16947v1 Bacteriocin 633235 645623 Streptolysin_S 100% BGC0000566
GCA_000433555.1_MGS126 Bacteriocin 524573 536961 Streptolysin_S 100% BGC0000566
Part of the second data frame (input2):
ID Species_name Strain_name
GCA_000169475.1_ASM16947v1 [Ruminococcus]_gnavus [Ruminococcus]_gnavus_ATCC_29149_strain=ATCC_29149_
GCA_000433555.1_MGS126 [Ruminococcus]_gnavus [Ruminococcus]_gnavus_CAG:126__
I want to match the 'ID' columns of the two data frames and create a new data frame (results) containing the rows whose ID appears in both. In the ideal case, the output data frame would be:
ID Species_name Strain_name BGC_Class Start End BGC_Name Similarity MIBiG
GCA_000169475.1_ASM16947v1 [Ruminococcus]_gnavus [Ruminococcus]_gnavus_ATCC_29149_strain=ATCC_29149_ Bacteriocin 633235 645623 Streptolysin_S 100% BGC0000566
GCA_000433555.1_MGS126 [Ruminococcus]_gnavus [Ruminococcus]_gnavus_CAG:126__ Bacteriocin 524573 536961 Streptolysin_S 100% BGC0000566
For that, I have tried the following in R:
results<-data.frame(merge(input1,input2$ID, by.input1 = "input1$ID", by.input2 = "input2$ID"))
and also:
results <- match(input1$ID, input2$ID)
But I get the same error from both:
Error: vector memory exhausted (limit reached?)
Is there a memory-efficient way of doing this in R?
If not, can it be done with awk/sed scripts on these large files? All comments are appreciated. Thank you.
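For the awk route, this is a minimal sketch of the kind of join I have in mind, not a tested solution for the real files. It assumes both files are whitespace-delimited text with a header row and the ID in the first column, that input2 has exactly three columns with unique IDs, and that input2 fits in memory. The function name join_by_id and the file names input1.txt, input2.txt and results.txt are hypothetical. Only input2 is held in memory; input1 is streamed line by line.

```shell
#!/bin/sh
# Sketch: inner join of two whitespace-delimited files on column 1.
# Assumptions (not verified against the real data): ID is column 1 in both
# files, input2 has exactly 3 columns, and IDs in input2 are unique.
# Usage (hypothetical file names):
#   join_by_id input2.txt input1.txt > results.txt
join_by_id() {
    awk '
        # First file (input2): remember Species_name and Strain_name per ID.
        NR == FNR { meta[$1] = $2 OFS $3; next }

        # Second file (input1): print only rows whose ID was seen in input2.
        # The header rows match on the literal "ID", so a header is emitted too.
        $1 in meta {
            line = $1 OFS meta[$1]
            for (i = 2; i <= NF; i++) line = line OFS $i
            print line
        }
    ' "$1" "$2"
}
```

Because input1 is streamed, an ID that occurs on several rows of input1 would produce one output row per occurrence, each carrying the same species and strain columns.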
NB: The original input files are here: https://sites.google.com/site/iicbbioinformatics/share