0

I want to compare two really large dataframes and make a consensus dataframe after matching a column ID.

Part of my first dataframe(input1):

ID  BGC_Class   Start   End BGC_Name    Similarity  MIBiG
GCA_000006785.2_ASM678v2    Bacteriocin 593677  606065  Streptolysin_S  100%    BGC0000566
GCA_000169475.1_ASM16947v1  Bacteriocin 633235  645623  Streptolysin_S  100%    BGC0000566
GCA_000433555.1_MGS126  Bacteriocin 524573  536961  Streptolysin_S  100%    BGC0000566

second(input2):

ID  Species_name    Strain_name
GCA_000169475.1_ASM16947v1  [Ruminococcus]_gnavus   [Ruminococcus]_gnavus_ATCC_29149_strain=ATCC_29149_
GCA_000433555.1_MGS126  [Ruminococcus]_gnavus   [Ruminococcus]_gnavus_CAG:126__

I want to match 'ID' columns in both dataframe and create a new dataframe (results) after matching ID features in both. So in ideal case, output dataframe would be:

ID  Species_name    Strain_name BGC_Class   Start   End BGC_Name    Similarity  MIBiG
GCA_000169475.1_ASM16947v1  [Ruminococcus]_gnavus   [Ruminococcus]_gnavus_ATCC_29149_strain=ATCC_29149_ Bacteriocin 633235  645623  Streptolysin_S  100%    BGC0000566
GCA_000433555.1_MGS126  [Ruminococcus]_gnavus   [Ruminococcus]_gnavus_CAG:126__ Bacteriocin 524573  536961  Streptolysin_S  100%    BGC0000566

For that, I have tried in R:

results<-data.frame(merge(input1,input2$ID, by.input1 = "input1$ID", by.input2 = "input2$ID"))

and also:

results <- match(input1$ID, input2$ID)

But I am getting same error in both:

Error: vector memory exhausted (limit reached?)

I am wondering if there any memory efficient way of doing this in R?

If not, can it be done by awk/sed scripts for these large dataset files? All comments are appreciated. Thank you.

NB: The original input files are here: https://sites.google.com/site/iicbbioinformatics/share

  • 1
    `by.input1=` is **not** a parameter for `merge()`. – jogo Apr 12 '19 at 06:51
  • Data with 70K and 30K rows is not big. Can easily be done using within R, use data.table to read the data and follow linked post on how to merge. – zx8754 Apr 12 '19 at 07:23

2 Answers2

0

Using left_join should work for you:

library(dplyr)
df <- left_join(input2, input1, by = "ID")
Sonny
  • 3,083
  • 1
  • 11
  • 19
  • Thanks for this. But it gives following 'Warning message: Column `ID` joining factor and character vector, coercing into character vector '. also the input1 contents are blank in df – Abhijit Barerjee Apr 12 '19 at 06:54
  • @AbhijitBarerjee: It's telling you you have ID as a character in one of the dataframes, and a factor in the other. That isn't necessarily a problem, it's just telling you it converted them to match. – Marius Apr 12 '19 at 06:56
  • @AbhijitBarerjee It would be better if you keep the classes of `ID` field same in both the frames – Sonny Apr 12 '19 at 07:03
  • Thanks. I now see.."> inp.tbl$ID %>% class() [1] "character" > all_strains$ID %>% class() [1] "factor"' .. How can I convert input2 to character? – Abhijit Barerjee Apr 12 '19 at 07:18
  • `input2$ID <- as.character(input2$ID)` – Sonny Apr 12 '19 at 07:28
0

I've never hit the memory limit in Awk; would like to know if it happens for you.

awk -F, 'NR==FNR{a[$1]=$0} NR!=FNR{i=$1; $1=""; print a[i] $0}' input2.csv input1.csv

updated to reflect CSV files

ID  Species_name    Strain_name BGC_Class Start End BGC_Name Similarity MIBiG
 Bacteriocin 593677 606065 Streptolysin_S 100% BGC0000566
GCA_000169475.1_ASM16947v1  [Ruminococcus]_gnavus   [Ruminococcus]_gnavus_ATCC_29149_strain=ATCC_29149_ Bacteriocin 633235 645623 Streptolysin_S 100% BGC0000566
GCA_000433555.1_MGS126  [Ruminococcus]_gnavus   [Ruminococcus]_gnavus_CAG:126__ Bacteriocin 524573 536961 Streptolysin_S 100% BGC0000566

Note the second line of output is wonky since there's no matching record. Please let me know how you would like to account for that and I'll prototype something in Awk.

vintnes
  • 2,014
  • 7
  • 16
  • umm.. I am not sure but I am getting blank output file using this. may be I am doing something wrong.. please feel free to try both of my input files here: https://sites.google.com/site/iicbbioinformatics/share – Abhijit Barerjee Apr 12 '19 at 07:07
  • "They're CSVs" would have been relevant information. Updated code. – vintnes Apr 12 '19 at 07:22