I have two data.frames with columns that contain accession numbers
subset of df 1:
sub_df1 <- structure(list(database = "CLO, ArrayExpress, ArrayExpress, ATCC, BCRJ, BioSample, CCLE, ChEMBL-Cells, ChEMBL-Targets, Cosmic, Cosmic, Cosmic, Cosmic-CLP, GDSC, GEO, GEO, GEO, IGRhCellID, LINCS_LDP, Wikidata",
database_accession = "CLO_0009006, E-MTAB-2770, E-MTAB-3610, CRL-7724, 0337, SAMN03471142, SH4_SKIN, CHEMBL3308177, CHEMBL2366309, 687440, 909713, 2159447, 909713, 909713, GSM887568, GSM888651, GSM1670420, SH4, LCL-1280, Q54953204"), .Names = c("database",
"database_accession"), row.names = 2L, class = "data.frame")
subset of df 2:
sub_df2 <- structure(list(database_accession = "SH4_SKIN", G1 = -1.907138,
G2 = -7.617305, G3 = -3.750553, G4 = 2.615004, G5 = 9.751557), .Names = c("database_accession",
"G1", "G2", "G3", "G4", "G5"), row.names = 101L, class = "data.frame")
I would like to merge the two dataframes by the column database_accession
but the problem is they are not exact matches. the string insub_df2
is a substring of the string in sub_df1
.
I thought about using fuzzyjoin but having a hard time getting the match algorithm right.