Questions tagged [fuzzyjoin]

An R package for joining tables together on inexact matching.

Join tables together based not on whether columns match exactly, but on whether they are similar by some comparison. Implementations include string distance, regular expression, and custom matching functions. The syntax is similar to that of dplyr's joins.
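
As a rough illustration of the string-distance case described above, here is a minimal sketch. The data frames `left` and `right` and their contents are made up for this example; the call assumes fuzzyjoin's `stringdist_left_join()` with its `by`, `max_dist` and `distance_col` arguments.

```r
library(fuzzyjoin)

# Hypothetical example data: one name in `left` contains a typo
left <- data.frame(
  name = c("Stanford University", "Harvard Univeristy"),
  stringsAsFactors = FALSE
)
right <- data.frame(
  name = c("Stanford University", "Harvard University"),
  code = c("SU", "HU"),
  stringsAsFactors = FALSE
)

# Keep every row of `left`; match rows of `right` whose name is within
# a string distance of 2, and record the distance in a `dist` column
joined <- stringdist_left_join(
  left, right,
  by = "name",
  max_dist = 2,
  distance_col = "dist"
)

print(joined)
```

Because both tables share the column name `name`, the result should carry `name.x` and `name.y`, as with a dplyr join on differently suffixed columns. The other join families mentioned in the description (`regex_*_join`, `fuzzy_*_join` with a custom `match_fun`) follow the same calling pattern.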

161 questions
9
votes
1 answer

fuzzyjoin two data frames using data.table

I have been working on a fuzzyjoin to join 2 data frames together; however, due to memory issues the join fails with "cannot allocate memory of…". So I am trying to join the data using data.table instead. A sample of the data is below. df1 looks like: ID …
user8959427
  • 2,027
  • 9
  • 20
9
votes
2 answers

fuzzy join with stringdist_join() in R, Error: NAs are not allowed in subscripted assignments

First of all, I am sorry if my formatting is bad; this is my first time posting (I am also new to programming and R). I am trying to merge two data frames together on string variables. I am merging university names, which might not match up perfectly, so I…
Brian
  • 113
  • 1
  • 5
9
votes
2 answers

Combined fuzzy and exact matching

I have two tables containing addresses (street, city, zipcode and two fields containing concatenated values of these). I would like to do fuzzy matching on Zipcode, but only for those cases which have exactly the same StrCity value. I have started with…
PrzeM
  • 211
  • 3
  • 15
8
votes
1 answer

How to fuzzy join based on multiple columns and conditions?

I'm trying to left join two data frames (df1, df2). The data frames have two columns in common: zone and slope. Zone is a factor column and slope is numeric. df1 = data.frame(slope = c(1:6), zone = c(rep("Low", 3), rep("High", 3))) df2 =…
7
votes
1 answer

How to perform a fuzzy join with fuzzyjoin::difference_* in R

I'm working with two different datasets that I want to merge based on a threshold. Let's say the two dataframes look like this: library(dplyr) library(fuzzyjoin) library(lubridate) df1 = data_frame(Item=1:5, DateTime=c("2015-01-01…
tblznbits
  • 6,602
  • 6
  • 36
  • 66
6
votes
1 answer

How to limit fuzzy join only returning one match

I am trying to create a program in R to replace city names or airport names with the three digit airport code. I want to do fuzzy matching to allow more flexibility since the data with the city/airport names I am trying to replace is coming in from…
sarahbarnes
  • 103
  • 2
  • 7
5
votes
0 answers

Using stringdist_left_join to join by multiple columns, but not all of them fuzzy

I have a 1.3 million-row dataset of publications and, for each record, I want to retrieve a paper_id from a second dataset with 8.6 million rows. The idea is to use multiple columns from both tables to find matches for dataset1 in dataset2 as shown…
5
votes
5 answers

How to make a fuzzy join in R using more than one variable on each side

I would like to join the two data frames: a <- data.frame(x=c(1,3,5)) b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b")) with a condition like (x>start)&(x<end) … I don't…
Nicolas2
  • 2,170
  • 1
  • 6
  • 15
4
votes
1 answer

Return multiple possible matches when fuzzy joining two dataframes or vectors in R if they share a word in common

Is there a way of joining two dataframes where a row in the first dataframe is joined with every row in the second dataframe if they share a word in common? For example: companies1 <- data.frame(company_name = c("Walmart", "Amazon", "Apple",…
4
votes
1 answer

What's the best way to fuzzy join (multiple) dataframes by multiple columns?

I need to join multiple data frames, but given that the experiment ran online and participants were often sloppy when entering their ID, I added redundancy. They also had to add letters of their parents' names and their zip code. I checked manually (a…
Amir Moye
  • 51
  • 3
4
votes
1 answer

fuzzyjoin with dates in R

I am working on a project where I am analyzing individual-level survey data within countries based on outcomes of sports matches across countries and I am not sure what the most efficient way to produce the merge that I want is. I am working on two…
Julian
  • 451
  • 5
  • 14
4
votes
2 answers

Merging two tables where one column is substring of the other in R

I have two data.frames with columns that contain accession numbers subset of df 1: sub_df1 <- structure(list(database = "CLO, ArrayExpress, ArrayExpress, ATCC, BCRJ, BioSample, CCLE, ChEMBL-Cells, ChEMBL-Targets, Cosmic, Cosmic, Cosmic, Cosmic-CLP,…
Beeba
  • 642
  • 1
  • 7
  • 18
4
votes
1 answer

R: fuzzy join between two datasets

I need to fuzzy match and get the distance between the zip / address in two distinct datasets. Here below is an example: name_a <- c("Aldo", "Andrea", "Alberto", "Antonio", "Angelo") name_b <- c("Sara", "Serena", "Silvia", "Sonia",…
claudia
  • 81
  • 7
4
votes
0 answers

Fuzzyjoin match based on two different columns instead of one?

I would like to ask a question regarding the fuzzyjoin package. I am very new to R, and I promise I have read through the readme file and followed the examples on https://cran.r-project.org/web/packages/fuzzyjoin/index.html before I asked this…
ywjong
  • 41
  • 1
4
votes
0 answers

Parallel Fuzzyjoin

I'm trying to speed up a fuzzyjoin with parallel processing. I have two dataframes, each with several thousand rows, which need to be partially regex joined. However, it's currently taking over 40 minutes on a single core. The dataframe looks…
Highland
  • 148
  • 1
  • 7