Left joint dataset with partial string matching [R]

Question

I would like to ask how to perform left joint operation with 2 datasets when variable to joint by has different string.

I know it can be done "by-hand" but I would prefer more efficient approach

I provided 2 datasets in HERE.

Here are some rows from each data set.

all_data

   GID_0 NAME_0   GID_1               NAME_1 NL_NAME_1      GID_2    provstate    VARNAME_2 NL_NAME_2      TYPE_2  ENGTYPE_2 CC_2
1    FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA>  FRA.1.1_1          Ain         <NA>      <NA> Département Department   01
2    FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA>  FRA.1.2_1       Allier Basses-Alpes      <NA> Département Department   03
3    FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA>  FRA.1.3_1      Ardèche         <NA>      <NA> Département Department   07
4    FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA>  FRA.1.4_1       Cantal         <NA>      <NA> Département Department   15
5    FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA>  FRA.1.5_1        Drôme         <NA>      <NA> Département Department   26
6    FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA>  FRA.1.6_1  Haute-Loire         <NA>      <NA> Département Department   43
7    FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA>  FRA.1.7_1 Haute-Savoie         <NA>      <NA> Département Department   74
8    FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA>  FRA.1.8_1        Isère         <NA>      <NA> Département Department   38
9    FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA>  FRA.1.9_1        Loire         <NA>      <NA> Département Department   42
10   FRA France FRA.1_1 Auvergne-Rhône-Alpes      <NA> FRA.1.10_1  Puy-de-Dôme         <NA>      <NA> Département Department   63

data_to_insert

   provstate               count
   <chr>                   <int>
 1 Alpes-Maritimes            13
 2 Alsace                     13
 3 Aquitaine                 119
 4 Aude                        1
 5 Auvergne                    1
 6 Auvergne-Rhone-Alpes       10
 7 Bas-Rhin                    5
 8 Bouches-du-Rhone            9
 9 Bourgogne-Franche-Comte     6
10 Brittany                   23

I need to connect datasets that dataframe called all_data will contain new variable called count that is found in data_to_insert dataframe - joint should be based on provstate variable. (provstatet that do not have values in data_to_insert dataframe should have value 0 in new left joined dataframe)

However names are slightly different in both dataset.

I would like to ask for any advice

maybe [fuzzyjoin](https://cran.r-project.org/web/packages/fuzzyjoin/fuzzyjoin.pdf) can be of some help? Also: A [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) would make it way easier to help you than having to download extra files... ;) — dario, Feb 08 '20 at 13:11
I see many city names in France from both data sets. Some cities exist in both data sets. Some others do not. Can you explain what you expect with clear examples? — jazzurro, Feb 08 '20 at 13:36

score 2 · Answer 1 · answered Feb 08 '20 at 13:57

As far as I inspected, it seems to me that you want to change some letters such as è, ô and é. In all_data you have these letters in the city names. But you do not have city names with these letters in data_to_insert (e.g., Côtes-d'Armor vs. Cotes-d'Armor). Hence, I decided to replace these letters having accents with normal alphabets. This will help your join process to some extent. (Or this is may be enough for you.)

library(dplyr)
library(stringi)

mutate(all_data, provstate = stri_trans_general(provstate, "latin-ascii")) -> all_data

left_join(all_data, data_to_insert, by = "provstate")

Left joint dataset with partial string matching [R]

1 Answers1