1

I have a data set containing state city names (i.e. NY NEW YORK), but some of the location names are not correct. There are instances where the location is misspelled or has missing spaces ( i.e. NY NEWW YORK or NY NEWYORK). I have an excel file with all the correct city names and wondering if there is a way to do some type of look up to that file and make the correction in my R script.

My excel reference file is set up like such:

CurrentOrigin     CorrectOrigin
NY NEW YORK        NY NEW YORK
NY NEWW YORK       NY NEW YORK

I've tried writing an if statement but had no luck.

My R skills are limited, but I am making strides everyday.

if (df$Origin == df2$CurrentOrigin) {
     df2$CorrectOrigin ##not sure what to put in the statement
}

Any help would be great! Thank you in advance!

  • 1
    Try importing the Excel using `openxlsx` or `readxl` package or similar then you'll have a datatable in R to work with. – CT Hall Mar 27 '19 at 17:24
  • I have read the reference file in R, but I am struggling to utilize it. – Mike Backes Mar 27 '19 at 17:33
  • Possible duplicate of [Replace values in a dataframe based on lookup table](https://stackoverflow.com/questions/35636315/replace-values-in-a-dataframe-based-on-lookup-table) – CT Hall Mar 27 '19 at 17:54
  • Why not just replace the wrong values with the correct values? ie `df1$Correct<-df2$Correct`?! – NelsonGon Mar 27 '19 at 18:00
  • 1
    @NelsonGon Because I imagine Mike's data is only structured with one instance of each needed correction. – CT Hall Mar 27 '19 at 18:04

1 Answers1

0

With the tidyverse package, how about a SQL style join

library(tidyverse)
df1 <- tibble('ids' = c('1a', '2b' , '3c'),
              'origin' = c('Ny New York', 'NY NEWYORK', 'NY NEWW York'))

refdf <- tibble('CurrentOrigin'= c('NY NEWYORK', 'Ny New York', 'NY NEWW York'),
                'CorrectOrigin' = rep('NY NEWYORK', 3))

df1corr <- left_join(df1, refdf, by = c('origin' = 'Current Origin'))

df1corr
# A tibble: 3 x 3
  ids   origin       CorrectOrigin
  <chr> <chr>        <chr>        
1 1a    Ny New York  NY NEWYORK   
2 2b    NY NEWYORK   NY NEWYORK   
3 3c    NY NEWW York NY NEWYORK       

left_join joins the two tables, keeping all the rows from left hand table df1. You can then check to see if there are any missing data in CorrectOrigin or whatever else you need.

CT Hall
  • 667
  • 1
  • 6
  • 27