1

I'm sure I'm not using/understanding some of the vector features of R. This code works, but as you can probably guess it's incredibly slow.

The code is for identifying cities, states, postal codes and/or country names in 3 different columns of dirty addresses and parsing them into separate city/state/country/zip columns. So I've got the geonames dataset of world cities and I'm using grepl to slowly and painfully iterate through several columns of addresses and several columns of geonames data.

I know R is built specifically for dealing with large datasets, and that iterating is probably very very very wrong, but I'm not sure how else to implement something like this. After you're done laughing at this monstrosity I've created, some advice on how to fix it would be appreciated.

addresses <- c('addressColumn1', 'addressColumn2', 'addressColumn3')
matches = 0

#loop through each row of the dirty dataset
for(i in 1:nrow(dirty_data) {

  #Used to move on to next row if a match is found
  stopLooking = FALSE

  #loop through each of the address columns in the dirty dataset
  for(addr in addresses){

    matches=0

    #Loop through each row of the geonames data
    for(z in 1:nrow(geonames)) {

      #Skip iteration if a match has been found
      if(stopLooking)
        break

      #For each data point in geonames such as city/state/country/zip
      for(place in names(geonames)) {

        #Quick and dirty handling of NA values from grepl()
        if(is.na(grepl(pattern=toupper(geonames[z, place]), x=toupper(dirty_data[i ,addr]), fixed=TRUE)))
          skip = TRUE

        #if city or state or zip or country name/code is found in address
        if(grepl(pattern=toupper(geonames[z, place]), x=toupper(dirty_data[i ,addr]), fixed=TRUE) && !skip) {
          matches = matches + 1
        }

        #If all the geonames columns have been checked, determine if match found
        if(place == names(geonames)[length(names(geonames))]) {
          if(matches > 1) {
            print(paste('Match found z=', z, ' i=', i, ' addr=', addr, sep=''))
            dirty_data[i, paste(addr, 'address_city', sep='_')] <- geonames[z, 'place_name']
            dirty_data[i, paste(addr, 'address_state', sep='_')] <- geonames[z, 'admin_name1']
            dirty_data[i, paste(addr, 'address_postal_code', sep='_')] <- geonames[z, 'postal_code']
            dirty_data[i, paste(addr, 'address_country', sep='_')] <- geonames[z, 'country_name']
            stopLooking = TRUE
          }
        }
      }
    }
  }
}

Example geonames data:

country_code <- c('US', 'US', 'US', 'US', 'US', 'US')
postal_code <- c(99553, 99571, 99583, 99612, 99661, 99546)
place_name <- c('Akutan', 'Cold Bay', 'False Pass', 'King Cove', 'Sand Point', 'Adak')
admin_code1 <- c('AK', 'AK', 'AK', 'AK', 'AK', 'AK')
admin_name1 <- c('Alaska', 'Alaska', 'Alaska', 'Alaska', 'Alaska', 'Alaska')
country_name <- c('United States', 'United States', 'United States', 'United States', 'United States', 'United States')

geonames <- data.frame(country_code, postal_code, place_name, admin_code1, admin_name1, country_name)

Example dirty data:

addressColumn1 <- c('1 Salmon Ln, Akutan, AK 99553', 'Sandon Rd, Stone ST15 8XS, UK',',,,,,,London')
addressColumn2 <- c(' Mexico', 'Holger-Börner-Platz 1, 34119 Kassel, Germany', 'Atlanta, GA')
addressColumn3 <- c('London', 'Iceland', '123 Madeup Street, Minneapolis')

dirty_data <- data.frame(addressColumn1, addressColumn2, addressColumn3)

Example output:

The code above will work and produce the desired output, which is filling in columns in the existing dataframe. I've added more detail below.

Filling in the columns at the bottom of the last for loop (they output column name + _address_city etc) with matching info

For example, dirty_data[1, 'addressColumn1] == '1 Salmon Ln, Akutan, AK 99553' which would match geonames[1, 'postal_code] == 99553, geonames[1, 'place_name'] == 'Akutan', and geonames[1, 'admin_code1'] == 'AK'.

Since that is 2 or more matches, it would fill in the columns (the column names are created using the paste command, shown here how they actually output for simplicity)

dirty_data[1, 'addressColumn1_address_postal_code'] <- 99553 
dirty_data[1, 'addressColumn1_address_place_name'] <- 'Akutan' 
dirty_data[1, 'addressColumn1_address_admin_code1'] <- 'AK'
dirty_data[1, 'addressColumn1_address_admin_name1'] <- 'Alaska'
dirty_data[1, 'addressColumn1_address_country_name'] <- 'United States'
jamzsabb
  • 1,125
  • 2
  • 18
  • 40
  • I'm sure there would be a better and faster way. But, can you provide few rows of input data and the expected output? You can refer to: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – YOLO Mar 16 '18 at 20:14
  • can you just make some fake dirty data that is not your actual data but suffices to illustrate the problem? It is hard to imagine code fully formed, it requires testing, and without input data we can't do that. – Calum You Mar 16 '18 at 20:38
  • You're right, I updated the sample data with simple copy/pasteable code – jamzsabb Mar 16 '18 at 21:04
  • Ok and what is the desired output? Please make a reproducible example... did you read the link from Manish? What is `addr` in `x=toupper(dirty_data[i ,addr])`? Is each cell in `dirty_data` an `addr`? [See here](http://idownvotedbecau.se/nomcve/) for guides on why we need minimal, complete and verifiable examples to be able to answer questions without constant back and forth in comments. – Calum You Mar 16 '18 at 22:02
  • @CalumYou My desired output is a little confusing since its just filling in columns, but I've done my best to explain what I'm looking for. I know I gotta help you guys help me, just unsure how to illustrate some of this. Thanks for any help though, I'm really curiuos what I'm missing here as this is probably the worst and most inefficient code I've ever written. – jamzsabb Mar 19 '18 at 14:13
  • 1
    It's really not a monstrosity. Room for improvement, and probably too slow, but hardly monstrous. – Hugh Mar 19 '18 at 14:14

0 Answers0