I'm sure I'm not using/understanding some of the vector features of R. This code works, but as you can probably guess it's incredibly slow.
The code is for identifying cities, states, postal codes and/or country names in 3 different columns of dirty addresses and parsing them into separate city/state/country/zip columns. So I've got the geonames dataset of world cities and I'm using grepl to slowly and painfully iterate through several columns of addresses and several columns of geonames data.
I know R is built specifically for dealing with large datasets, and that iterating is probably very very very wrong, but I'm not sure how else to implement something like this. After you're done laughing at this monstrosity I've created, some advice on how to fix it would be appreciated.
addresses <- c('addressColumn1', 'addressColumn2', 'addressColumn3')
matches = 0
#loop through each row of the dirty dataset
for(i in 1:nrow(dirty_data) {
#Used to move on to next row if a match is found
stopLooking = FALSE
#loop through each of the address columns in the dirty dataset
for(addr in addresses){
matches=0
#Loop through each row of the geonames data
for(z in 1:nrow(geonames)) {
#Skip iteration if a match has been found
if(stopLooking)
break
#For each data point in geonames such as city/state/country/zip
for(place in names(geonames)) {
#Quick and dirty handling of NA values from grepl()
if(is.na(grepl(pattern=toupper(geonames[z, place]), x=toupper(dirty_data[i ,addr]), fixed=TRUE)))
skip = TRUE
#if city or state or zip or country name/code is found in address
if(grepl(pattern=toupper(geonames[z, place]), x=toupper(dirty_data[i ,addr]), fixed=TRUE) && !skip) {
matches = matches + 1
}
#If all the geonames columns have been checked, determine if match found
if(place == names(geonames)[length(names(geonames))]) {
if(matches > 1) {
print(paste('Match found z=', z, ' i=', i, ' addr=', addr, sep=''))
dirty_data[i, paste(addr, 'address_city', sep='_')] <- geonames[z, 'place_name']
dirty_data[i, paste(addr, 'address_state', sep='_')] <- geonames[z, 'admin_name1']
dirty_data[i, paste(addr, 'address_postal_code', sep='_')] <- geonames[z, 'postal_code']
dirty_data[i, paste(addr, 'address_country', sep='_')] <- geonames[z, 'country_name']
stopLooking = TRUE
}
}
}
}
}
}
Example geonames data:
country_code <- c('US', 'US', 'US', 'US', 'US', 'US')
postal_code <- c(99553, 99571, 99583, 99612, 99661, 99546)
place_name <- c('Akutan', 'Cold Bay', 'False Pass', 'King Cove', 'Sand Point', 'Adak')
admin_code1 <- c('AK', 'AK', 'AK', 'AK', 'AK', 'AK')
admin_name1 <- c('Alaska', 'Alaska', 'Alaska', 'Alaska', 'Alaska', 'Alaska')
country_name <- c('United States', 'United States', 'United States', 'United States', 'United States', 'United States')
geonames <- data.frame(country_code, postal_code, place_name, admin_code1, admin_name1, country_name)
Example dirty data:
addressColumn1 <- c('1 Salmon Ln, Akutan, AK 99553', 'Sandon Rd, Stone ST15 8XS, UK',',,,,,,London')
addressColumn2 <- c(' Mexico', 'Holger-Börner-Platz 1, 34119 Kassel, Germany', 'Atlanta, GA')
addressColumn3 <- c('London', 'Iceland', '123 Madeup Street, Minneapolis')
dirty_data <- data.frame(addressColumn1, addressColumn2, addressColumn3)
Example output:
The code above will work and produce the desired output, which is filling in columns in the existing dataframe. I've added more detail below.
Filling in the columns at the bottom of the last for loop (they output column name + _address_city etc) with matching info
For example, dirty_data[1, 'addressColumn1] == '1 Salmon Ln, Akutan, AK 99553'
which would match geonames[1, 'postal_code] == 99553
, geonames[1, 'place_name'] == 'Akutan'
, and geonames[1, 'admin_code1'] == 'AK'
.
Since that is 2 or more matches, it would fill in the columns (the column names are created using the paste command, shown here how they actually output for simplicity)
dirty_data[1, 'addressColumn1_address_postal_code'] <- 99553
dirty_data[1, 'addressColumn1_address_place_name'] <- 'Akutan'
dirty_data[1, 'addressColumn1_address_admin_code1'] <- 'AK'
dirty_data[1, 'addressColumn1_address_admin_name1'] <- 'Alaska'
dirty_data[1, 'addressColumn1_address_country_name'] <- 'United States'