I have a dataframe with a single address column. Some addresses are misspelled, so variants of the same address are counted as separate unique values. I want to identify similar addresses and calculate how often each one occurs.
I need a new dataframe with the following columns:
Address, and the number of similar occurrences in the vector
Thanks
# Install (if missing) and load the required package
if (!requireNamespace("stringdist", quietly = TRUE)) install.packages("stringdist")
library(stringdist)
# Access the "Msl_10" column from the dataframe
data <- as.character(Islamabad_Msl_Linelist$Msl_10)
# Define a threshold for string similarity (maximum edit distance)
threshold <- 2
# Initialize the frequency table: one zero count per distinct address
# (table() with no arguments is an error, so use a plain integer vector)
addresses <- unique(data[!is.na(data)])
freq_table <- integer(length(addresses))
# Iterate over each distinct address
for (k in seq_along(addresses)) {
  current_string <- addresses[k]
  # Compute the string distance from the current address to every row at once
  distance <- stringdist::stringdist(current_string, data)
  # Count the rows within the threshold (exact matches included, NAs ignored)
  freq_table[k] <- sum(distance <= threshold, na.rm = TRUE)
}
# Build the result dataframe and print it
result <- data.frame(Address = addresses, Similar_Count = freq_table)
print(result)
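One other idea I am considering is to cluster the addresses, so that each group of variants becomes a single row instead of every spelling getting its own count. The following is only a minimal sketch under assumptions: the addresses are made-up stand-ins for my Msl_10 column, it uses base R's adist() (Levenshtein distance) rather than stringdist, and single-linkage clustering cut at the threshold is my own guess at a reasonable grouping rule.

```r
# Toy addresses standing in for Islamabad_Msl_Linelist$Msl_10 (made up for illustration)
addresses <- c("House 12 Street 5 G-9", "House 12 Streat 5 G-9",
               "House 12 Street 5 G9", "Flat 3 Block B F-10",
               "Flat 3 Blok B F-10")
threshold <- 2

# Pairwise Levenshtein distances between all addresses (base R, utils::adist)
d <- adist(addresses)

# Single-linkage clustering cut at the threshold: addresses linked by a chain
# of pairs that are each within `threshold` edits end up in the same group
groups <- cutree(hclust(as.dist(d), method = "single"), h = threshold)

# Label each group with its most frequent spelling (ties: first in sort order)
label <- tapply(addresses, groups, function(x) names(which.max(table(x))))

# One row per group: representative address and number of occurrences
result <- data.frame(Address = as.vector(label),
                     Similar_Count = as.vector(table(groups)))
print(result)
```

With single linkage, two quite different addresses can still be merged if a chain of small edits connects them, so the threshold probably needs checking against real data.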