
I have a dataframe containing one address column. Some addresses are misspelled, so the same address is counted as several unique values. I want to identify similar addresses and calculate their frequencies.

I need a new dataframe with the following columns: the address, and the number of similar occurrences in the vector.

Thanks

# Install and load the required package

install.packages("stringdist")
library(stringdist)

# Access the "Msl_10" column from the dataframe

data <- Islamabad_Msl_Linelist$Msl_10

# Define a threshold for string similarity

threshold <- 2

# Initialize an empty frequency table

freq_table <- table()

# Iterate over each element in the dataset

for (i in 1:length(data)) {
  current_string <- data[i]

  # Iterate over each element again to compare with the current string

  for (j in (i+1):length(data)) {
    comparison_string <- data[j]

    # Compute the string distance
    distance <- stringdist::stringdist(current_string, comparison_string)

    # Check if the distance is below the threshold
    if (distance <= threshold) {
      # Update the frequency table
      freq_table[[current_string]] <- freq_table[[current_string]] + 1
    }

  }
}

# Print the frequency table

print(freq_table)
    Please understand that since we don't have your sample data, you are very unlikely to get accurate or substantive help. See https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info for suggested uses of `dput`, `data.frame`, or `read.table` to share a small sample of data, and please use the same to provide your expected output given that input. – r2evans Jun 26 '23 at 15:07

1 Answer


Your use of table() is quite unusual.

table is a function that calculates frequencies for you. It is not meant to store data.

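If you replace table() with numeric(length(data)), and index it by position (freq_table[i]) instead of by string (freq_table[[current_string]]), your code should output a vector with the number of times each string was close enough to a later one. You also need to stop the outer loop at length(data) - 1, because (i+1):length(data) counts backwards once i reaches length(data).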
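A minimal sketch of the loop with a numeric vector as the counter. Several things here are assumptions rather than part of the question: the `addresses` values are made up, the counter is indexed by position `i`, and the inner loop skips `i == j` instead of starting at `i + 1`, so each address is compared against every other one.

```r
library(stringdist)

# Hypothetical sample data standing in for Islamabad_Msl_Linelist$Msl_10
addresses <- c("G-9/4 Islamabad", "G-9/4 Islamabd",
               "F-10 Markaz", "F-10 Markz", "Bhara Kahu")
threshold <- 2

# One counter per element, indexed by position rather than by string
freq <- numeric(length(addresses))

for (i in seq_along(addresses)) {
  for (j in seq_along(addresses)) {
    if (i == j) next  # do not compare a string with itself
    if (stringdist(addresses[i], addresses[j]) <= threshold) {
      freq[i] <- freq[i] + 1
    }
  }
}

freq
```

Using seq_along() also avoids the backwards-counting problem that 1:length(x) style ranges have at the edges.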

In other news, you should take a look at vectorization. Many functions in R are made to work on whole vectors, without requiring you to loop over each element. Explicit loops in R are often slow, and are best avoided when a vectorized alternative exists.

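A quick look at the help pages, plus some vectorization experience, suggests that rowSums(stringdistmatrix(data, data) <= threshold) replaces your loops entirely. Note that it has to be stringdistmatrix: stringdist(data, data) compares the two vectors element by element, so it would return only zeros. Each count also includes the string's match with itself, so subtract 1 if you only want the other addresses.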
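As a sketch of how the vectorized approach could produce the data frame asked for in the question: the `addresses` values and the column names `Address` and `Similar` below are hypothetical, and the pairwise matrix comes from stringdistmatrix() in the stringdist package with its default distance method.

```r
library(stringdist)

# Hypothetical sample data; swap in Islamabad_Msl_Linelist$Msl_10
addresses <- c("G-9/4 Islamabad", "G-9/4 Islamabd",
               "F-10 Markaz", "F-10 Markz", "Bhara Kahu")
threshold <- 2

# Pairwise distances: d[i, j] is the distance between address i and address j
d <- stringdistmatrix(addresses, addresses)

# Count entries within the threshold per row; subtract 1 to drop the
# self-match on the diagonal (distance 0)
similar <- rowSums(d <= threshold) - 1

result <- data.frame(Address = addresses, Similar = similar)
result
```

The matrix has length(addresses)^2 entries, so for a very long column this may need to be done in chunks or on the unique addresses only.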

JMenezes