
I have a dataset in an Excel spreadsheet where the row names are duplicates or near-duplicates, as shown in the image below. Is there a 'tidy' way (in R) of cleaning this? Even a partial solution, such as merging rows whose names differ only in case (upper, lower, sentence), would be much appreciated.

[Image of a few rows of the dataset showing the duplicate city names]

I am familiar with `clean_names` (e.g. https://rpubs.com/jenrichmond/clean_names), but I was not able to find a function that does this specific job. I want to create another data frame with just one row for each unique city name.
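For the case-only duplicates, a minimal sketch with dplyr (>= 1.1.0 for `.by`). The column names `city` and `count`, and the example values, are assumptions for illustration since the real spreadsheet isn't shown:

```r
library(dplyr)

# Hypothetical stand-in for the spreadsheet data
cities <- data.frame(
  city  = c("Abilene", "ABILENE", "abilene", "Dallas", "DALLAS"),
  count = c(2, 3, 1, 5, 4)
)

merged <- cities |>
  mutate(city = tolower(city)) |>            # normalize case
  summarize(count = sum(count), .by = city)  # one row per unique name
```

This collapses `Abilene`/`ABILENE`/`abilene` into a single row whose count is the sum of the originals; it will not catch spelling variants.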

Phil
  • Different cases are easy: convert them all to the same case and do a grouped sum: `your_data |> rownames_to_column("name") |> mutate(name = tolower(name)) |> summarize(Count = sum(Count), .by = name)`. Correcting typos like `ABILENE` vs `ABLIENE` is quite a bit more difficult; it's called "fuzzy deduplication" and [this is a pretty good place to start](https://stackoverflow.com/q/74368289/903061) / potential duplicate for that part of the question. – Gregor Thomas Sep 03 '23 at 02:47
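The fuzzy-deduplication step mentioned in the comment can be sketched in base R alone: compute pairwise edit distances with `adist()` and group names that fall within a small distance via hierarchical clustering. The vector of names and the distance threshold `h = 2` are illustrative assumptions, not part of the question's data:

```r
# Hypothetical city names, including two typo variants
cities <- c("abilene", "abliene", "dallas", "dalas", "austin")

d <- adist(cities)                             # pairwise Levenshtein distances
clusters <- cutree(hclust(as.dist(d)), h = 2)  # merge names within distance 2

# Map every variant to the first member of its cluster
canonical <- ave(cities, clusters, FUN = function(x) x[1])
```

With a canonical name per cluster, the same `tolower()` + grouped-sum pipeline can then collapse the near-duplicate rows. The threshold needs tuning: too high and distinct cities merge, too low and typos survive.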
