
I have this problem in R where I have a list of Spanish communities and inside each community there is a list of towns/municipalities.

For example, this is a list of municipalities inside the community of Catalonia.

https://en.wikipedia.org/wiki/Municipalities_of_Catalonia

So Catalonia is one community, and within it there is a list of towns/cities to which I would like to assign a new value, 'Catalonia'.

I have a list of all the municipalities/towns/cities in my dataset and I would like to group them into communities such as Andalusia, Catalonia, Basque Country, Madrid, etc.

Firstly, how can I go about grouping these rows into the list of communities?

For example, el prat de llobregat is a municipality within Catalonia, so I would like to assign it to the region of Catalonia. Getafe is a municipality of Madrid, so I would like to assign it the value Madrid. Alicante is a municipality of Valencia, so I would like to assign it the value Valencia. Etc.

#

That was my first question and if you are able to help with just that, I would be very thankful.

However, my dataset is not that clean. I did my best to remove Spanish accents and unnecessary code identifiers from the municipality names, but it still contains some small errors. For example, castellbisbal is a municipality of Catalonia, but some entries have very small spelling mistakes, e.g. one 'l' instead of two (castelbisbal).

These are small human errors. Is there a way I can work around them?

I was thinking of building a vector of all the correctly spelt names and then renaming the incorrectly spelt names based on a percentage of incorrectness. Could this work? For instance, castellbisbal is 13 characters long and castelbisbal has an error of 1 character, i.e. an error rate of less than 8%. Can I rename values based on an error rate?
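To make the error-rate idea concrete, here is a minimal sketch using base R's `adist()`, which computes the Levenshtein edit distance. The vectors `correct` and `observed` and the `threshold` value are made up for illustration; they are assumptions, not part of the real dataset.

```r
# Hypothetical reference vector of correctly spelt names (illustration only)
correct <- c("castellbisbal", "getafe", "alicante")
# Hypothetical names as they appear in the data, some misspelt
observed <- c("castelbisbal", "getafe", "alicnte")

# Edit distance between every observed name (rows) and every correct name (columns)
d <- adist(observed, correct)

# Error rate: number of edits divided by the length of the correct name
rate <- sweep(d, 2, nchar(correct), "/")

# For each observed name, pick the correct name with the lowest error rate
best <- apply(rate, 1, which.min)
best_rate <- rate[cbind(seq_along(observed), best)]

# Arbitrary cutoff (an assumption); names above it are left as NA for manual review
threshold <- 0.15
renamed <- ifelse(best_rate <= threshold, correct[best], NA_character_)
```

With these made-up inputs, castelbisbal is 1 edit away from the 13-character castellbisbal, so its error rate is 1/13, just under 8%. A suitable threshold would have to be tuned on the real data, since short names tolerate fewer edits.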

Do you have any suggestions on how I can proceed with the second part?

Any tips/suggestions would be great.

user113156
  • Welcome to Stackexchange. Did you do any research on how you could achieve this? People are much more likely to help you if you give code examples on what you already tried and where you failed. A [reproducible example](https://stackoverflow.com/q/5963269/7508461) goes a long way in that direction. – Numb3rs Jul 20 '17 at 12:19
  • Hi, Thank you for the welcome, I am currently doing my research, at the moment I have a big list of towns and cities and I am currently assigning these manually to the 50 or so communities in Spain, once I have finished I will upload the results – user113156 Jul 20 '17 at 12:54
  • Just a small update: I have managed to solve the first part. I was fortunate enough to have numeric identifiers in both datasets, which allowed me to merge communities and municipalities by the unique community ID. I am still stuck on the second part, so any help here would be greatly appreciated. – user113156 Jul 20 '17 at 13:49

1 Answer


As for the spelling errors, have you tried the Soundex algorithm? It was designed for matching names that sound alike despite small spelling differences, and at least two R packages implement it.

```r
library(stringdist)

phonetic("barradas")
#> [1] "B632"
phonetic("baradas")
#> [1] "B632"
```

And the Soundex codes for the same words are the same with package phonics.

```r
library(phonics)

soundex("barradas")
#> [1] "B632"
soundex("baradas")
#> [1] "B632"
```

All you would have to do is compare the Soundex codes, not the words themselves. Note that Soundex was designed for the English language, so it can only handle English-language characters, not accents. But you say you are already taking care of those, so it might work with the words you have to process.
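As a rough sketch of that comparison, assuming you have a reference vector of correctly spelt names (the vectors `correct` and `observed` below are made up for illustration), you could match on the codes with `match()`:

```r
library(stringdist)

# Hypothetical reference names and names as found in the data (illustration only)
correct <- c("castellbisbal", "getafe", "alicante")
observed <- c("castelbisbal", "getafe", "alicante")

# Soundex codes for both vectors (phonetic() uses Soundex by default)
codes_correct <- phonetic(correct)
codes_observed <- phonetic(observed)

# Look up each observed code among the reference codes;
# match() returns the first hit, or NA when no reference name shares the code
idx <- match(codes_observed, codes_correct)
cleaned <- correct[idx]
```

Keep in mind that different names can share a Soundex code, so with a long list of municipalities it is worth checking `codes_correct` for duplicates before trusting the matches.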

Rui Barradas