I have a database of death records from a Brazilian city in the nineteenth century. It contains a lot of information, including the place of death for each record. Unfortunately, the dataset has many inconsistencies because no protocol was followed when the database was built.

I'm trying to standardize the column containing places of death so that I can map the data in GIS software. I've been using grepl() in RStudio, but I've noticed that it doesn't catch every variant. Here is an example:

amachado <- grepl("alvares machado,|alvares machado|rua alvarez machado|Rua Alvares Machado|rua alvaro machado|rua alves machado, no 14", recorte$local, ignore.case = TRUE)

recorte$local[amachado] <- "Rua Álvares Machado"

When I check `recorte`, the entry "rua alves machado, no 14" is not replaced. I copied the text exactly as it appears in the database, but it still isn't matched.

  • 2
    We can't provide assistance without the data in `recorte` or a clearer description of what "doesn't get transformed properly" means. Could you edit your question to provide these data in a reproducible way? For some tips on how to do that, see [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – jpsmith Aug 01 '23 at 18:13
  • 1
    Not entirely sure I understand your question, but I think you should look into using regular expressions. Something like `(?i)((A|Á)\w+ machado)` should match all the strings listed above as well as any other string that contains a word starting with the character "a" or "Á" followed by the word "machado". – Adam Aug 01 '23 at 18:38
  • 1
    Taking the `"rua alves machado, no 14"` example you mention in the text, `grepl(...your regex pattern..., "rua alves machado, no 14", ignore.case = TRUE)` returns `TRUE`, so we can't see the issue. There is a wide range of possible problems. If you could use `dput()` to share a copy/pasteable sample of the data, including the class and structure information of a few rows that have issues, we can try to debug. – Gregor Thomas Aug 01 '23 at 19:10
  • Hi Giulia! Welcome to StackOverflow! – Mark Aug 02 '23 at 08:43
  • As someone who has worked with messy raw data a fair amount, I'm unsure what the question is asking. If the data is as inconsistent as you say it is, it's beyond the scope of a StackOverflow question (it would be a *very very* long regex). If it's just the example, then it isn't an issue, because the pattern already seems to work – Mark Aug 02 '23 at 08:44
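
The comments above indicate that the pattern matches the literal string, so the cause probably lies in the stored values rather than in the regex. One plausible cause is a hidden character such as a non-breaking space. This is only an assumption, since the data were not shared. The sketch below uses a hypothetical vector `local_sample` in place of `recorte$local` and a shortened, hypothetical `pattern` to show how such an entry can be detected and normalised before matching:

```r
# Hypothetical sample standing in for recorte$local (the real data were not shared).
# The third entry uses a non-breaking space (U+00A0) between "alves" and "machado".
local_sample <- c(
  "Rua Alvares Machado",
  "rua alves machado, no 14",
  "rua alves\u00a0machado, no 14",
  "Rua do Ouvidor"
)

# Shortened pattern for illustration; it is not the asker's full pattern.
pattern <- "alvares machado|alvarez machado|alvaro machado|alves machado"

# Entries 2 and 3 look the same when printed, but only entry 2 matches.
hits_raw <- grepl(pattern, local_sample, ignore.case = TRUE)

# Inspect the code points of the suspicious entry; 160 is the non-breaking space.
has_nbsp <- 160L %in% utf8ToInt(local_sample[3])

# dput() prints a copy/pasteable version of the rows to share in a question.
dput(local_sample[2:3])

# Replace non-breaking spaces, collapse runs of whitespace, and trim the ends.
clean <- gsub("\u00a0", " ", local_sample, fixed = TRUE)
clean <- trimws(gsub("[[:space:]]+", " ", clean))

# Match again on the cleaned values and standardise the street name.
hits_clean <- grepl(pattern, clean, ignore.case = TRUE)
clean[hits_clean] <- "Rua Álvares Machado"
```

If the cleaned values match where the raw ones did not, the inconsistency is in the whitespace rather than in the spelling. Other hidden differences, such as accented letters or an ordinal sign in "nº", would need the same kind of inspection.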

0 Answers