How to select country name and remove special characters from a CSV file in R

Question

I have a data set that looks like this:

How can I select only the country name from this column? As you can see, the words are separated by commas; sometimes the country name is the first word, sometimes the second, and sometimes the third. How can I create another column containing only the country names? The data set also has special characters. Is there a way to remove special characters from a csv file in R? If someone could help me figure this out, I would really appreciate it.

Thank you!

How can you explain to a computer what a country name is? Are you going to supply a list? It seems that some of those rows don't have country names at all. Do you know how those special characters got into your data? Is it possible you are just using the wrong character encoding when you import your data, so the real value is mangled? It's really hard to tell what's going on just by looking at an image. It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input. – MrFlick, Jun 29 '21 at 01:15
This is a csv extract we received, so unfortunately I do not have control over how the data was extracted on their end. – comp_user, Jun 29 '21 at 13:22
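As a rough illustration of the encoding point raised in the comment above, here is a minimal base R sketch. The `location` values are hypothetical stand-ins for one text column of the extract, and the file name in the comment is made up. It assumes that "special characters" means anything outside printable ASCII, which may not match the real data.

```r
# Hypothetical values standing in for one text column of the CSV extract.
location <- c("Bogot\u00e1, Colombia",
              "near Z\u00fcrich, Switzerland",
              "Paris\u00a0, France")

# If the odd characters come from a wrong encoding guess, re-reading the file
# with an explicit encoding may fix them at the source, for example:
#   read.csv("your_file.csv", fileEncoding = "latin1")  # file name is hypothetical

# Otherwise, drop every character outside printable ASCII (space through tilde).
# Note that this deletes accented letters rather than transliterating them.
ascii_only <- gsub("[^\\x20-\\x7E]", "", location, perl = TRUE)
print(ascii_only)
```

Deleting characters loses information (here "Bogotá" becomes "Bogot"), so fixing the encoding at import time is usually worth trying first.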

Zhiqiang Wang · Answer 1 · 2021-06-29T02:02:41.107


To start, you may want to try something like this. First, use the countrycode package to get a list of country names. Then extract whichever of those names appears in each string. To cope with special characters, you may want to try the regular expressions in countrycode::codelist$country.name.en.regex instead.

I just learned the paste(country_list, collapse="|") trick from @akrun's previous post, "Search in character string with list of strings and return match".

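library(tidyverse)    # loads stringr, which provides str_extract()
library(countrycode)  # provides the codelist data frame

df <- data.frame(
  messy_address = c("1 street, district, China", "2 road, city, Australia", "3 Road Canada"))

# English country names, used as the lookup list
country_list <- countrycode::codelist$country.name.en

# Collapse the names into one "A|B|C" pattern and take the first match in each address
df$new_country <- str_extract(df$messy_address, paste(country_list, collapse = "|"))
df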

#> df
#>               messy_address new_country
#> 1 1 street, district, China       China
#> 2   2 road, city, Australia   Australia
#> 3             3 Road Canada      Canada
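One possible refinement of the collapse-with-"|" idea above, sketched in base R only. Two things are assumptions here, not part of the original answer: the short `countries` vector is a hypothetical stand-in for the full countrycode list, and `escape_regex()` is a helper written for this example. It escapes regex metacharacters such as the dot in "St. Lucia" and wraps the alternation in word boundaries, so that "Niger" does not match inside "Nigeria".

```r
# Hypothetical short lookup list; in practice this could be
# countrycode::codelist$country.name.en.
countries <- c("Niger", "Nigeria", "China", "St. Lucia", "Canada")

# Helper written for this sketch: backslash-escape regex metacharacters.
escape_regex <- function(x) gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)

# One alternation pattern, anchored on word boundaries at both ends.
pattern <- paste0("\\b(", paste(escape_regex(countries), collapse = "|"), ")\\b")

# Hypothetical messy values; the last one contains no country at all.
location <- c("Lagos, Nigeria", "1 street, district, China",
              "Castries, St. Lucia", "3 Road Canada", "Unknown field")

# regexpr() returns -1 where nothing matches; keep NA for those rows.
m <- regexpr(pattern, location, perl = TRUE)
country <- rep(NA_character_, length(location))
country[m > 0] <- regmatches(location, m)

print(data.frame(location, country))
```

Rows without a recognisable country stay `NA`, which makes them easy to review by hand afterwards.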

edited Jun 29 '21 at 02:02

answered Jun 29 '21 at 01:24


So I tried exactly what you did, but it gave me an error saying "country_list not found". This is what I did: `plane_crash <- read.csv("C:\\Documents\\plane_crashes_June21.csv")%>% janitor::clean_names()`, `install.packages("countrycode")`, `country_list <- countrycode::codelist$country.name.en`, `new_plane <- str_extract(plane_crash, paste(country_list, collapse="|"))` – comp_user Jun 29 '21 at 22:54
If you provide a small sample of `plane_crash` data and your code, I can try it on my side. – Zhiqiang Wang Jun 29 '21 at 23:31
Could you please tell me how I can share my csv file with you? (Sorry, I am not really sure how to share a file through here.) I appreciate your help! – comp_user Jun 30 '21 at 06:44
You may not need to share the csv file. Read the csv file on your machine and use `dput`, or select a subset of your data to show on SO. Please check the link provided by @MrFlick in the comment above for details. – Zhiqiang Wang Jun 30 '21 at 08:53
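A minimal sketch of the `dput` suggestion, assuming a data frame named `plane_crash` as in the comments. The column name and values below are hypothetical, since the real file is not shown.

```r
# Hypothetical stand-in for the imported data; the real one would come from read.csv().
plane_crash <- data.frame(
  location = c("Near Lagos, Nigeria", "Sydney, Australia", "3 Road Canada", "Unknown"),
  stringsAsFactors = FALSE
)

# Take a few rows and print R code that recreates them exactly.
sample_rows <- head(plane_crash, 3)
dput(sample_rows)

# Anyone can paste that printed text back into R to rebuild the sample:
txt <- paste(capture.output(dput(sample_rows)), collapse = "\n")
rebuilt <- eval(parse(text = txt))
```

The text printed by `dput()` can be pasted straight into a question, so no file needs to be shared.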
