Dealing with spaces when subsetting by excluding a series of strings

Question

I have a dataframe that looks like this:

Author ID     Country Year
A      12345  US      2011
B      13254  Germany 2018
C      54952  Belgium 2005
D      58774  UK      2009
E      88569  Lebanon 2015
...

I want to exclude all countries that are part of the EU and the USA. However, I am having trouble with countries that contain a space, for example Czech Republic and United Kingdom.

I have so far tried using

non_other_countries <- c("Belgium", "Bulgaria", "Denmark", "Germany", "Estonia", "Finland", "France", "Greece", "Ireland", "Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Austria", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden", "Czech Republic", "Hungary", "United Kingdom", "Cyprus", "United States")
other_post_2011 <- other_post_2011_with_id[, setdiff(names(other_post_2011_with_id), non_other_countries)]

and

other_post_2011 <- subset(other_post_2011_with_id, !Country %in% c("Belgium", "Bulgaria", "Denmark", "Germany", "Estonia", "Finland", "France", "Greece", "Ireland", "Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Austria", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", "Spain", "Sweden", "Czech Republic", "Hungary", "United Kingdom", "Cyprus", "United States", "USA"))

However, neither was able to exclude countries whose names contain a space.

For now I have a (IMO) quite ugly workaround: replacing all Czech Republic with Czechia and all United Kingdom with UK by

other_post_2011_with_id$Country[other_post_2011_with_id$Country == "Czech Republic"] <- "Czechia"
other_post_2011_with_id$Country[other_post_2011_with_id$Country == "United Kingdom"] <- "UK"

but I was wondering whether there is a more elegant and more general solution to this. Thank you very much!

score 1 · Accepted Answer · answered Aug 28 '19 at 19:34

I don't know exactly what went wrong with your code, since the data you provided is incomplete, but try the following approach.

head(dat)
#   a id        country year
# 1 a  1 United Kingdom 2006
# 2 b  5  Bouvet Island 2010
# 3 c  8        Hungary 2010
# 4 d 10 Czech Republic 2004
# 5 e 12  Bouvet Island 2001
# 6 f 19 United Kingdom 2004

excl <- c("Czech Republic", "Hungary", "United Kingdom", "Cyprus", 
          "United States")

dat[!dat$country %in% excl, ]
#    a id       country year
# 2  b  5 Bouvet Island 2010
# 5  e 12 Bouvet Island 2001
# 7  g 20      Dominica 2004
# 9  i 32       Namibia 2000
# 10 j 34 Bouvet Island 2011
# 11 k 35 Bouvet Island 2001
# 12 l 52 Bouvet Island 2010
# 13 m 54      Dominica 2005
# 14 n 56       Namibia 2000
# 17 q 77 Bouvet Island 2001
# 18 r 79         Qatar 2011
# 19 s 82 Bouvet Island 2002

Data

dat <- structure(list(a = structure(1:20, .Label = c("a", "b", "c", 
"d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", 
"q", "r", "s", "t"), class = "factor"), id = c(1L, 5L, 8L, 10L, 
12L, 19L, 20L, 31L, 32L, 34L, 35L, 52L, 54L, 56L, 61L, 67L, 77L, 
79L, 82L, 90L), country = structure(c(8L, 1L, 5L, 3L, 1L, 8L, 
4L, 2L, 6L, 1L, 1L, 1L, 4L, 6L, 5L, 2L, 1L, 7L, 1L, 3L), .Label = c("Bouvet Island", 
"Cyprus", "Czech Republic", "Dominica", "Hungary", "Namibia", 
"Qatar", "United Kingdom"), class = "factor"), year = c(2006L, 
2010L, 2010L, 2004L, 2001L, 2004L, 2004L, 2009L, 2000L, 2011L, 
2001L, 2010L, 2005L, 2000L, 2001L, 2006L, 2001L, 2011L, 2002L, 
2003L)), class = "data.frame", row.names = c(NA, -20L))
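
If rows are still not excluded, it may help to check which entries of the exclusion list never occur in the data at all, for example because of a misspelling. The following is only a minimal sketch on made-up toy data; the names `toy`, `excl_toy`, `kept`, `kept2` and `unmatched` are hypothetical and not taken from the question.

```r
# Toy data (hypothetical) with country names that contain spaces
toy <- data.frame(
  country = c("United Kingdom", "Bouvet Island", "Czech Republic", "Qatar"),
  year    = c(2006L, 2010L, 2004L, 2011L),
  stringsAsFactors = FALSE
)

# Exclusion list; "Demnark" is deliberately misspelled for illustration
excl_toy <- c("Czech Republic", "United Kingdom", "Demnark")

# Row filter: the trailing comma selects rows rather than columns
kept <- toy[!toy$country %in% excl_toy, ]

# Equivalent filter written with subset()
kept2 <- subset(toy, !country %in% excl_toy)

# Entries of the exclusion list that never occur in the data
unmatched <- setdiff(excl_toy, unique(toy$country))
```

Here `unmatched` contains only the misspelled entry, while names with spaces are matched without any special handling. Note also that `names()` of a data frame returns its column names, so indexing with `setdiff(names(df), ...)` selects columns and cannot filter rows by country.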

Thank you! I am still a bit unsure about how to share data here. I know of `dput`, of course, but my file has roughly 10,000 rows... How would you go about sharing something this large on Stack Overflow? — P.Weyh, Aug 29 '19 at 12:02
@P.Weyh You're welcome, and thanks for asking this. Usually there is no need to share a huge dataset to demonstrate a coding problem; it is better to provide a [mcve]. E.g., what I did in this answer can easily be scaled up to a dataset of any size. Often it's useful to create some toy data (I also do this very often just for myself, to isolate a problem and solve it more easily). To learn how to do that, you may want to read through this: https://stackoverflow.com/a/5963610/6574038 — jay.sf, Aug 29 '19 at 12:32

score 1 · Answer 2 · answered Aug 29 '19 at 07:44

A solution a little more elegant than the one you are proposing:

You could replace the white space with an underscore before running your code:

df$Country <- gsub(" ", "_", df$Country)

then run your code

and undo the replacement:

df$Country <- gsub("_", " ", df$Country)

However, the white space is unlikely to be the reason for your issue. Try excluding the countries you want with:

df <- df[!(df$Country %in% c("Country1", "Country2", "Country3")), ]

White space in the character strings should not affect the result, as long as it is used consistently. This is just an assumption, but some country names in your data may contain more than one white space. As "United States" (one space) and "United  States" (two spaces) are often difficult to distinguish, it is always recommended to use "_".
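
If inconsistent spacing really is the cause, an alternative to underscores is to normalize the white space before matching. This is only a sketch under that assumption, using made-up values; `country_raw`, `country_clean` and `excl_ws` are hypothetical names.

```r
# Hypothetical values with a double space, a leading space and a trailing space
country_raw <- c("United  States", " Czech Republic", "United Kingdom ", "Qatar")

# Collapse runs of white space to a single space, then trim both ends
country_clean <- trimws(gsub("\\s+", " ", country_raw))

excl_ws <- c("United States", "Czech Republic", "United Kingdom")

# Against the raw values nothing matches, so every row would be kept
keep_raw <- !country_raw %in% excl_ws

# After cleaning, the three listed countries are excluded
keep_clean <- !country_clean %in% excl_ws
```

Comparing `keep_raw` with `keep_clean` shows whether stray white space was what prevented the match.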

Hope this helps!
