How can I add the country name to a dataset based on city name and population?

Question

I have a dataset containing information on a range of cities, but there is no column which says what country the city is located in. In order to perform the analysis, I need to add an extra column which has the name of the country.

population    city
500,000       Oslo
750,000       Bristol
500,000       Liverpool
1,000,000     Dublin

I expect the output to look like this:

population    city          country
500,000       Oslo          Norway
750,000       Bristol       England
500,000       Liverpool     England
1,000,000     Dublin        Ireland

How can I add a column of country names based on the city and population to a large dataset in R?

I don't really see how this is an R coding question without a file which maps cities to countries - do you have this? — SamR, Jul 13 '22 at 13:12
I dont. The data is not that big that I cant do it manually. F.ex Bristol could appear 5/300 times in the dataset, so I would need a code that could print "England" in the new column on all rows that has "bristol" in the city column — user18723720, Jul 13 '22 at 13:15
Right... but all four cities you have given also exist in the US. Without the data to map them this is not an R coding question. I am going to vote to close this until you have that data but feel free to edit or ask another question once you do. — SamR, Jul 13 '22 at 13:17
There is a country/city API as well that is a bit more extensive than the data in the `maps` package, but this doesn't solve the other issues people have brought up. `cities <- httr::GET("https://countriesnow.space/api/v0.1/countries")` `cities <- httr::content(cities)$data` — DaveArmstrong, Jul 13 '22 at 13:23
I am sorry, but too be more clear: I dont need an automatic fill in. I know the countries the cities are located in. I would need to create an empty column, then fill in all the rows with Bristol with England, and all the rows with Dublin with Ireland f.ex. If thats more understandable? — user18723720, Jul 13 '22 at 13:33
@M-- I am hesitant to start a reopen war, but I am not sure this is a reasonable duplicate target. In this question, the author has a city name *and* a population. In the duplicate, the author has a single column which may contain the country name but may not. I think they are different enough to not be a duplicate. Moreover, I think Tom's solution is the best answer to the duplicate target. If anything, the duplicate closure should be the other way. — Ian Campbell, Jul 13 '22 at 20:55

score 2 · Answer 1 · answered Jul 13 '22 at 14:17

I am adapting Tom Hoel's answer, as suggested by Ian Campbell. If this is selected I am happy to mark it as community wiki.


library(maps)
library(dplyr)
data("world.cities")

df <- readr::read_table("population    city
500,000       Oslo
750,000       Bristol
500,000       Liverpool
1,000,000     Dublin")


df   |>
  inner_join(
    select(world.cities, name, country.etc, pop), 
    by = c("city" = "name")
  )  |> group_by(city)  |> 
  filter(
    abs(pop - population) == min(abs(pop - population))
    )
        
# A tibble: 4 x 4
# Groups:   city [4]
#   population city      country.etc     pop
#        <dbl> <chr>     <chr>         <int>
# 1     500000 Oslo      Norway       821445
# 2     750000 Bristol   UK           432967
# 3     500000 Liverpool UK           468584
# 4    1000000 Dublin    Ireland     1030431

score 1 · Answer 2 · answered Jul 13 '22 at 13:21

As stated by others, the cities exists in other countries too as well.

library(tidyverse)
library(maps)

data("world.cities")

df <- read_table("population    city
500,000       Oslo
750,000       Bristol
500,000       Liverpool
1,000,000     Dublin")

df %>% 
  merge(., world.cities %>%
          select(name, country.etc), 
        by.x = "city", 
        by.y = "name") 

# A tibble: 7 × 3
  city      population country.etc
  <chr>          <dbl> <chr>      
1 Bristol       750000 UK         
2 Bristol       750000 USA        
3 Dublin       1000000 USA        
4 Dublin       1000000 Ireland    
5 Liverpool     500000 UK         
6 Liverpool     500000 Canada     
7 Oslo          500000 Norway

I like this - using this dataset makes it an R question. But if you're going to do it this way then why not add to the end of the pipe `group_by(city) |> filter(abs(pop - population) == min(abs(pop - population)))` ? Then you get one match per city, based on wherever has the closest population. — SamR, Jul 13 '22 at 13:39

score -2 · Answer 3 · answered Jul 13 '22 at 13:16

-2

I think your best bet would be to add a new column in your dataset called country and fill it out, this is part of the CRSIP-DM process data preparation so this is not uncommon. If that does not answer your question please let me know and i will do my best to help.

answered Jul 13 '22 at 13:16

ThatGuy

1
4

Do you have a code that could make this example possible? F.ex Bristol could appear 5/300 times in the dataset, so I would need a code that could print "England" in the new column on all rows that has "bristol" in the city column – user18723720 Jul 13 '22 at 13:18
If you do not want to do it manually then you can just add the column country and add a switch statement where if "bristol" comes up than country == England, and do the same for all the other cities. Its what i would do, but looking at the example above i am confused because you have liverpool > UK and liverpool > canada – ThatGuy Jul 13 '22 at 13:28

How can I add the country name to a dataset based on city name and population?

3 Answers3