2

I am working with census data in a region. I have the total, urban, and rural populations of each municipality in the region, identified by a municipality code. In my data frame, the three population types are "stacked" in one population variable, such that rows containing municipality codes store total populations but not urban or rural ones:

location population mun_code
municipality_1 10000 1
urban 6000 1
rural 4000 1
municipality_2 15000 2
urban 10000 2
rural 5000 2

I want to construct a second data frame with one variable for each population type:

location total_pop urban_pop rural_pop mun_code
municipality_1 10000 6000 4000 1
municipality_2 15000 10000 5000 2

I have tried using pivot_wider() from package tidyr, but can't seem to get the table to look the way I want.

Mikael Jagan
  • 9,012
  • 2
  • 17
  • 48

3 Answers3

2

We could create a new column based on the 'municipality' substring in the location and reshape to 'wide' format with pivot_wider

library(dplyr)
library(tidyr)
library(stringr)
df1 %>% 
  mutate(grp = case_when(str_detect(location, 'municipality') ~'total_pop', 
  TRUE ~ str_c(location, '_pop')), 
  location = case_when(grp == 'total_pop' ~ location)) %>% 
  fill(location) %>%
  pivot_wider(names_from = grp, values_from = population)

-output

# A tibble: 2 × 5
  location       mun_code total_pop urban_pop rural_pop
  <chr>             <int>     <int>     <int>     <int>
1 municipality_1        1     10000      6000      4000
2 municipality_2        2     15000     10000      5000

data

df1 <-structure(list(location = c("municipality_1", "urban", "rural", 
"municipality_2", "urban", "rural"), population = c(10000L, 6000L, 
4000L, 15000L, 10000L, 5000L), mun_code = c(1L, 1L, 1L, 2L, 2L, 
2L)), class = "data.frame", row.names = c(NA, -6L))
akrun
  • 874,273
  • 37
  • 540
  • 662
0

Using replace and paste to get the names before pivot_wider

library(dplyr)
library(tidyr)

df %>% 
  mutate(loc = paste0(replace(location, grepl("municipality", location), "total"), "_pop"),
         location = paste0("municipality_", mun_code)) %>% 
  pivot_wider(names_from = loc, values_from = population)
# A tibble: 2 × 5
  location       mun_code total_pop urban_pop rural_pop
  <chr>             <int>     <int>     <int>     <int>
1 municipality_1        1     10000      6000      4000
2 municipality_2        2     15000     10000      5000
Andre Wildberg
  • 12,344
  • 3
  • 12
  • 29
0

Your problem is, your long format data isn't coded properly: 1. the municipality_ levels in location shouldn't be named differently depending on location, call them total as they are already identified in mun_code, and 2. if you want mun_code also as a factor, then make one.

muns <- grep('municipality_', dat$location)
dat$location[muns] <- 'total'

dat$municipality <- factor(dat$mun_code, labels=paste0('municipality_', unique(dat$mun_code)))

dat
#   location population mun_code   municipality
# 1    total      10000        1 municipality_1
# 2    urban       6000        1 municipality_1
# 3    rural       4000        1 municipality_1
# 4    total      15000        2 municipality_2
# 5    urban      10000        2 municipality_2
# 6    rural       5000        2 municipality_2

Then it's just

reshape(dat, idvar='mun_code', timevar='location', v.names='population', direction='wide')
#   mun_code   municipality population.total population.urban population.rural
# 1        1 municipality_1            10000             6000             4000
# 4        2 municipality_2            15000            10000             5000

or whatever fancy package you want to use to reshape your data.


Data:

dat <- structure(list(location = c("municipality_1", "urban", "rural", 
"municipality_2", "urban", "rural"), population = c(10000L, 6000L, 
4000L, 15000L, 10000L, 5000L), mun_code = c(1L, 1L, 1L, 2L, 2L, 
2L)), class = "data.frame", row.names = c(NA, -6L))
jay.sf
  • 60,139
  • 8
  • 53
  • 110