I have some data in 'untidy' format - with 'age' embedded in the variable name. Using dplyr, I want to create a 'tidy' format dataset in which the keys are datazone, year, and age group, and also where the lower and upper ages within the age group are separate variables.
All of this is fine, except the final step takes much longer than I'd like it to. Is there a faster way of doing this that's still about as 'readable'?
Full reproducible example (using repmis
to pull the file)
require(repmis)
require(stringr)
require(tidyr)
require(plyr)
require(dplyr)
persons <- source_DropboxData(
file="persons.csv",
key="vcz7qngb44vbynq"
) %>%
tbl_df() %>%
select(datazone, year,
contains("hspeop")
)
names(persons) <- names(persons) %>%str_replace_all( "GR.hspeop", "count_both_")
persons <- persons %>% gather(age_group, count, -datazone, -year)
persons <- persons %>% mutate(gender="both", age_group=str_replace_all(age_group, "count_both_", ""))
persons$age_group <- persons$age_group %>% revalue(
c(
"1619" = "16_19",
"2024" = "20_24",
"2529" = "25_29",
"3034" = "30_34",
"3539" = "35_39",
"4044" = "40_44",
"4549" = "45_49",
"5054" = "50_54",
"5559" = "55_59",
"6064" = "60_64",
"6569" = "65_69",
"7074" = "70_74",
"7579" = "75_79",
"8084" = "80_84",
"85over" = "85_100"
)
)
# deal with "" separately as revalue can't cope
persons$age_group[nchar(persons$age_group)==0] <- "all"
persons_by_age <- persons %>% filter(grepl("_", age_group)) # this is how to filter by contents of age_group
persons_by_age <- persons_by_age %>%
group_by(age_group) %>%
mutate(
lower_age = str_split(age_group, "_")[[1]][1] %>% as.numeric(),
upper_age = str_split(age_group, "_")[[1]][2] %>% as.numeric()
)
Obviously I'm creating the same object twice in mutate, so potential for speed doubling there. I also thought that group_by would mean that the operation would only have to be completed once per age group, but it seems to do it for each row. Would summarising by age group, mutating, then joining be a faster approach, for example?
Edit
The code above already creates the output, but much slower than I'd like.
A couple of examples of the final output:
> persons_by_age
Source: local data frame [5,854,500 x 7]
datazone year age_group count gender lower_age upper_age
1 S01000001 1996 0 8 both 0 0
2 S01000002 1996 0 4 both 0 0
3 S01000003 1996 0 18 both 0 0
4 S01000004 1996 0 4 both 0 0
5 S01000005 1996 0 17 both 0 0
6 S01000006 1996 0 1 both 0 0
7 S01000007 1996 0 9 both 0 0
8 S01000008 1996 0 10 both 0 0
9 S01000009 1996 0 8 both 0 0
10 S01000010 1996 0 9 both 0 0
.. ... ... ... ... ... ... ...
> persons_by_age %>% filter(year==2000 & gender=="male" & lower_age > 30)
Source: local data frame [71,555 x 7]
datazone year age_group count gender lower_age upper_age
1 S01000001 2000 35_39 34 male 35 39
2 S01000002 2000 35_39 41 male 35 39
3 S01000003 2000 35_39 61 male 35 39
4 S01000004 2000 35_39 43 male 35 39
5 S01000005 2000 35_39 43 male 35 39
6 S01000006 2000 35_39 24 male 35 39
7 S01000007 2000 35_39 34 male 35 39
8 S01000008 2000 35_39 23 male 35 39
9 S01000009 2000 35_39 30 male 35 39
10 S01000010 2000 35_39 37 male 35 39
.. ... ... ... ... ... ... ...
> persons_by_age %>% filter(year==2000 & gender=="female" & lower_age > 30)
Source: local data frame [71,555 x 7]
datazone year age_group count gender lower_age upper_age
1 S01000001 2000 35_39 37 female 35 39
2 S01000002 2000 35_39 30 female 35 39
3 S01000003 2000 35_39 58 female 35 39
4 S01000004 2000 35_39 46 female 35 39
5 S01000005 2000 35_39 28 female 35 39
6 S01000006 2000 35_39 29 female 35 39
7 S01000007 2000 35_39 33 female 35 39
8 S01000008 2000 35_39 25 female 35 39
9 S01000009 2000 35_39 36 female 35 39
10 S01000010 2000 35_39 38 female 35 39
.. ... ... ... ... ... ... ...