0

I have some data in 'untidy' format - with 'age' embedded in the variable name. Using dplyr, I want to create a 'tidy' format dataset in which the keys are datazone, year, and age group, and also where the lower and upper ages within the age group are separate variables.

All of this is fine, except the final step takes much longer than I'd like it to. Is there a faster way of doing this that's still about as 'readable'?

Full reproducible example (using repmis to pull the file)

require(repmis)
require(stringr)
require(tidyr)
require(plyr)
require(dplyr)


persons <- source_DropboxData(
    file="persons.csv",
    key="vcz7qngb44vbynq"
    ) %>% 
    tbl_df() %>%
    select(datazone, year,
           contains("hspeop")
    )

names(persons) <- names(persons) %>%str_replace_all( "GR.hspeop", "count_both_")
persons <- persons %>% gather(age_group, count, -datazone, -year)

persons <- persons %>% mutate(gender="both", age_group=str_replace_all(age_group, "count_both_", ""))
persons$age_group <- persons$age_group %>% revalue(
    c(
        "1619" = "16_19",
        "2024" = "20_24",
        "2529" = "25_29", 
        "3034" = "30_34",
        "3539" = "35_39",
        "4044" = "40_44",
        "4549" = "45_49",
        "5054" = "50_54",
        "5559" = "55_59",
        "6064" = "60_64",
        "6569" = "65_69",
        "7074" = "70_74",
        "7579" = "75_79",
        "8084" = "80_84",
        "85over" = "85_100"
    )
)

# deal with "" separately as revalue can't cope
persons$age_group[nchar(persons$age_group)==0] <- "all"

persons_by_age <- persons %>% filter(grepl("_", age_group)) # this is how to filter by contents of age_group

persons_by_age <- persons_by_age %>% 
    group_by(age_group) %>% 
    mutate(
        lower_age = str_split(age_group, "_")[[1]][1] %>% as.numeric(),
        upper_age = str_split(age_group, "_")[[1]][2] %>% as.numeric()
    )

Obviously I'm creating the same object twice in mutate, so potential for speed doubling there. I also thought that group_by would mean that the operation would only have to be completed once per age group, but it seems to do it for each row. Would summarising by age group, mutating, then joining be a faster approach, for example?

Edit

The code above already creates the output, but much slower than I'd like.

A couple of examples of the final output:

> persons_by_age
Source: local data frame [5,854,500 x 7]

    datazone year age_group count gender lower_age upper_age
1  S01000001 1996         0     8   both         0         0
2  S01000002 1996         0     4   both         0         0
3  S01000003 1996         0    18   both         0         0
4  S01000004 1996         0     4   both         0         0
5  S01000005 1996         0    17   both         0         0
6  S01000006 1996         0     1   both         0         0
7  S01000007 1996         0     9   both         0         0
8  S01000008 1996         0    10   both         0         0
9  S01000009 1996         0     8   both         0         0
10 S01000010 1996         0     9   both         0         0
..       ...  ...       ...   ...    ...       ...       ...
> persons_by_age %>% filter(year==2000 & gender=="male" & lower_age > 30)
Source: local data frame [71,555 x 7]

    datazone year age_group count gender lower_age upper_age
1  S01000001 2000     35_39    34   male        35        39
2  S01000002 2000     35_39    41   male        35        39
3  S01000003 2000     35_39    61   male        35        39
4  S01000004 2000     35_39    43   male        35        39
5  S01000005 2000     35_39    43   male        35        39
6  S01000006 2000     35_39    24   male        35        39
7  S01000007 2000     35_39    34   male        35        39
8  S01000008 2000     35_39    23   male        35        39
9  S01000009 2000     35_39    30   male        35        39
10 S01000010 2000     35_39    37   male        35        39
..       ...  ...       ...   ...    ...       ...       ...
> persons_by_age %>% filter(year==2000 & gender=="female" & lower_age > 30)
Source: local data frame [71,555 x 7]

    datazone year age_group count gender lower_age upper_age
1  S01000001 2000     35_39    37 female        35        39
2  S01000002 2000     35_39    30 female        35        39
3  S01000003 2000     35_39    58 female        35        39
4  S01000004 2000     35_39    46 female        35        39
5  S01000005 2000     35_39    28 female        35        39
6  S01000006 2000     35_39    29 female        35        39
7  S01000007 2000     35_39    33 female        35        39
8  S01000008 2000     35_39    25 female        35        39
9  S01000009 2000     35_39    36 female        35        39
10 S01000010 2000     35_39    38 female        35        39
..       ...  ...       ...   ...    ...       ...       ...
JonMinton
  • 1,239
  • 2
  • 8
  • 26
  • It's much easier to help you if you add a minimal example in the post together with the desired output. – Henrik Feb 05 '15 at 11:19
  • Thanks -I've added some sample outputs. The question's how to produce this much quicker. perhaps tidyr's `separate` function could be the answer. – JonMinton Feb 05 '15 at 11:31
  • As you may have noted, I asked you to provide a _minimal_ example. Few people would consider `local data frame [5,854,500 x 7]` minimal. Please read [**this**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) – Henrik Feb 05 '15 at 11:36
  • Sorry - in this case the size of the dataframe is an important factor as it's the motivation for looking for computationally faster approaches. Using `source_DropboxData` the code should work on any machine. I could create a fake dataset of a similar dimension but that would likely be no shorter in terms of code. Thanks anyway. – JonMinton Feb 05 '15 at 11:50
  • I get your point. But trust me, people a generally much willing to help if you provide _minimal_ sample data set (which then easily can be extended) on which they can try their code. A _self-contained_ example also has a better lifespan than a dropbox link. – Henrik Feb 05 '15 at 11:55
  • 3
    @JonMinton Having a quick look, `separate()` may work here. `separate(persons_by_age, age_group, c("lower", "upper"), sep = "_")` – jazzurro Feb 05 '15 at 12:09

1 Answers1

1

You can try this:

persons_by_age<-persons_by_age %>% 
  group_by(age_group) %>% 
  do(cbind(.,matrix(rep(unlist(strsplit(as.character(.[1,3]), "_")),nrow(.)),ncol=2,byrow=TRUE)))

The . allows you to access the groups in group_by

For each group, the first row of the age_group column (.[1,3]) is split and lower and upper are made into a vector, which is then repeated for as many rows as there is in the group.

The obtained matrix is then bound to the group. It ran in a few seconds.

separate as suggested by @jazzurro is much easier though: separate(persons_by_age, age_group, c("lower", "upper"), sep = "_",remove=FALSE)

NicE
  • 21,165
  • 3
  • 51
  • 68