Counting the number of changes of a categorical variable during repeated measurements within a category

Question

I'm working with a dataset about migration across the country with the following columns:

i   birth   gender  race    region  urban   wage    year  educ
1   58      2        3      1       1       4620    1979   12
1   58      2        3      1       1       4620    1980   12
1   58      2        3      2       1       4620    1981   12
1   58      2        3      2       1       4700    1982   12

.....

i   birth   gender  race    region  urban   wage    year  educ
45   65      2        3      3       1      NA       1979   10
45   65      2        3      3       1      NA       1980   10
45   65      2        3      4       2      11500    1981   10
45   65      2        3      1       1      11500    1982   10

i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variables, 1-4) , 'urban' (dummy), 'wage' and 'educ'.

How do I count the aggregate number of times 'region' or 'urban' has changed (eg: from region 1 to region 3 or from urban 0 to 1) during the observation period (25 year period) within each subject? I also have some NA's in the data (which should be ignored)

A simplified version of expected output:

i  changes in region
1   1
...
45  2

i  changes in urban
1   0
...
45  2

I would then like to sum up the number of changes for region and urban.

I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.

Here's a part of the data for i=4.

i   birth gender    race    region  urban   wage    year    educ
4   62      2        3        1      1       NA     1979    9
4   62      2        3        NA     NA      NA     1980    9
4   62      2        3        4      1       0      1981    9
4   62      2        3        4      1       1086   1982    9
4   62      2        3        1      1       70     1983    9
4   62      2        3        1      1       0      1984    9
4   62      2        3        1      1       0      1985    9
4   62      2        3        1      1       7000   1986    9
4   62      2        3        1      1      17500   1987    9
4   62      2        3        1      1      21320   1988    9
4   62      2        3        1      1      21760   1989    9
4   62      2        3        1      1         0    1990    9
4   62      2        3        1      1         0    1991    9
4   62      2        3        1      1      30500   1992    9
4   62      2        3        1      1      33000   1993    9
4   62      2        3       NA     NA        NA    1994    9
4   62      2        3        4      1      35000   1996    9

Here, output should be:

i change_reg   change_urban
4  3            0

Welcome to stackoverflow! Your question is unclear, please read and edit your question according to [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) so that other users can help you. Also, add expected output. — pogibas, Nov 16 '19 at 17:31
Please be specific in your expected output. Instead of mentioning "summary stats," provide the exact values of these aggregate counts for changes in region or urban based on your example data. The more detailed and explanatory your example can be, the better. — Ben, Nov 16 '19 at 18:20
I see that change in urban is 1 for i=45, though it looks like 2 changes (from 1 to 2, then from 2 to 1). Are you looking for number of unique values for region and urban since person was in urban of 1 previously? Or was this a mistake? — Ben, Nov 16 '19 at 20:44
@Ben whoops that was a mistake. should be 2. Not looking for unique changes, will edit — m_j, Nov 16 '19 at 20:46

Ben · Accepted Answer · 2019-11-16T22:39:53.360

0

Here is something I hope will get your closer to what you need.

First you group by i. Then, you can then create a column that will indicate a 1 for each change in region. This compares the current value for the region with the previous value (using lag). Note if the previous value is NA (when looking at the first value for a given i), it will be considered no change.

Same approach is taken for urban. Then, summarize totaling up all the changes for each i. I left in these temporary variables so you can examine if you are getting the results desired.

Edit: If you wish to remove rows that have NA for region or urban you can add drop_na first.

library(dplyr)
library(tidyr)

df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban = sum(urban_change))

# A tibble: 3 x 3
      i tot_region tot_urban
  <int>      <dbl>     <dbl>
1     1          1         0
2     4          3         0
3    45          2         2

Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)

colSums(df_tot[-1])

tot_region  tot_urban 
         6          2

edited Nov 16 '19 at 22:39

answered Nov 16 '19 at 20:51

Ben

28,684
5
23
45

Thanks, it mostly worked, but the NA's are still messing with my output # A tibble: 12,686 x 3 i tot_region tot_urban 1 1 NA NA 2 2 0 0 3 3 1 0 4 4 NA NA 5 5 NA NA 6 6 2 2 7 7 NA NA 8 8 0 2 9 9 0 0 10 10 NA NA # … with 12,676 more rows I'll have to figure out how to deal with the NA's – m_j Nov 16 '19 at 21:08
You could use `drop_na` from `tidyr` before looking at changes - would that handle what you need? See above edited answer. – Ben Nov 16 '19 at 21:30
1

That worked, thanks! drop_na is very useful - will keep that in mind for future purposes. – m_j Nov 16 '19 at 21:36
One last question... how do I sum the counts of changes for 'tot_region' and 'tot_urban'? – m_j Nov 16 '19 at 22:29
is that a grand total? so for example, tot_region_total would be 6 and tot_urban_total would be 2 in the example that only includes i of 1, 4, and 45? – Ben Nov 16 '19 at 22:31
Yep, a grand total. But for i of 12,000+ – m_j Nov 16 '19 at 22:33
see additional edits - you can use `colSums` to get sums of different selected columns. you could also use other options including `tidyverse` but this might be easiest. – Ben Nov 16 '19 at 22:40

Counting the number of changes of a categorical variable during repeated measurements within a category

1 Answers1

Linked