0

I am new to R and perhaps my question is very silly. First of all, I would like to describe my data and then the problem.

I have (unbalanced) panel of monthly household consumption data from Jan 2000 to Dec 2010. In Jan 2005, consumption tax increased from 7% to 10%. At this moment, I am trying to understand the data more and get very deeper understanding of the data.

For this purpose, I would like to take an average of 12 months consumption before the tax increase, that is Jan 2004 to Dec 2004. Then using this computed mean, I would like to classify households into 4 categories: first category USD 1000-2500, second category USD 2501 - 5000, third category USD 5001-7500, and fourth category USD 7501 - 10000. (in data set minimum monthly consumption expenditure is USD 1000 and max is USD 10,000.00)

Using the above categorization criteria, I would like to check by how much expenditure has increased in Jan 2005, feb 2005 through dec 2010 for each category. I have been struggling on this issue for about 3 weeks and I could not figure how to even start. I would be highly grateful any suggestions and help. Thank you so much in advance.

I am using confidential data from tax office and I am not able to share the same dataset. However, I created the data that is similar to it:

data2 <- structure(list(id = c(1223, 1223, 1223, 1223, 1223, 1223, 1223, 
1223, 1223, 1223, 1223, 1223, 1223, 1223, 1223, 1223, 1223, 1223, 
1223, 1223, 1223, 1223, 1223, 1223, 1224, 1224, 1224, 1224, 1224, 
1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 
1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224), con = c(1954, 
1965, 2220, 1789, 2855, 2192, 1028, 2745, 1190, 2892, 1941, 1045, 
1778, 1660, 1037, 1259, 1655, 1429, 1617, 1927, 1105, 1948, 1929, 
1673, 7309, 9420, 9849, 7824, 7522, 7448, 7370, 6717, 9024, 7635, 
9316, 5173, 9071, 5997, 6315, 6636, 9978, 8077, 9170, 5440, 9442, 
6668, 5732, 8460), year = c(2004, 2004, 2004, 2004, 2004, 2004, 
2004, 2004, 2004, 2004, 2004, 2004, 2005, 2005, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2004, 2004, 2004, 2004, 
2004, 2004, 2004, 2004, 2004, 2004, 2004, 2004, 2005, 2005, 2005, 
2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2005), month = c(1, 
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 
5, 6, 7, 8, 9, 10, 11, 12)), row.names = c(NA, -48L), class = c("tbl_df", 
"tbl", "data.frame"))
stefan
  • 90,330
  • 6
  • 25
  • 51
Angelina
  • 71
  • 6
  • 1
    Please include a reproducible example of your data, there's [some tips here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to help you – stlba Aug 26 '20 at 13:50
  • Thank you stlba. I edited my original question and provided reproducible example data. I am sorry for this error, I am new to Stackflow too and learning many new things. Pls accept my apologies for the mistake in my original post. Thank you. – Angelina Aug 26 '20 at 14:56
  • No worries. When you're comparing the data from 2005 on to 2004, do you want to compare each family to itself or the mean of the whole group or something else? – stlba Aug 26 '20 at 15:27

1 Answers1

1

FYI, I use the dplyr package (part of the tidyverse) throughout this answer. I've also made the assumption that you want to compare the post-2004 consumption for each category as a whole, to the category average in 2004, rather than on an individual family basis. If this isn't correct let me know and I can alter the answer.

First, I make a separate table for the 2004 data only, and use this to calculate the mean consumption per ID for the whole year (using summarise()), then make a new column with the category each ID falls in (using mutate() and case_when()), and then calculate the mean consumption for each category.

data2_2004 <-
     data2 %>%
     filter(year == 2004) %>%
     group_by(id) %>%
     summarise(mean_con_2004_id = mean(con)) %>%
     mutate(household_category = case_when(between(mean_con_2004_id, 1000, 2500) ~ "cat1",
                                           between(mean_con_2004_id, 2501, 5000) ~ "cat2",
                                           between(mean_con_2004_id, 5001, 7500) ~ "cat3",
                                           between(mean_con_2004_id, 7501, 10000) ~ "cat4")) %>%
     group_by(household_category) %>%
     mutate(mean_con_2004_category = mean(mean_con_2004_id))
> data2_2004
# A tibble: 2 x 4
# Groups:   household_category [2]
     id mean_con_2004_id household_category mean_con_2004_category
  <dbl>            <dbl> <chr>                               <dbl>
1  1223            1985. cat1                                1985.
2  1224            7884. cat4                                7884.

Then, I filter your dataframe for data after 2004, and use left_join() to merge it with the 2004 data to add the mean consumption per family ID, category, and mean consumption per category.

data2_post2004 <- data2 %>%
     filter(year > 2004) %>%
     left_join(., data2_2004) 
> data2_post2004
# A tibble: 24 x 7
      id   con  year month mean_con_2004_id household_category mean_con_2004_category
   <dbl> <dbl> <dbl> <dbl>            <dbl> <chr>                               <dbl>
 1  1223  1778  2005     1            1985. cat1                                1985.
 2  1223  1660  2005     2            1985. cat1                                1985.
 3  1223  1037  2005     3            1985. cat1                                1985.
 4  1223  1259  2005     4            1985. cat1                                1985.
 5  1223  1655  2005     5            1985. cat1                                1985.
 6  1223  1429  2005     6            1985. cat1                                1985.
 7  1223  1617  2005     7            1985. cat1                                1985.
 8  1223  1927  2005     8            1985. cat1                                1985.
 9  1223  1105  2005     9            1985. cat1                                1985.
10  1223  1948  2005    10            1985. cat1                                1985.
# ... with 14 more rows

From here, you can do whatever comparisons you want to. For example, to compare the mean consumption in each category each month to the 2004 average for that category:

data2_post2004_summary <- data2_post2004 %>% 
     group_by(household_category, year, month, mean_con_2004_category) %>%
     summarise(mean_con = mean(con)) %>% 
     mutate(diff_2004 = mean_con - mean_con_2004_category) %>%
     mutate(percent_diff_2004 = diff_2004/mean_con_2004_category * 100)

If you want to plot the data instead, you can convert the year + month columns to a date column before plotting.

data2_post2004_summary %>%
     mutate(date = as.Date(paste(year, month, "01", sep = "-"))) %>%
     ggplot(aes(x = date, y = mean_con)) +
     geom_line() +
     geom_line(aes(y = mean_con_2004_category), linetype = "dotted") +
     facet_wrap(facets = vars(household_category))

a plot of the mean consumption over time, with a dotted line for the 2004 average, faceted by category

stlba
  • 667
  • 3
  • 13
  • Thank you so much stlba. I am so grateful for your help. Yes, this was exactly what I was trying to do but never succeed. Now I am able to understand my data and track how policy change has affected the consumption. Once again, thank you. – Angelina Aug 26 '20 at 23:54
  • Great! Glad I could help. You can choose this as the answer to mark your post solved if it's all working :) – stlba Aug 27 '20 at 12:45