0

In R, I have a data frame that is based on an IPUMS CPS extract in which I'd like to tabulate households based on residents' education level. In the df, each observation is a specific person. There exists a variable called "CPSID" which is a unique household identifier. That being said, all observations with the same value in the CPSID column live in the same household. For example, if three separate observations have the same value in the CPSID column, that means those three observations live in a household together. I've used ifelse statements to identify each observation's education level using the "Education" variable.

Now, I'd like to do two things:

  1. I want to add a new column that displays the total amount of people in a specific observation's household WITHOUT collapsing the data by a certain group. So, if three observations have the same CPSID, then "3" should be displayed in the new column for each of those observations sharing the CPSID.
  2. Right now I have a "1" indicating the education level attained and a "0" otherwise. Instead, I'd like that number to again display the total number of people in the household. For example, if an observation lives with three people, then I want a "3" to show up in the area that identifies the education level.

The screenshot of the R Studio df is what I currently have

what it looks like.

The screenshot of the excel is an example of what it should look like

what I want it to look like.

If anyone could help me out, that would be spectacular! Thanks a ton in advance.

Education_status <- data.frame(data3$YEAR, data3$MONTH, data3$CPSID, data3$HWTFINL, data3$EDUC99)
Education_status <- filter(Education_status, data3.EDUC99 !=0) 
Education_status["less than Associate's"] <- ifelse(Education_status$data3.EDUC99 <= 11, 1, 0)
Education_status["Associate's"] <-
 ifelse(Education_status$data3.EDUC99 == 12, 1, 0) +  
 ifelse(Education_status$data3.EDUC99 == 13, 1, 0) +  
 ifelse(Education_status$data3.EDUC99 == 14, 1, 0)
Education_status["Bachelor's"] <- ifelse(Education_status$data3.EDUC99 == 15, 1, 0)
Education_status["more than Bachelor's"] <- ifelse(Education_status$data3.EDUC99  15, 1, 0)
 
colnames(Education_status) <- c("Year", "Month", "CPSID", "Weight",
 "Education", "less than Associate's", "Associate's", "Bachelor's",
 "more than Bachelor's")
 
Edu_example <- dput(Education_status[1:10,])
Edu_Ex <-   structure(
     list(
       Year = c(2019, 2019, 2019, 2019, 2019, 2019, 2019,
                2019, 2019, 2019),
       Month = structure(
         c(1L, 1L, 1L, 1L, 1L, 1L,
           1L, 1L, 1L, 1L),
         labels = structure(
           1:12,
           .Names = c(
             "January",
             "February",
             "March",
             "April",
             "May",
             "June",
             "July",
             "August",
             "September",
             "October",
             "November",
             "December"
           )
         ),
         label = "Month",
         var_desc = "MONTH indicates the calendar month of the CPS interview.",
         class = c("haven_labelled",
                   "vctrs_vctr", "integer")
       ),
       CPSID = c(
         20181200000100,
         20181200000100,
         20171200000300,
         20171200000300,
         20190100000500,
         20190100000500,
         20171200000500,
         20181000000200,
         20171000000400,
         20171000000400
       ),
       Weight = c(
         1821.1212,
         1821.1212,
         1804.2193,
         1804.2193,
         3296.9061,
         3296.9061,
         2296.3113,
         2673.3924,
         1733.4595,
         1733.4595
       ),
       Education = structure(
         c(9L,
           11L, 11L, 10L, 13L, 10L, 11L, 11L, 11L, 7L),
         labels = c(
           NIU = 0L,
           `No school completed` = 1L,
           `1st-4th grade` = 4L,
           `5th-8th grade` = 5L,
           `9th grade` = 6L,
           `10th grade` = 7L,
           `11th grade` = 8L,
           `12th grade, no diploma` = 9L,
           `High school graduate, or GED` = 10L,
           `Some college, no degree` = 11L,
           `Associate degree, type of program not specified` = 12L,
           `Associate degree, occupational program` = 13L,
           `Associate degree, academic program` = 14L,
           `Bachelors degree` = 15L,
           `Masters degree` = 16L,
           `Professional degree` = 17L,
           `Doctorate degree` = 18L
         ),
         label = "Educational attainment, 1990",
         var_desc = "EDUC99 reports the respondent's highest level of educational attainment. Respondents without high school diplomas were
 to indicate the highest school grade they had completed, while those
 with high school diplomas were to indicate the highest diploma or
 degree they had obtained.",
         class = c("haven_labelled",
                   "vctrs_vctr", "integer")
       ),
       `less than Associate's` = c(1, 1,
                                   1, 1, 0, 1, 1, 1, 1, 1),
       `Associate's` = c(0, 0, 0, 0, 1, 0,
                         0, 0, 0, 0),
       `Bachelor's` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
       `more than Bachelor's` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
     ),
     row.names = c(NA,
                   10L),
     class = "data.frame"   )
r2evans
  • 141,215
  • 6
  • 77
  • 149
PhilP
  • 1
  • 2
  • Welcome to S.O. Can you please provide the example code / data / output as plain text as a minimum, or preferably using something like `dput(df)` so that possible answerers can recreate your R objects? See here for more info: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – thelatemail Jul 09 '21 at 00:22
  • Said differently, if you are asking for working-code for your question, you are asking us to generate code based on the *appearance* of your data, or perhaps you're asking us to transcribe your data. Worse, giving us a screenshot of code is just as bad. It is actually *more* work for you to take a screenshot of code and paste it into the question versus just copying the text and pasting into a [code block](https://stackoverflow.com/editing-help). – r2evans Jul 09 '21 at 00:37
  • @r2evans Hey man, I'm new to the platform. Thanks for taking the time to help me out. I've edited my question to give you the code but not 100% sure if that's what you need. If there's something wrong w the way I put it on SO, please let me know. – PhilP Jul 12 '21 at 22:33
  • Thank you! I formatted your code a little (i.e., added line breaks where needed) and noticed that your last `ifelse` is missing what I think is the `==`. If you're missing that in your real code, then that is certainly a syntax error. Also, in your previous edit, you removed the images; I've restored two of the three (the other is the code, which you corrected by adding), since it is still informative to show the desired effect. – r2evans Jul 13 '21 at 12:56
  • Can you clarify the relationship between your data's column `CPSID` and your expected-output columns `Person ID` and `HHID`? I think `Person ID` is not here (so ignore it), and `HHID` is `CPSID`, but that should be clearer. – r2evans Jul 13 '21 at 13:32
  • @r2evans So the excel is the example version of what I want the R Studio df to eventually look like. No need to worry about the Person ID, that's just a count of the observations - act like it's not there. What's important here is that the CPSID from the actual R Studio df is synonymous to the HHID in the excel. HHID was just written for simplicity. They serve the same purpose. So what I would like to do is add another column to my R Studio df that is equivalent to the "Number of ppl in HH" variable in the example excel table. I'm very unsure of the code for that. – PhilP Jul 13 '21 at 20:44

0 Answers0