1

I merged two datasets from the same survey. One is on the individual level selectedindividual and one on the household level, selectedhousehold. I have merged the two datasets using the following code (using left_join from dplyr):

mergeddf <- left_join(selectedhousehold, selectedindividual)  %>% group_by(shserial)  %>% slice (1)

The shserial variable is the primary key present in both datasets. Each household has a specific number. Each household can contain up to two individuals, which will therefore have the same shserial. Because I want to conduct my analysis on the household level, I did not want to just use merge() on the datasets (as this duplicated the observations in households with 2 individuals, for, for example, one of my dependent variables on the household level - GrossIncome).

I still have a problem though:

I have a variable on the individual level, WrkStat (with three levels, working, NWork, FTEduc) that I want to include in my model. The code that I used for merging the datasets only retained the first observations for two shserials with the same number (I assume this, at least, I could not figure out how slice() works exactly). This is not great for my analysis as I don't want to select one of two individuals in a household at random. To illustrate, the summary statistics of WrkStat in both the non-merged and the merged dataframes:

> summary(selectedindividual$WrkStat)
working  FTEduc   NWork    NA's 
    324     748    2455     201 

> summary(mergeddf$WrkStat)
working  FTEduc   NWork    NA's 
    251      77    2097       5 

As a solution, I figured I would create a new variable, WrkStat2 which combines the observations for two individuals in one household. I want to create this variable before merging the datasets.

I was hoping I could create this new variable on the basis of the mutual shserial number. However, I can't figure out how to do this.

EDIT:

The structure of my dataframe:

selectedindividual <- structure(list(`shserial` = c(1010574, 1010574, 
1011104, 1011104, 1011109, 1011109, 1011134, 1011134, 1011142, 
1011143, 1011148, 1011148, 1011154, 1011154, 1011156, 1011171, 
1011171, 1011174, 1011174, 1011182), `WrkStat` = structure(c(3L, 
2L, 3L, 2L, 3L, NA, 1L, NA, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 1L, 2L, 
3L, NA, 3L), .Label = c("working", "FTEduc", "NWork"), class = "factor")), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

Gives this output:

   shserial WrkStat
      <dbl> <fct>  
 1  1010574 NWork  
 2  1010574 FTEduc 
 3  1011104 NWork  
 4  1011104 FTEduc 
 5  1011109 NWork  
 6  1011109 NA     
 7  1011134 working
 8  1011134 NA     
 9  1011142 NWork  
10  1011143 NWork  
11  1011148 NWork  
12  1011148 FTEduc 
13  1011154 NWork  
14  1011154 FTEduc 
15  1011156 NWork  
16  1011171 working
17  1011171 FTEduc 
18  1011174 NWork 

I would like this output:

   shserial WrkStat2
      <dbl> <fct>  
 1  1010574 NWork/FTEduc  
 2  1011104 NWork/FTEduc 
 3  1011109 NWork     
 4  1011134 working
 5  1011142 NWork  
 6  1011143 NWork  
 7  1011148 NWork/FTEduc
 8  1011154 NWork/FTEduc 
 9  1011156 NWork  
10  1011171 working/FTEduc
11  1011174 NWork 

(This also removes the NA's that are not on it's own (so not a combined WrkStat), although I think it would also be fine if all the NA's would be removed in this process).

Another edit:

WrkStat2 should have the following labels:

"working/working",
"working/NWork",
"working/FTEduc",
"NWork/NWork",
"NWork/FTEduc",
"FTEduc/FTEduc",
"working",
"NWork",
"FTEduc"

I'm sorry if something is not clear or if the whole things is impossible (let me know). I am not great at thinking logically and have struggled with this for a few days.

Laura
  • 13
  • 4
  • You should be able to use mutate and ifelse like they do in [this example](https://stackoverflow.com/questions/24459752/can-dplyr-package-be-used-for-conditional-mutating). – Lyndon Walker Apr 18 '20 at 22:30
  • Please add data using `dput` and show the expected output for the same. Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Apr 19 '20 at 07:29
  • @Ronak Shah I've added the the data and expected output, plus some more information on what I want to achieve. I hope it is more clear and possible to reproduce my problem. – Laura Apr 19 '20 at 12:35
  • @Lyndon Walker Thank you. I saw this example but could not figure out how to apply it to my data. – Laura Apr 19 '20 at 12:36

1 Answers1

0

We can remove NA values and paste WrkStat for each shserial.

This can be done in base R :

aggregate(WrkStat~shserial, selectedindividual, function(x) 
           paste0(na.omit(x), collapse = "/"))

#   shserial        WrkStat
#1   1010574   NWork/FTEduc
#2   1011104   NWork/FTEduc
#3   1011109          NWork
#4   1011134        working
#5   1011142          NWork
#6   1011143          NWork
#7   1011148   NWork/FTEduc
#8   1011154   NWork/FTEduc
#9   1011156          NWork
#10  1011171 working/FTEduc
#11  1011174          NWork
#12  1011182          NWork

dplyr :

library(dplyr)
selectedindividual %>% 
   group_by(shserial) %>% 
   summarise(WrkStat2 = paste0(na.omit(WrkStat), collapse = "/"))

Or in data.table :

library(data.table)
setDT(selectedindividual)[, (WrkStat = paste0(na.omit(WrkStat), collapse = "/")), 
                            shserial]
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213