2

Problem: How to generate a new dataset from an existing one, basically it is a reshape from long to wide, but a bit more complicated.

I have a non-trivial amount of data, of which I offer a simplified version below:

id      <- c(1,2,3,4,5)
job     <- c(11,12,11,12,13)
sex     <- c(0,1,0,1,0)
country <- c(1,2,3,2,1)
data    <- data.frame(id, job, sex, country)

Desired data: I'd like to have a dataset of the jobs and their occupants, like this: in job=11, I have 2 people of sex==0 and 1 born in country==1 and 1 born in country==3

So, the new dataset would be like this:

  jobs jobs_sex0 jobs_sex1 jobs_country1 jobs_country2 jobs_country3
1   11         2         0             1             0             0
2   12         0         2             0             2             0
3   13         1         0             0             0             1

I have an intuition that this can be achieved with tapply, but I am not sure how.

I have tried this, and it does not work:

tapply(occupation[sex==1],sex[sex==1], sum)
aggregate(occupation, list(sex), fun=sum)

Edit: I think this Q is not a duplicate of Transpose / reshape dataframe without "timevar" from long to wide format, as the problem I have is that I need to reshape different factor variables with different number of levels... Applying the answer from the supposedly duplicated Q does not work...

gung - Reinstate Monica
  • 11,583
  • 7
  • 60
  • 79
cremorna
  • 374
  • 1
  • 9

2 Answers2

2

I wonder if the tableone package might help you here. Consider:

data$sex     <- factor(data$sex)      # note that you will have to ensure these are factors
data$country <- factor(data$country)

library(tableone)
tab1 <- CreateTableOne(vars=c("sex", "country"), strata="job", data=data)
print(tab1, showAllLevels=TRUE, test=FALSE, explain=FALSE)
#              Stratified by job
#               level 11         12         13        
#   n                 2          2          1         
#   sex         0     2 (100.0)  0 (  0.0)  1 (100.0) 
#               1     0 (  0.0)  2 (100.0)  0 (  0.0) 
#   country     1     1 ( 50.0)  0 (  0.0)  1 (100.0) 
#               2     0 (  0.0)  2 (100.0)  0 (  0.0) 
#               3     1 ( 50.0)  0 (  0.0)  0 (  0.0) 

If you want to do subsequent processing, the above solution will be less workable. Here is a coded solution, but you will have to adapt it for each situation:

out.data <- t(sapply(split(data, job), function(df){ 
                       with(df, c(table(sex), table(country))) }))
out.data <- data.frame(job=rownames(out.data), out.data)
rownames(out.data)      <- NULL
colnames(out.data)[2:6] <- c(paste("sex",     levels(data$sex),     sep="_"),
                             paste("country", levels(data$country), sep="_") )
out.data
#   job sex_0 sex_1 country_1 country_2 country_3
# 1  11     2     0         1         0         1
# 2  12     0     2         0         2         0
# 3  13     1     0         1         0         0
gung - Reinstate Monica
  • 11,583
  • 7
  • 60
  • 79
  • Thanks @gung! This definitely solves one part of my question. The second part is: how can I transform tab1 from a list to a dataframe, so that I can use it as a panel dataset and perform analyses? Simply coercing it with as.data.frame does not work... – cremorna Jul 17 '17 at 16:43
  • @cremorna, I'm not sure how to do that easily. The problem is you have the `" (100.0)"` in there. I thought there was a way to not get that, but I'm not seeing it. It would take a bit of coding to make that happen. To create this for subsequent processing, there may be a better method. – gung - Reinstate Monica Jul 17 '17 at 16:47
  • It works!! Thank you very much! This does, however, prove to me that we need a version of Stata's "collapse" in R, that can handle the function "count" for factor levels! Just in case anyone would like to program that into a package :) – cremorna Jul 17 '17 at 17:23
  • You're welcome, @cremorna. I don't know Stata, but I suspect what you want already exists somewhere. – gung - Reinstate Monica Jul 17 '17 at 17:24
0

I think I have found another very simple solution, with the help of some friends :)

data
  id job sex country
1  1  11   2       1
2  2  12   1       2
3  3  11   2       3
4  4  12   1       2
5  5  13   2       1

data$sex <- as.factor(data$sex)
data$country <- as.factor(data$country)

agg_data <- aggregate((model.matrix(~.-1, data[,-(1:2)])), by =         
list(unique.jobs = data$job), FUN=sum)
agg_data

  unique.jobs sex1 sex2 country1 country2 country3
1          11    0    2        1        0        1
2          12    2    0        0        2        0
3          13    0    1        1        0        0
cremorna
  • 374
  • 1
  • 9