From long to wide with unbalanced dataset and one row per observation

Question

I've tried many things trying to figure out to go from long to wide, but I cannot get one row per observation. It gives many NA values since my data is unbalanced (I cannot just shift all values one row up, etc.)

This is a part of my data:

structure(list(employees = c(384, 432, 624, 334, 356, 338, 348, 
1122, 1110, 1492), profit_margin = c(-0.14684, -0.85298, -0.58792, 
-0.38872, -1.30312, -0.86866, -0.6363, -1.925, 0.567, 3.984), 
    RD_expenses = c(8946.414554, 9977.75638, 43326.90616, 48870.14658, 
    35022.10866, 39584.25952, 32259.2173, 6303.95, 6812.46, 14993.39
    ), RD_intensity = c(7.10910850621956, 8.98811378416267, 15.6492601635234, 
    17.6773777378817, 13.1744528168514, 14.3544852219875, 11.2624231565094, 
    0.500071500320608, 0.559723756230354, 1.36999818636439), 
    sales = c(125844.3945, 111010.5704, 276862.329, 276455.8596, 
    265833.4972, 275762.3064, 286432.2966, 1260609.732, 1217111.106, 
    1094409.478), treated = c("1", "1", "1", "1", "1", "1", "1", 
    "1", "1", "1"), year = c(2013L, 2014L, 2015L, 2016L, 2017L, 
    2018L, 2019L, 2015L, 2016L, 2017L), id = c(1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 2L, 2L, 2L), company = c("ALLERGAN PUBLIC LIMITED COMPANY", 
    "ALLERGAN PUBLIC LIMITED COMPANY", "ALLERGAN PUBLIC LIMITED COMPANY", 
    "ALLERGAN PUBLIC LIMITED COMPANY", "ALLERGAN PUBLIC LIMITED COMPANY", 
    "ALLERGAN PUBLIC LIMITED COMPANY", "ALLERGAN PUBLIC LIMITED COMPANY", 
    "ALPINE ELECTRONICS, INC.", "ALPINE ELECTRONICS, INC.", "ALPINE ELECTRONICS, INC."
    )), row.names = c(NA, -10L), class = c("data.table", "data.frame"
), .internal.selfref = <pointer: 0x000001c71d471ef0>)

I've tried this:

test %>%   group_by(id, company) %>%   dplyr::mutate(row = row_number()) %>%   tidyr::pivot_wider(names_from = year, values_from = c("employees", "profit_margin", "RD_expenses", "RD_intensity", "sales", "treated"))

But this gives many NA values and not one row per observation, like this:

1   ALLERGAN PUBLIC LIMITED COMPANY 1   384 NA  NA  NA  NA  
1   ALLERGAN PUBLIC LIMITED COMPANY 2   NA  432 NA  NA  NA  
1   ALLERGAN PUBLIC LIMITED COMPANY 3   NA  NA  624 NA  NA  
1   ALLERGAN PUBLIC LIMITED COMPANY 4   NA  NA  NA  334 NA  
1   ALLERGAN PUBLIC LIMITED COMPANY 5   NA  NA  NA  NA  356 
1   ALLERGAN PUBLIC LIMITED COMPANY 6   NA  NA  NA  NA  NA  
1   ALLERGAN PUBLIC LIMITED COMPANY 7   NA  NA  NA  NA  NA  
2   ALPINE ELECTRONICS, INC.    1   NA  NA  1122    NA  NA  
2   ALPINE ELECTRONICS, INC.    2   NA  NA  NA  1110    NA  
2   ALPINE ELECTRONICS, INC.    3   NA  NA  NA  NA  1492

Also, I do not have exactly 7 observations per company, so that makes it a bit harder.

I have also tried this:

test %>% 
  group_by(id) %>% 
  dplyr::mutate(Visit = 1:n()) %>%
gather("employees", "profit_margin", "RD_expenses", "RD_intensity", "sales", "treated", "year", key = variable, value = number) %>%
unite(combi, variable, Visit) %>%
 spread(combi, number)

But that gives even more strange results, with columns up till _31, where the maximum of observations of 1 company (or id) is 7.

Any ideas? I need it in order to use matching!

Thank you

score 1 · Answer 1 · answered Jun 23 '21 at 18:48

You can just use the reshape() function in base R.

reshape(d, direction = "wide", timevar = "year", idvar = c("id", "company"))

There will be NAs for any years that the firm doesn't have data. Include any time-fixed variables (e.g., country or strategy, if measured) in idvar.

score 0 · Accepted Answer · answered Jun 23 '21 at 10:38

0

I think you can skip creation of row column all together.

tidyr::pivot_wider(df, names_from = year, 
                  values_from = c(employees, profit_margin, RD_expenses, RD_intensity, sales, treated))

answered Jun 23 '21 at 10:38

Ronak Shah

377,200
20
156
213

Thanks, but this gives all values either or – Dionne van den Aker Jun 23 '21 at 10:48
It doesn't give any such values for the data that you have shared. Is it possible to have access to complete data somehow? – Ronak Shah Jun 23 '21 at 11:33

From long to wide with unbalanced dataset and one row per observation

2 Answers2