2

I've struggled a lot to get this right in R. My data looks like this (real data has more than 120k observations):

df <- data.frame(
  Input = c(1,2,3,4,4,5,1,3,4),
  Output = c(91,91,91,91,91,91,92,92,92)
)

df
  Input Output
1     1     91
2     2     91
3     3     91
4     4     91
5     4     91
6     5     91
7     1     92
8     3     92
9     4     92

The numbers are codes for physical stuff. Key here is that product 91 uses 5 inputs whereas product 92 uses only 3, AND there are duplicates in 91 (two 4s)

I want the data frame to have outputs as column names and inputs as values (disregarding duplicates):

       91         92
       1          1
       2          NA
       3          3
       4          4
       5          NA

So, I did a reshape from long to wide using

df2 <- reshape(df, idvar = "Input", v.names = "Output", timevar = "Output", direction = "wide", sep = "_")

This takes lots of time and haven't been able to make other code from this post to work. It yields an intermediate step:

  Input Output_91 Output_92
1     1        91        92
2     2        91        NA
3     3        91        92
4     4        91        92
5     5        91        NA

Then, all I have to do is to replace each Output column with the first column, except if NA. I can do this trivially for one column at a time. For instance:

df2$Output_92[!is.na(df2$Output_92)] <- df2$Input[!is.na(df2$Output_92)]

I've been trying to make a loop over columns to iterate over all 2+columns, mimicking the above command. Something like:

for(i in colnames(df2)){
        df2[!is.na(i)][i] <- df2$Input[!is.na(i)][i]
      }

This does not work.

So, in principle I need help either only in this loop (title of question). But hints optimising the reshaping, or perhaps a simpler way to do the whole thing are more than welcome.

UPDATE: I realised duplicates was key to my problem, since without these, standard solutions in the linked post work fine. Updated question accordingly. The answer below helps with cases where ID has duplicates.

luchonacho
  • 6,759
  • 4
  • 35
  • 52
  • 1
    You could do `df %>% mutate(ID = Input) %>% pivot_wider(names_from = "Output", values_from = "Input") %>% select(-ID)`. – tmfmnk Oct 19 '21 at 14:10
  • @tmfmnk Thanks. Perhaps my example can not fully reproduce my data's features. I remember trying `pivot_wider`, but without the `mutate` or `select` additions (as in the linked post). Your code works fine for the example above but in my real data I get some cells with vectors rather than single entries. Vector have the same value, for instance, `c(99220, 99220, 99220)`. Perhaps because of some repeated entries per Output. Not sure. – luchonacho Oct 19 '21 at 14:26
  • @tmfmnk Yes. Confirmed. For instance, add an extra row of data, 4 and 91 respectively. And you get a vector as in my real data. – luchonacho Oct 19 '21 at 14:28

2 Answers2

1

If you have duplicate inputs for a given output, then assuming you don't care to count them or treat them any differently, then all methods below work by replacing df with unique(df). (One can also use dplyr::distinct if preferred.)

Once you've resolved uniqueness, then ... this is "just" reshaping/pivoting from long to wide.

base R

stats::reshape is a little hard to work with, and it requires some things that are not uniquely present. For example, it requires idvar and the v.names variables to be unique columns (so we duplicate Input):

df$Input2 <- df$Input
out <- reshape(unique(df), idvar = "Input", v.names = "Input2", timevar = "Output", direction = "wide")
out
#   Input Input2.91 Input2.92
# 1     1         1         1
# 2     2         2        NA
# 3     3         3         3
# 4     4         4         4
# 6     5         5        NA
names(out) <- gsub("Input2\\.", "", names(out))
# out[,-1]
  91 92
# 1  1  1
# 2  2 NA
# 3  3  3
# 4  4  4
# 6  5 NA

tidyr

(Ditto on the duplicated column.)

library(dplyr)
library(tidyr) # pivot_wider
df %>%
  distinct() %>%
   mutate(Input2 = Input) %>%
   pivot_wider(Input, names_from = "Output", values_from = "Input2") %>%
   select(-Input)
# # A tibble: 5 x 2
#    `91`  `92`
#   <dbl> <dbl>
# 1     1     1
# 2     2    NA
# 3     3     3
# 4     4     4
# 5     5    NA

data.table

No such requirement on needing "Input2" here.

library(data.table)
dcast(unique(as.data.table(df)), Input ~ Output, value.var = "Input")[,-1]
#       91    92
#    <num> <num>
# 1:     1     1
# 2:     2    NA
# 3:     3     3
# 4:     4     4
# 5:     5    NA

Here, unique can go inside or outside as.data.table(.), your choice.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thanks! I just realised my data has repeated entries within Output, a feature not in the example. In that case, the data.table code (which tried before) computes frequencies rather than providing values. The warning message is clear: `Aggregate function missing, defaulting to 'length'`. Can your code be adapted to such example? Alternatively, I can remove duplicates beforehand. Just came to my mind as a workaround. – luchonacho Oct 19 '21 at 14:36
  • 1
    Sure. I just tested using `df <- rbind(df, df[4,])` (from your original `df`), and all code works replacing mention of `df` with `unique(df)`. This is ultimately a pivoting/reshaping operation which is complicated by non-uniqueness. – r2evans Oct 19 '21 at 14:40
  • 1
    I see. So more advanced tools like here or in the comment did not work because of duplicates. Taking that into consideration, more advanced methods do work fine. – luchonacho Oct 19 '21 at 14:52
  • Naturally. I see the issue now. There are actually different entities producing outputs with different inputs. That's why there are repeated entries per output. Forgot about that bit. My ID (output) is thus one level up than the original data (firms). Otherwise this was a standard reshaping, as you say. – luchonacho Oct 19 '21 at 14:55
  • If this resolves it, it can be accepted and potentially marked as a dupe, as it can be resolved by https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format. – r2evans Oct 19 '21 at 14:59
0

Trivially, since I am not interested in duplicates, I just had to remove them beforehand, after which code in this post works fine.

Remove with df <- df[!duplicated(df[c("Output","Input")]),]

luchonacho
  • 6,759
  • 4
  • 35
  • 52