2
  • I have a big data frame (~ 280000 rows x 1200 columns), each row represents a basket of items.
  • The first column has the basket id.
  • The next ~120 columns have either a 4 digit item code (of one of the items present in the basket) or are blank (for remainder of 120 cells after all items of the basket have been accounted for).
  • The subsequent columns (from 121 till 1200) are each named with one of the unique 4 digit item codes from the items universe. All these columns are blank.

Now, I want to tag the cells in these columns (121 to 1200), if that item (the column name) appears in that row/basket.

Following is a smaller version of the dataframe (df);

df <- data.frame(BasketID = c("001", "002"),
                 Item1 = c(1001, 1002), Item2 = c(1002,""), Item3 = "",
                 `1001` = "", `1002` = "", `1003` = "", check.names=F)

BasketID   Item1   Item2   Item3   ...   1001   1002   1003
001        1001    1002
002        1003

Below is what I require;

BasketID   Item1   Item2   Item3   ...   1001   1002   1003
001        1001    1002                  tag    tag
002        1003                                        tag

I wrote the following for loop to achieve the above;

for (i in rownames(df)) {  

    for (j in colnames(df[,121:1200])) {  

        if (j %in% df[i,121:1200]) { 

            df[i,j] <- "tag"
        }
    }
}

However, since the dataframe is big, the above command is taking forever to run forcing me to abort midway. Is there a more efficient way to do this? Thanks v. much in advance!!

  • 1
    related: https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format – jogo Feb 28 '18 at 07:59
  • Hi @jogo, couldn't find the link you shared useful for my problem. Infact, the link popped when I was looking up for a solution to this problem. There, there is a separate vector using which dataframe is being filtered. however, what I need is lookups within my dataframe without filtering. Let me know if you can think of something else as well. Thanks! – Aditya Roongta Feb 28 '18 at 08:31
  • Can you please giva [a small reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? i.e. edit your question. – jogo Feb 28 '18 at 08:35
  • You can take the example I shared in my post. Consider it to be a 2x6 dataframe like in the example. If I had to run the for loop (shared in my post) on this 2x6 df, I guess it would run in a jiffy. It's just that my actual dataframe is a much bigger version of this 2x6 df, that's why not possible to share here. – Aditya Roongta Feb 28 '18 at 08:40
  • 1
    Please use `dput(df)` to show your dataframe (or a definition of it). What value is in your dataframe `df` for `BasketID` 001 and `Item3`? – jogo Feb 28 '18 at 08:42
  • Hey @jogo, have modified my post as requested. To answer your question, currently there is no value in BasketID 001 and Item3. – Aditya Roongta Feb 28 '18 at 09:08
  • In your example there is a value: it is character with length 0. – jogo Feb 28 '18 at 09:12
  • @jogo Yes, stand corrected! – Aditya Roongta Feb 28 '18 at 09:15

1 Answers1

1

This might be more easily done using the data.table package. Convert into a long format using melt. Subset to those with valid value. Then pivot the data using length > 0 as the fun.aggregate:

library(data.table)
dcast.data.table(
    melt(setDT(df), id.vars="BasketID"),
    BasketID ~ value,
    function(x) length(x) > 0,
    subset=.(value!=""))

#the join the results with original dataset to get OP's desired format
tags[df, on=.(BasketID)]

#    BasketID  1001  1002  1003 Item1 Item2 Item3
# 1:      001  TRUE  TRUE FALSE  1001  1002      
# 2:      002 FALSE FALSE  TRUE  1003            

data:

df <- data.frame(BasketID = c("001", "002"),
    Item1 = c(1001, 1003), Item2 = c(1002,""), Item3 = "")
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • 1
    `D <- melt(setDT(df), id.vars="BasketID"); dcast(D[value!=""], BasketID ~ value, function(x) if (length(x)!=0) "tag" else "")` – jogo Feb 28 '18 at 09:33