
I've got the following problem (which is actually two problems): I have a CSV file with transactions, but all the items bought under one TransactionID are stored on multiple lines within a single cell.

It looks like this:

TransactionID    Items

1234             Milk
                 Butter
                 Bread

2345             Milk
                 Bread

3456             Beer
                 Milk

4567             Beer
                 Butter

As you can see, not every item appears in every transaction.

How can I import my data into R as a transaction matrix that looks like this?

TransactionID    Milk    Butter    Bread    Beer
1234             1       1         1        0
2345             1       0         1        0
3456             1       0         0        1
4567             0       1         0        1

Can it be done in a single, elegant step? After the import I want to analyze my data using the arules package.

Thanks in advance!

AutoMiner
    Including pictures of data is not very helpful. See [how to create a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to include all the necessary information in the question itself. – MrFlick Nov 16 '16 at 14:25
  • Tried to fix readability. – AutoMiner Nov 25 '16 at 15:40

2 Answers


This is not a single line, and it assumes the items are separated by blank spaces. I find the unique words first, then do a double loop.

u <- unique(do.call('c', strsplit(df$items, ' ')))   # all distinct items
for (i in 1:nrow(df)) {
  for (j in u) {
    # 1 if item j occurs in transaction i, 0 otherwise
    df[i, j] <- 1 * (j %in% strsplit(df$items[i], ' ')[[1]])
  }
}
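
For example, with a small space-separated data frame (column names assumed), the loop adds one 0/1 column per item:

df <- data.frame(TransID = c(1234, 2345),
                 items = c("Milk Butter Bread", "Milk Bread"),
                 stringsAsFactors = FALSE)
# after running the loop above, df looks like:
#   TransID             items Milk Butter Bread
# 1    1234 Milk Butter Bread    1      1     1
# 2    2345        Milk Bread    1      0     1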
Dirk Nachbar

Thank you @DirkNachbar! Your approach worked and definitely pushed me in the right direction. Only a small adjustment was necessary: the items were separated by line breaks rather than spaces.
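
In Dirk's code the only change needed is the split character (a sketch, using the same column name):

u <- unique(do.call('c', strsplit(as.character(df$items), '\n')))   # split on line breaks instead of spaces
# ... and the same '\n' separator in the strsplit() call inside the loop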

Afterwards I wrote the matrix to a file and re-imported it for arules.

write(mat, file = "TransactionMatrix", sep = "\n")                              # dump the matrix to a text file
trans <- read.transactions("TransactionMatrix", format = "basket", sep = "\n")  # read it back as an arules transactions object

The problem with this approach is the runtime: with several tens of thousands of transactions and a few thousand items, it takes ages to compute the matrix.
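
Most of that time goes into filling the data frame one cell at a time. A vectorised variant of the same idea should be considerably faster (untested sketch, assuming the same column names as above):

item_lists <- strsplit(as.character(df$items), '\n')   # one character vector of items per transaction
u <- unique(unlist(item_lists))                        # all distinct items
# one 0/1 column per item, computed over all transactions at once
mat <- sapply(u, function(item) vapply(item_lists, function(x) as.integer(item %in% x), integer(1)))
rownames(mat) <- as.character(df$TransID)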

Since my final goal was to get the data into arules' transactions format, I eventually used another approach and concatenated all items into one long list.

df <- read.csv("Transactions.csv", header = TRUE, sep = ";", dec = ",")

# count the total number of (transaction, item) pairs
MaxTransactions <- 0
for (i in 1:length(df$items)) {
    add <- length(strsplit(as.character(df$items[i]), '\n')[[1]])
    MaxTransactions <- MaxTransactions + add
}

# pre-allocate the long-format vectors
IDs <- rep(NA, MaxTransactions)
ITEMs <- rep(NA, MaxTransactions)

Position <- 0

for (i in 1:length(df$items)) {
    # split the cell into its individual items and repeat the ID once per item
    tempITEM <- strsplit(as.character(df$items[i]), '\n')[[1]]
    tempID <- rep(as.character(df$TransID[i]), length(tempITEM))

    for (j in 1:length(tempITEM)) {
        IDs[Position + j] <- tempID[j]
        ITEMs[Position + j] <- tempITEM[j]
    }
    Position <- Position + length(tempITEM)
}
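
For reference, the same long format can also be built without the explicit loops (untested sketch, same column names):

item_lists <- strsplit(as.character(df$items), '\n')
IDs <- rep(as.character(df$TransID), lengths(item_lists))   # repeat each ID once per item
ITEMs <- unlist(item_lists)                                 # all items in one long vector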

That allowed me to coerce the data into arules' transactions class from the long ID/item format instead of the basket format.

a_df <- data.frame(ID = as.factor(IDs), ITEM = as.factor(ITEMs))

TransList <- split(a_df[, "ITEM"], a_df[, "ID"])   # one vector of items per transaction ID
TransMat <- as(TransList, "transactions")          # coerce to arules transactions (requires library(arules))

This worked well for me.
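
From there the usual arules workflow applies, for example (support and confidence thresholds picked arbitrarily):

library(arules)
summary(TransMat)                                  # item and transaction statistics
itemFrequencyPlot(TransMat, topN = 20)             # most frequent items
rules <- apriori(TransMat, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 10))        # top rules by lift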

AutoMiner