1

As it can be seen in the picture, I have a column with order numbers and a column with material numbers.

I want to find how often a pair of materials occur in the same order.

The problem is that I have 30000 rows of order numbers and 700 unique material numbers. Is it even possible?

I was thinking if it was easier to make a matrix with the 700 material numbers both in rows and column, and count number occurrences.

PICTURE

EDIT: The first picture was not a good example. I uploaded this second picture with random material numbers. So I want it to count for each pair (example 10-11, as I highlighted), how many times the appear in the same order. As it can be seen, 10&11 appear in 3 different orders.

PICTURE2

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
M.A
  • 65
  • 1
  • 8

2 Answers2

0

The optimal solution in terms of memory space would be one row for each pair which would be 700*699 / 2. This problem is still relatively small and the simplicity of manipulating a 700*700 matrix is probably more valuable than the 700*701/2 cells you're saving, which would work out to 240kB with one byte per cell. It could be even less if the matrix is sparse (i.e. most pairs of materials are never ordered together) and you use an appropriate data structure.

Here's how the code would look like:

First we want to create a dataframe with as many rows and columns as there are materials. Matrices are easier to create so we create one that we convert to a dataframe afterwards.

all_materials = levels(as.factor(X$Materials))
number_materials = length(all_materials)
Pairs <- as.data.frame(matrix(data = 0, nrow = number_materials, ncol = number_materials))

(Here, X is your dataset)

We then set the row names and column names to be able to access the rows and columns directly with the identifiers of the materials which are apparently not necessarily numbered from 1 to 700.

colnames(Pairs) <- all_materials
rownames(Pairs) <- all_materials

Then we iterate over the dataset

for(order in levels(as.factor(X$Order.number))){
  # getting the materials in each order
  materials_for_order = X[X$Order.number==order, "Materials"]
  if (length(materials_for_order)>1) {
    # finding each possible pair from the materials list
    all_pairs_in_order = combn(x=materials_for_order, m=2)
    # incrementing the cell at the line and column corresponding to each pair
    for(i in 1:ncol(all_pairs_in_order)){
      Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] = Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] + 1
    }
  }
}

At the end of the loop, the Pairs table should contain everything you need.

Syncrossus
  • 570
  • 3
  • 17
  • There's an issue at the creation of the `Pairs` table. The `levels` function requires a `factor` object which it's most likely not getting if it says 0 obs. of 0 variables, so you have to do `levels(as.factor(X$Materials))` – Syncrossus May 23 '18 at 12:45
  • Could you create a pastebin with a small excerpt of your data? (~20 observations). At least I could test the code with real data to see what's wrong – Syncrossus May 24 '18 at 12:16
  • OK, it's fixed. I had forgotten more calls to `as.factor` and the multiple-index access was broken. Sorry it took so long, my R is a little rusty (no pun intended). It works for me. – Syncrossus May 24 '18 at 13:06
0

Here is a data.table solution

library(data.table)    
combis <- data.table(do.call(rbind, 
    DT[, if (.N > 1) list(combn(Materials, 2, simplify=FALSE)), by=Order.number]$V1
))
ans <- combis[, .N, by=.(V1, V2)]

#check results
setorder(ans, V1, V2)
ans

And a base method:

allComb <- by(DT, DT$Order.number, function(x) {
    if (nrow(x) > 1) {
        return(combn(x$Materials, 2, simplify=FALSE)))
    }
    NULL
}
materialsPairs <- as.data.frame(do.call(rbind, unlist(allComb, recursive=FALSE)))

#https://stackoverflow.com/a/18201245/1989480
res <- aggregate(cnt ~ ., data=transform(materialsPairs, cnt=1), length)

#check results
head(res[order(res$V1, res$V2),])

data:

library(data.table)
set.seed(0L)
M <- 30e3
nOrd <- 3000
DT <- data.table(Order.number=sample(nOrd, M, replace=TRUE), 
    Materials=sample(700, M, replace=TRUE))
setorder(DT, Order.number, Materials)
chinsoon12
  • 25,005
  • 4
  • 25
  • 35