1

Let's suppose I have a data frame with the following parameters

DATA <- data.frame(ROWID, ID1, NAME1, ...IDn, NAMEn)

Sample of what the data might look like:

ROWID | ID1 | NAME1 | ID2 | NAME2 | IDn | NAMEn
001   | 001 | FAS   | 002 | MAS   | 999 | ZOO
002   | 003 | BIN   | 004 | DUN   | 998 | SOO

Where I have 201 columns by 10k+ rows. What I would like to do is to reshape this data such that for each row in the original DATA, I produce a set of rows in a subsequent data frame. Each row would consist of the originating ROWID, IDa, NAMEa, IDb, NAMEb pairs such that the first is matched with all others (99 pairs containing ID1, 98 with ID2, and so on). This would occur for each row producing a large data frame of all possible combinations within rows for every row. The result would look like:

ROWID1 | ID1 | NAME1 | ID2 | NAME2
ROWID1 | ID1 | NAME1 | ID3 | NAME3
...
ROWID1 | ID2 | NAME2 | ID3 | NAME3
...
ROWID2 | ID1 | NAME1 | ID2 | NAME2
ROWID2 | ID1 | NAME1 | ID3 | NAME3
...

The code I produced to do this is as follows. It works great, but only on smaller data frames. The full data frame is painfully slow, and I am hoping to have alternatives to speed it up using functions or something else of which I am unaware. Thanks in advance!!

DATA <- data.frame(as described above)
META <- data.frame(ROWID=numeric(0),ID1=numeric(0),
                   BUS1=character(0),ID2=numeric(0),BUS2=character(0))
for (i in 1:length(DATA$ROWID)) {
  SET <- data.frame(ROWID=numeric(0),ID1=numeric(0),
                   BUS1=character(0),ID2=numeric(0),BUS2=character(0))
  ROWID <- DATA[i,1]
       for (x in seq(3,ncol(DATA),2)) {
       for (y in seq(x,ncol(DATA),2)) {
             ID1 <- DATA[i,x-2]
             BUS1 <- DATA[i,x]
             ID2 <- DATA[i,y-2]
             BUS2 <- DATA[i,y]
             if (!is.na(BUS1) && !is.na(BUS2)) {
             NEW <- cbind(ROWID, ID1, BUS1, ID2, BUS2)
             SET <- rbind(SET, NEW)
                                    }
            }
         }
           META <- rbind(META, SET)
       }
MarkCCooke
  • 43
  • 4
  • 1
    It will much easier to help if you provide a [**small toy data set**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) – Henrik Feb 13 '14 at 22:37
  • How many rows do you have in dataframe, and how many are you expecting in META? question is not very clear to me at least, can you add more? – Ananta Feb 13 '14 at 22:37
  • Are `ID1`, `ID2`, etc. those actual strings, whereas `ROWID1`, `NAME1`, and `NAME2` are variables that get filled in with values? I second the request for sample data. – Ken Williams Feb 13 '14 at 22:40
  • You have two main optimization problems here. 1. you grow a *data.frame* by adding rows.. and this is terribly inefficient. See http://stackoverflow.com/questions/11486369/growing-a-data-frame-in-a-memory-efficient-manner . 2. you use a lot of temporary variables: ID1, BUS1, and so on. Getting rid of them will make your code faster. (also, you have a typo in your code: *BUS2* should be declared as a *character()*, not as a *numeric()*) – Jealie Feb 13 '14 at 23:06
  • And while I'm at it: as you do quite a lot of looping - add **require("compiler")** and **enableJIT(3)** to speed up your code. (see http://www.r-statistics.com/2012/04/speed-up-your-r-code-using-a-just-in-time-jit-compiler/ ) – Jealie Feb 13 '14 at 23:11
  • Thanks much for the comments so far. I apologize that it is so abstract and hard to express. I added the "sample" data. Strictly speaking, I have 10,351 rows by 100 column pairs (this will change, though, as it is an applied process). This is strictly a transposition of data; I am not applying any formulae anywhere. I am off to read about the "compiler" and please keep the comments coming! – MarkCCooke Feb 13 '14 at 23:52

1 Answers1

0

Here is my way to write it, which includes all the 3 optimizations I wrote as comments. Also, be careful! your code had some bugs in the addressing the columns... which I hopefully also fixed.

require('compiler')
enableJIT(3)
DATA2 = as.matrix(DATA)
META2 <- matrix(character(),ncol=5,nrow=(nrow(DATA2)*(ncol(DATA2)-2)^2/2)) # you want a matrix instead of a data.frame, and you want to pre-allocate its size
colnames(META2) = c("ROWID","ID1","BUS1","ID2","BUS2")
k=0
for (i in 1:nrow(DATA2)) {
  for (x in seq(3,ncol(DATA2)-2,2)) {
    for (y in seq(x+2,ncol(DATA2),2)) {
      k=k+1
      META2[k,] = c(DATA2[i,1],DATA2[i,x-1], DATA2[i,x], DATA2[i,y-1], DATA2[i,y]) # no need to use temporary variables
    }
  }
}
META2 = as.data.frame(META2)  # converting back to data.frame
META2$BUS1 = as.numeric(META2$BUS1)
META2$BUS2 = as.numeric(META2$BUS2)

I will let you handle yourself the case in which BUS1 or BUS2 is NA - basically, you need to not add these lines (and not increment the variable k), and after the loops you need to crop your matrix to remove the trailing empty rows.

Jealie
  • 6,157
  • 2
  • 33
  • 36
  • With a few tweaks this works perfectly! Thank you very much for both a productive answer as well as helping me see data in a whole new light. There were several gems I will carry away from this, so your time is much appreciated. The addition of the BUS1 && BUS2 not NA made for a much smaller results table, also. And, I discovered that the formula for the matrix row size required is ((ncol*(ncol+1))/2)*nrow which was also very helpful. Cheers! – MarkCCooke Feb 14 '14 at 20:05