3

I have a frequency table aggregated from 800 millions of records and am wondering if I can use a package to calculate 1st order transition matrix from the frequency table, which is not symmetric because some state just never happened again. A sample of the frequency table is:

library(data.table)
model.data <- data.table(state1 = c(3, 1, 2, 3), state2 = c(1, 2, 1, 2), Freq = c(1,2,3,4))

model.data looks like this:

state1 state2 n
3 1 1
1 2 2
2 1 3
3 2 4

Using the package pollster, I can compute the proportion table:

library(pollster)
crosstab(model.data, state1, state2, Freq)
state1 1 2 n
1 0 100 2
2 100 0 3
3 20 80 5

However, the symmetric transition matrix I am looking for is:

state1 1 2 3 n
1 0 100 0 2
2 100 0 0 3
3 20 80 0 5

That is, I still want to include the state 3 even though no one transitioned to it, and the code should be able to automatically find out 3 needs to be appended with a column of 0s.

I am not sure if the markovchain package with the markovchainFit function is going to handle my 800 million rows of data that I need to transform into a list of millions of sequences, due to memory constraints and slow computing speed.

Does anyone know?

smz
  • 263
  • 4
  • 11

2 Answers2

5

It appeared you might have known about the stats::xtabs function since, since the result you are asking us to work with appears to be the result of the base::as.data.frame.table function which converts the "wide" result from a table call into a "long" data.frame representation of the same data. (But maybe not since you posted the pollster code that adds an additional confusing column.) Here we will be reversing that procedure so we can recover a matrix (which R table objects inherit from).

Do notice that I'm using your data object, but not using pkg:pollster code, since your tables didn't appear to be based on that data.table object.

How to get the zero column, ... just put in a single zero data element in the state2=3 "column" position. You only need to add one data point in state2 for the entire column, but it obviously needs to be from some state1 value. It can be from any of the state 1 values:

model.data <- data.table(state1 = c(3, 1, 2, 3,  3), 
                         state2 = c(1, 2, 1, 2,  3), 
                         Freq = c(1,2,3,4, 0))
xtabs(Freq~state1+state2, model.data)
#------------
      state2
state1 1 2 3
     1 0 2 0
     2 3 0 0
     3 1 4 0

Note added: Just to show that this works in the "pollster" tidyverse environment ...

> library(pollster)
> crosstab(model.data, state1, state2, Freq)
# A tibble: 3 x 5
  state1   `1`   `2`   `3`     n
   <dbl> <dbl> <dbl> <dbl> <dbl>
1      1     0   100     0     2
2      2   100     0     0     3
3      3    20    80     0     5

And further note that the "n" column would need to be dropped if you wanted to make a transition matrix. (I can't quite figure out what it represents.)

Regarding how to make a transition matrix (if that's what is needed, then divide the matrix by the rowSums result, since transition matrices need to have each row sum to unity)

 mat <- xtabs(Freq~state1+state2, model.data)

 trans_mat <- mat/rowSums(mat)
 trans_mat
#-----
       state2
state1   1   2   3
     1 0.0 1.0 0.0
     2 1.0 0.0 0.0
     3 0.2 0.8 0.0

Now you can calculate the state at any discrete interval using matrix multiplication: See ?'%*%' or matrix exponentiation ?expm::expm

Here's further coding a plots related to matrix operations on transition matrices to generate Markov simulations: Simple Markov Chain in R (visualization)

There are further statistical operations on Markov sequences delivered in the markovchain package, but I did not see that it had anything for actually construction a transition matrix from data. I could be wrong about that since I only read the first 5 packages of the vignette. (They seemed to assume everyone would know how to do that, although when I wrote the code for the answer I linked above, I needed to go back to my books for a refresher.)

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Great answer! Upvoted! I think `proportions` can help shorten the code a bit :) – ThomasIsCoding Jul 31 '21 at 18:36
  • @ThomasIsCoding : I've never used the `proportions` function. Didn't know what it did. It has exactly the same code base as prop.table which is what I've used. – IRTFM Jul 31 '21 at 18:42
  • Thanks a lot! Great thoughts here and learned something! – smz Jul 31 '21 at 20:55
4

An option with igraph

model.data %>%
  setorder(state1) %>%
  graph_from_data_frame() %>%
  as_adjacency_matrix(attr = "Freq", sparse = FALSE) %>%
  proportions(1)  # 1 sets rows as the margin, similar to `prop.table`

gives

    1   2 3
1 0.0 1.0 0
2 1.0 0.0 0
3 0.2 0.8 0

Or with base R

> proportions(xtabs(Freq ~ ., model.data), 1)
      state2
state1   1   2
     1 0.0 1.0
     2 1.0 0.0
     3 0.2 0.8
IRTFM
  • 258,963
  • 21
  • 364
  • 487
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81