2

I am trying to create a markov transition matrix from sequence of doctor visits for different patients. In my markov model states are the different doctors and connections are visits by patients. A patient can stay with the same provider or transition to another for the next visit. Using that information I need to create a transition matrix.

Here is a part of the data in excel. Data includes more than 30K visits to almost 100 different providers.

Here is the part of the data in excel. data

How can I use this excel data (or csv) and create a Markov transition matrix as number of visits, such as: ....

The matrix I need will look like this:

enter image description here

How can I transform my data to transition matrix with R?

I am fairly new with R and really need help.

Thank you

Dr. Turkuaz
  • 39
  • 1
  • 10
  • 2
    You should create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) showing exactly what your input data looks like. – MrFlick Dec 29 '15 at 21:20

2 Answers2

1

Here's an approach that works with your sample data.

I'll use readxl to get the data and data.table to manipulate it.

Reading data:

library(readxl)
library(data.table)

data <- setDT(read_excel("~/Desktop/Book2.xlsx"))[!is.na(PatId)]

#read_excel doesn't have the option to specify integers... silly...
data[ , (names(data)) := lapply(.SD, as.integer)]

Pre-allocate transition matrix:

provs <- data[ , sort(unique(SeenByProv))]
nprov <- length(provs)

markov <- matrix(nrow = nprov, ncol = nprov,
                 dimnames = list(provs, provs))

Assign row-by-row

for (pr in provs){
  markov[as.character(pr), ] <-
    data[ , {nxt <- SeenByProv[which(SeenByProv == pr) + 1L]
    .(prov = provs, count = 
        sapply(provs, function(pr2) sum(nxt == pr2, na.rm = TRUE)))}, by = PatId
    ][, sum(count), by = prov]$V1
}

This can probably be sped up in a few places, but it works.

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
  • Hi Michael, I have this error when i try to install readxl; install.packages("readxl") Installing package into ‘C:/Users/Memin/Documents/R/win-library/3.0’ (as ‘lib’ is unspecified) package ‘readxl’ is available as a source package but not as a binary Warning in install.packages : package ‘readxl’ is not available (for R version 3.0.3) – Dr. Turkuaz Jan 02 '16 at 20:14
  • Thank you and #read_excel doesn't have the option to specify integers... silly... data[ , (names(data)) := lapply(.SD, as.integer)] seems redundant. – Dr. Turkuaz Jan 03 '16 at 18:13
  • @Pediatrician it's a bit strange, I agree. Feel free to file a feature request on [hadley's GitHub page for the project](https://github.com/hadley/readxl/issues) – MichaelChirico Jan 03 '16 at 18:51
  • 1
    @Pediatrician if you're wondering why I did that, it's just to make sure we don't run into numerical issues if the columns are read as `numeric`. – MichaelChirico Jan 03 '16 at 19:20
1

I wanted to compare my method without using data.table and found it was 45x faster (and probably more straightforward to understand).

First, I time the data.table solution from the accepted answer:

rm(list=ls())
library(readxl)
library(data.table)

############## Using data.table method() ######################
data <- setDT(read_excel("Book2.xlsx"))[!is.na(PatId)]
data[ , (names(data)) := lapply(.SD, as.integer)]
provs <- data[ , sort(unique(SeenByProv))]
nprov <- length(provs)
markov <- matrix(nrow = nprov, ncol = nprov, dimnames = list(provs, provs))

system.time(      ## Timing the main loop
  for (pr in provs){
    markov[as.character(pr), ] <-
      data[ , {nxt <- SeenByProv[which(SeenByProv == pr) + 1L]
      .(prov = provs, count =
          sapply(provs, function(pr2) sum(nxt == pr2, na.rm = TRUE)))}, by = PatId
      ][, sum(count), by = prov]$V1
  }
)
#   user  system elapsed 
#  3.128   0.000   3.135 
table(markov)
#markov
#   0    1    2    3    4    5    6    7    8    9   10   11   13   22  140 
#3003  308   89   34   14   11    6    4    1    3    4    1    1    1    1 

Next using only base R calls:

############## Using all base R calls method() ###################
tm_matrix<-matrix(0, nrow = nprov, ncol = nprov, dimnames = list(provs, provs))
d<-read_excel("Book2.xlsx")
d<-d[!is.na(d$PatId),] # Note: Data is already ordered by PatId, DaysOfStudy

baseR<-function(tm_matrix){
  d1<-cbind(d[-nrow(d),-3],d[-1,-3]); # Form the transitions and drop the DaysofStudy
  colnames(d1)<-c("SeenByProv","PatId","NextProv","PatId2");
  d1<-d1[d1$PatId==d1$PatId2,];       # Drop those transition between different patients
  d1$SeenByProv<-as.character(d1$SeenByProv); # transform to strings to use as rownames
  d1$NextProv  <-as.character(d1$NextProv);   # and column names
  for (i in 1:nrow(d1)){                      # Fill in the transition matrix
    tm_matrix[d1$SeenByProv[i],d1$NextProv[i]]<-tm_matrix[d1$SeenByProv[i],d1$NextProv[i]]+1
  };
  return(tm_matrix)
}
system.time(tm_matrix<-baseR(tm_matrix))
#   user  system elapsed 
#  0.072   0.000   0.072 

table(tm_matrix)
#tm_matrix
#   0    1    2    3    4    5    6    7    8    9   10   11   13   22  140 
#3003  308   89   34   14   11    6    4    1    3    4    1    1    1    1 

all.equal(markov,tm_matrix)
#[1] TRUE

My base-R method is 3.135/0.072 = 43.54 faster

fishtank
  • 3,718
  • 1
  • 14
  • 16