3

I am trying to scrape data from Missouri's COVID demographics data from this site and put it into RStudio. Also, this is the Tableau URL I found from the HTML: https://results.mo.gov/t/COVID19/views/Demographics/Public-Demographics?:embed=y&:showVizHome=no&:host_url=https%3A%2F%2Fresults.mo.gov%2F&:embed_code_version=3&:tabs=no&:toolbar=no&:showAppBanner=false&:%E2%80%98iframeSizedToWindow%E2%80%99=%E2%80%98true%E2%80%99&:dataDetails=no&:display_spinner=no&:loadOrderID=0

I've been using this forum here for guidance on how to scrape a public tableau dashboard into R but it doesn't seem to give me any rows or columns for my data frame.

Here is my code I tried to recreate:

library(rvest)
library(rjson)
library(httr)
library(stringr)

tableauHost <- "https://results.mo.gov"

url <- httr::modify_url(tableauHost,
                        path = "/t/COVID19/views/Demographics/Public-Demographics",
                        query = list(":embed" = "y",
                                     ":showVizHome" = "no",
                                     ":host_url" = "https://results.mo.gov/",
                                     ":embed_code_version" = 3,
                                     ":tabs" = "no",
                                     ":toolbar" = "no",
                                     ":showAppBanner" = "false",
                                     ":'iframeSizedToWindow'" = "'true'",
                                     ":dataDetails" = "no",
                                     ":display_spinner" = "no",
                                     ":loadOrderID" = 0)
                        )

body <- read_html(url)
data <- body %>% 
  html_nodes("textarea#tsConfigContainer") %>% 
  html_text()
json <- fromJSON(data)

url <- modify_url(tableauHost, path = paste(json$vizql_root, "/bootstrapSession/sessions/", json$sessionid, sep =""))

resp <- POST(url, body = list(sheet_id = json$sheetId), encode = "form")
data <- content(resp, "text")

extract <- str_match(data, "\\d+;(\\{.*\\})\\d+;(\\{.*\\})")
info <- fromJSON(extract[1,1])
data <- fromJSON(extract[1,3])

worksheet <- "+ PCR by age"

columnsData <- data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap[[worksheet]]$presModelHolder$genVizDataPresModel$paneColumnsData

i <- 1
result <- list();
for(t in columnsData$vizDataColumns){
  if (is.null(t[["localBaseColumnName"]]) == FALSE) {
    result[[i]] <- list(
      localBaseColumnName = t[["localBaseColumnName"]], 
      valueIndices = columnsData$paneColumnsList[[t$paneIndices + 1]]$vizPaneColumns[[t$columnIndices + 1]]$valueIndices,
      aliasIndices = columnsData$paneColumnsList[[t$paneIndices + 1]]$vizPaneColumns[[t$columnIndices + 1]]$aliasIndices, 
      dataType = t[["dataType"]],
      stringsAsFactors = FALSE
    )
    i <- i + 1
  }
}
dataFull = data$secondaryInfo$presModelMap$dataDictionary$presModelHolder$genDataDictionaryPresModel$dataSegments[["0"]]$dataColumns

buildOffset = function(x){
  return(if(x>=0) 0 else -1)
}

data_index <- 1
name_index <- 1
frameData <-  list();
frameNames <- c()
for(t in dataFull) {
  for(index in result) {
    if (t$dataType == "cstring"){
      if (length(index$valueIndices) > 0) {
        j <- 1
        vector <- character(length(index$valueIndices))
        for (it in index$valueIndices){
          vector[j] <- t$dataValues[abs(it)+1]
          j <- j + 1
        }
        frameData[[data_index]] <- vector
        frameNames[[name_index]] <- paste(index$localBaseColumnName, "value", sep="-")
        data_index <- data_index + 1
        name_index <- name_index + 1
      }
      if (length(index$aliasIndices) > 0) {
        j <- 1
        vector <- character(length(index$aliasIndices))
        for (it in index$aliasIndices){
          vector[j] <- t$dataValues[abs(it) + buildOffset(it) + 1]
          j <- j + 1
        }
        frameData[[data_index]] <- vector
        frameNames[[name_index]] <- paste(index$localBaseColumnName, "alias", sep="-")
        data_index <- data_index + 1
        name_index <- name_index + 1
      }
    }
  }
}

columnToKeep = c('[Student Aid Program Type]-value','[Student Aid Program]-value', '[:Measure Names]-alias', '[Multiple Values]-alias')
df <- NULL
for(i in 1:length(frameNames)){
  if (frameNames[i] %in% columnToKeep){
    df[frameNames[i]] <- frameData[i]
  }
}
options(width = 1200) #for readability
df <- as.data.frame(df, stringsAsFactors = FALSE)
print(df)

but it is only giving me this as output

print(df)
data frame with 0 columns and 0 rows

I know the columnToKeep part where we define the columns for the data frame will not match the data I'm looking for, so would the incorrect number of columns affect this? I've been trying to decipher the original code and implement it with the Missouri COVID's dashboard but seeing it makes me extremely confused everytime. Any help with this would be greatly appreciated!!

Doughey
  • 147
  • 7

1 Answers1

3

The issue is that it needs to deal with aliasIndices in a different way. The flow is the following :

  • select the worksheet under

    data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap
    

Under presModelHolder$genVizDataPresModel$paneColumnsData :

  • get all the columns under vizDataColumns, note the fieldCaption as column name
  • note the paneIndices and columnIndices for all columns
  • for each columns, under paneColumnsList, assign the row to the specified columnIndices and the value under vizPaneColumns which gives valueIndices and aliasIndices which are indices to the actual data in the dictionnary

In the dictionnary (under data$secondaryInfo$presModelMap$dataDictionary$presModelHolder$genDataDictionaryPresModel$dataSegments), you have the full data. You just need to match the indices you have previously extracted. In order to match those indices, you need to have the dataType of the column to know which object to select in the dictionnary then :

  • for valueIndices just assign indices in the list
  • for aliasIndices, if the value is > 0 assign the indice, if it's a negative number you match the indices in the list with dataType "cstring"

I'm not sure if this algo will work for all tableau data but it's working very well for a few I've tested :

library(rvest)
library(rjson)
library(httr)
library(stringr)

#replace the hostname and the path if necessary
host_url <- "https://results.mo.gov"
path <- "/t/COVID19/views/Demographics/Public-Demographics"

body <- read_html(modify_url(host_url, 
                             path = path, 
                             query = list(":embed" = "y",":showVizHome" = "no")
))

data <- body %>% 
  html_nodes("textarea#tsConfigContainer") %>% 
  html_text()
json <- fromJSON(data)

url <- modify_url(host_url, path = paste(json$vizql_root, "/bootstrapSession/sessions/", json$sessionid, sep =""))

resp <- POST(url, body = list(sheet_id = json$sheetId), encode = "form")
data <- content(resp, "text")

extract <- str_match(data, "\\d+;(\\{.*\\})\\d+;(\\{.*\\})")
info <- fromJSON(extract[1,1])
data <- fromJSON(extract[1,3])

worksheets = names(data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap)

for(i in 1:length(worksheets)){
  print(paste("[",i,"] ",worksheets[i], sep=""))
}
selected <-  readline(prompt="select worksheet by index: ");
worksheet <- worksheets[as.integer(selected)]
print(paste("you selected :", worksheet, sep=" "))

columnsData <- data$secondaryInfo$presModelMap$vizData$presModelHolder$genPresModelMapPresModel$presModelMap[[worksheet]]$presModelHolder$genVizDataPresModel$paneColumnsData

i <- 1
result <- list();
for(t in columnsData$vizDataColumns){
  if (is.null(t[["fieldCaption"]]) == FALSE) {
    paneIndex <- t$paneIndices
    columnIndex <- t$columnIndices
    if (length(t$paneIndices) > 1){
      paneIndex <- t$paneIndices[1]
    }
    if (length(t$columnIndices) > 1){
      columnIndex <- t$columnIndices[1]
    }
    result[[i]] <- list(
      fieldCaption = t[["fieldCaption"]], 
      valueIndices = columnsData$paneColumnsList[[paneIndex + 1]]$vizPaneColumns[[columnIndex + 1]]$valueIndices,
      aliasIndices = columnsData$paneColumnsList[[paneIndex + 1]]$vizPaneColumns[[columnIndex + 1]]$aliasIndices, 
      dataType = t[["dataType"]],
      stringsAsFactors = FALSE
    )
    i <- i + 1
  }
}
dataFull = data$secondaryInfo$presModelMap$dataDictionary$presModelHolder$genDataDictionaryPresModel$dataSegments[["0"]]$dataColumns

cstring <- list();
for(t in dataFull) {
  if(t$dataType == "cstring"){
    cstring <- t
    break
  }
}
data_index <- 1
name_index <- 1
frameData <-  list()
frameNames <- c()
for(t in dataFull) {
  for(index in result) {
    if (t$dataType == index["dataType"]){
      if (length(index$valueIndices) > 0) {
        j <- 1
        vector <- character(length(index$valueIndices))
        for (it in index$valueIndices){
          vector[j] <- t$dataValues[it+1]
          j <- j + 1
        }
        frameData[[data_index]] <- vector
        frameNames[[name_index]] <- paste(index$fieldCaption, "value", sep="-")
        data_index <- data_index + 1
        name_index <- name_index + 1
      }
      if (length(index$aliasIndices) > 0) {
        j <- 1
        vector <- character(length(index$aliasIndices))
        for (it in index$aliasIndices){
          if (it >= 0){
            vector[j] <- t$dataValues[it+1]
          } else {
            vector[j] <- cstring$dataValues[abs(it)]
          }
          j <- j + 1
        }
        frameData[[data_index]] <- vector
        frameNames[[name_index]] <- paste(index$fieldCaption, "alias", sep="-")
        data_index <- data_index + 1
        name_index <- name_index + 1
      }
    }
  }
}

df <- NULL
lengthList <- c()
for(i in 1:length(frameNames)){
  lengthList[i] <- length(frameData[[i]])
}
max <- max(lengthList)
for(i in 1:length(frameNames)){
  if (length(frameData[[i]]) < max){
    len <- length(frameData[[i]])
    frameData[[i]][(len+1):max]<-""
  }
  df[frameNames[[i]]] <- frameData[i]
}
options(width = 1200)
df <- as.data.frame(df, stringsAsFactors = FALSE)
print(df)

I've made a repository with both R and Python scripts here

Bertrand Martel
  • 42,756
  • 16
  • 135
  • 159
  • I was able to run the code on Python, but there is a few issues with the R code using RStudio.. I’m not sure if I’m doing it correctly, but it will not let me input a number to select a worksheet. The code will wait and it is running forever as I’ve been putting in a number. Even tried hardcoding a number in but I get this error message saying ` > max <- max(lengthList) Error in max(lengthList) : invalid 'type' (list) of argument ` – Doughey Oct 09 '20 at 02:29
  • @Doughey I run the script using Rscript so not sure what is the error maybe the stdin part. You can test the code in python too, it's located on this repo : https://github.com/bertrandmartel/tableau-scraping – Bertrand Martel Oct 09 '20 at 02:30
  • It may be because I am using the RStudio IDE that isn't wanting to work: https://snipboard.io/wIajVC.jpg But even after hardcoding a worksheet number in, I still get this error message in the RStudio console: https://snipboard.io/mhAqtf.jpg – Doughey Oct 09 '20 at 02:37
  • @Doughey it's `selected` variable that should be set to 1 in this case – Bertrand Martel Oct 09 '20 at 02:39
  • hmm I've changed the `selected` variable to be 1 and it is still giving me the same error as seen here: https://snipboard.io/AVdXKu.jpg – Doughey Oct 09 '20 at 02:45
  • I've also ran it using `Rscript file` on CMD and the error still persists. I'm not sure what I'm doing wrong on my end :( – Doughey Oct 09 '20 at 02:49
  • @Doughey I'm installing RStudio to check on that – Bertrand Martel Oct 09 '20 at 02:49
  • Thank you so much!! I'm reading on nested lists in R. I don't think RStudio (or my computer) is processing it correctly – Doughey Oct 09 '20 at 02:57
  • @Doughey I've updated the post, it's working now in RStudio – Bertrand Martel Oct 09 '20 at 03:09
  • I was able to figure it out. For the first part, it looks like I am just going to make hardcoded `selected` variable since RStudio does not want to take any input, but using `Rscript` will take it I changed the issue with `max` function to `max <-max(as.numeric(unlist(lengthList)))` and `df[frameNames[[i]]] <- frameData[i]` After that everything worked perfectly! However I wouldn't have gotten this without your help, so thank you again so so much – Doughey Oct 09 '20 at 03:10
  • input is working in the new code above in RStudio (tested on Windows) – Bertrand Martel Oct 09 '20 at 03:11
  • oh wow you beat me to it! LOL sorry about the confusion – Doughey Oct 09 '20 at 03:13
  • @Doughey I think Rscript on my other system used a different R version maybe much older that the new one – Bertrand Martel Oct 09 '20 at 03:14
  • 1
    ahh i see. Yeah I was looking it up and these are posts from 2013-2014 hahah. But anyways, I really appreciate your time and effort with helping me. I hope you are healthy during these times and stay awesome just the way you are! – Doughey Oct 09 '20 at 03:18
  • @BertrandMartel Can you explain why there's a difference in valueIndices vs aliasIndices? If the value index is negative, why don't we do the same as what we do with the alias index and grab the cstring value at abs(idx) - 1? Does that mean that all cstring values are part of alias columns? I'm just really confused at your intuition behind this logic. I'd appreciate it if you could explain your thought process, that would be really helpful! – Khalil Hijazi Oct 16 '20 at 13:57
  • @KhalilHijazi good question, I've performed a few tests on these. I don't know if this algo is correct but it seems to work on most of dashboard I've encountered. My understanding is that aliasIndices are indices for "alias" values which would be some kind of label so always string imho. – Bertrand Martel Oct 16 '20 at 14:48
  • @KhalilHijazi there is still work in progress for the understanding of these negative values and offset for aliasIndices – Bertrand Martel Oct 16 '20 at 14:53
  • @BertrandMartel Hmm, yeah I just wasn't sure about it. I read online and it seemed like Tableau has this feature where you can change the label of a value column by giving it an "alias" name. This would present the column header with the alias name in the UI, but in the backend, you'd have the original value column's header name stored. So I was thinking that both value and alias could be string data types maybe? – Khalil Hijazi Oct 16 '20 at 15:19
  • @BertrandMartel The logic seems correct, I minified the json output by a good amount and ran through it visually to verify a few examples. However, I noticed that if the tableau dashboard has something like a button that refreshes the data being displayed, that newly displayed data is not included in the json output. – Khalil Hijazi Oct 19 '20 at 13:59
  • 2
    @KhalilHijazi I've implemented the `/select` and `/set-parameter-value` endpoint which gives the ability to click on a clickable item or select a value in a dropdown (for those which have ones). I will release it on Github this week on [this repo](https://github.com/bertrandmartel/tableau-scraping) – Bertrand Martel Oct 19 '20 at 14:03
  • @KhalilHijazi do you have a sample public dashboard for me to check the "refresh button" element ? – Bertrand Martel Oct 19 '20 at 14:04
  • Cool, that's awesome! Could you explain how you're able to do that? Is it something in the http requests? – Khalil Hijazi Oct 19 '20 at 14:04
  • 1
    @KhalilHijazi that's just a call to `/commands/tabdoc/set-parameter-value` for dropdown and `/commands/tabdoc/select` for selection. The hard part is to filter the dropdown items with a `parameterControl` field and for field that can be selected (in a map for example) there is a property `isAutoSelect` in the columns object. Then the response of those API call is not organized the same way as the first call (dashboard) so it needs additionnal parsing but it's the same spirit – Bertrand Martel Oct 19 '20 at 14:07
  • 1
    @KhalilHijazi also I've fixed some things in the code. For example in the code above `dataSegments[["0"]]` only takes the first element in the field. But when there are more than 1 element in dataSegments, you need to merge the two fields so that indices can match – Bertrand Martel Oct 19 '20 at 14:13
  • @BertrandMartel Ah that's what those values were haha. How exactly do you make the call to the /commands/... property? Do you include it as part of the url? Also, yup that makes sense. – Khalil Hijazi Oct 19 '20 at 14:13
  • 1
    @KhalilHijazi an example in [this post](https://stackoverflow.com/questions/63822827/how-to-scrape-a-tableau-dashboard-in-which-data-is-only-displayed-in-a-plot-afte/64190140#64190140) – Bertrand Martel Oct 19 '20 at 14:15
  • 1
    Cool, thanks! I'll check it in a bit and update you if I have any issues – Khalil Hijazi Oct 19 '20 at 14:16
  • 1
    @KhalilHijazi note that I don't take into account `fieldRole` anymore in my script in the above link – Bertrand Martel Oct 19 '20 at 14:17
  • Yup, same here! – Khalil Hijazi Oct 19 '20 at 14:17
  • @BertrandMartel How do you know which localbasecolumnname value to check for? In the post you linked, you used [state_name] and I know that's obvious for that example, but I'm sure others might be less obvious than that. Is there an approach you're using for that? – Khalil Hijazi Oct 19 '20 at 19:27
  • 1
    @KhalilHijazi it has `isAutoSelect` set to true, and you will have to choose worksheet name/column name/value to select it – Bertrand Martel Oct 19 '20 at 19:35
  • @BertrandMartel I tried doing what you mentioned in the post you linked, and I got the data values I need. However, it seems like whenever there are duplicates of a value, the data values property only holds one copy of that value. Would that cause an issue with index mappings? – Khalil Hijazi Oct 20 '20 at 15:41
  • @KhalilHijazi what do you mean by "duplicate of a value" in the indices ? There are often the same values repeated with indices like [23,23,23,23....] or [-600,-600,-600...] when the value pointed by this index is `%null%` in the cstring array – Bertrand Martel Oct 20 '20 at 15:44
  • @BertrandMartel I mean for example, if I'm working with county data for a specific state and two counties have the same value in the dashboard, that value is only included once in the dataValues json property, so I'm not sure if the alias / value indices help with matching in such cases. – Khalil Hijazi Oct 20 '20 at 15:52
  • I don't think there will be any problem but if you have one example that's not working you can send me the dashboard url and the data you want to pick – Bertrand Martel Oct 20 '20 at 17:44
  • @BertrandMartel Okay, I just have one issue. I've played around with your implementation, and I've changed things up on my end for performance reasons, style, and whatnot, but there seems to be some issue with the `alias` / `value` indexing. This is especially the case for select commands. Sometimes, it will throw an index error and other times, the index will work correctly. Is that because I don't have a time delay between the series of http requests? Or is the index mapping function just have a bug in the logic? – Khalil Hijazi Oct 30 '20 at 05:55
  • @BertrandMartel Just as a heads up, there seems to be an issue for select commands as I mentioned earlier. Even if I run the repl links on your repo for the select command, it throws an error sometimes. It seems like sometimes, they throw in a datavalues column with an index that's out of range for that list of values, and that's what causes the error. Perhaps, we're not filtering the zones out correctly. – Khalil Hijazi Nov 20 '20 at 21:14