GAM with mrf smooth - errors (mismatch between nb/polys area names and data area names

Question

I am trying to fit Polish local government election results in 2015 following the superb blog by @GavinSimpson. https://www.fromthebottomoftheheap.net/2017/10/19/first-steps-with-mrf-smooths/ I joined my xls data with the shp data using a 6 digit identifier (there may be leading 0's). I kept it as a text variable. EDIT, I simplified the identifier and am now using a sequence from 1 to nrow to simplify my question.

library(tidyverse)
library(sf)
library(mgcv)

# Read data
# From https://www.gis-support.pl/downloads/gminy.zip shp file

boroughs_shp <- st_read("../../_mapy/gminy.shp",options = "ENCODING=WINDOWS-1250",
                     stringsAsFactors = FALSE ) %>% 
  st_transform(crs = 4326)%>% 
  janitor::clean_names() %>% 
# st_simplify(preserveTopology = T, dTolerance = 0.01) %>% 
  mutate(teryt=str_sub(jpt_kod_je, 1, 6)) %>% 
  select(teryt, nazwa=jpt_nazwa, geometry)

# From https://parlament2015.pkw.gov.pl/wyniki_zb/2015-gl-lis-gm.zip data file
elections_xls <-
  readxl::read_excel("data/2015-gl-lis-gm.xls",
             trim_ws = T, col_names = T) %>% 
  janitor::clean_names() %>% 
  select(teryt, liczba_wyborcow, glosy_niewazne)

elections <-
  boroughs_shp %>% fortify() %>% 
  left_join(elections_xls, by = "teryt") %>% 
  arrange(teryt) %>%
  mutate(idx = seq.int(nrow(.)) %>% as.factor(), 
         teryt = as.factor(teryt)) 

# Neighbors

boroughs_nb <-spdep::poly2nb(elections, snap = 0.01, queen = F, row.names = elections$idx )
names(boroughs_nb) <- attr(boroughs_nb, "region.id")

# Model

ctrl <- gam.control(nthreads = 4) 
m1 <- gam(glosy_niewazne ~ s(idx, bs = 'mrf', xt = list(nb = boroughs_nb)), 
          data = elections,
          offset = log(liczba_wyborcow), # number of votes
          method = 'REML', 
          control = ctrl,
          family = betar())

Here is the error message:

    Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) : 
  mismatch between nb/polys supplied area names and data area names
In addition: Warning message:
In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch between nb/polys supplied area names and data area names") :
  the condition has length > 1 and only the first element will be used

elections$idx is a factor. I am using it to give names to boroughs_nb to be absolutely sure I have the same number of levels. What am I doing wrong?

EDIT: The condition mentioned in error message is met:

> all(sort(names(boroughs_nb)) == sort(levels(elections$idx)))
[1] TRUE

Make sure that `teryt` is a factor; at least one of the warnings is complain about that. — Gavin Simpson, Jun 12 '19 at 17:41
@GavinSimpson Thank you very much for helping me out. I am trying to follow your hint. It seems that poly2nb output must also be a 'class factor' somehow. But it is a list with names (char) and numeric values for neighbors of varying length. How to make it acceptable for mrf? Can names of poly2nb output be factors at all? What about the unequal lists of numerics for neighbors? — Jacek Kotowski, Jun 13 '19 at 06:43
You are setting up the smooth as `s(teryt, bs = "mrf", ...)` and the warning implies that `teryt` is not a factor. Is `teryt` a factor in `df_wybory`? Does that factor have *exactly* the same levels as the names of the object you pass to `nb`? — Gavin Simpson, Jun 13 '19 at 15:39
@GavinSimpson I have simplified the code and shortened the question. Now in the code I am using an idx column which is just an index from 1 to nrow, turn it to factor. Later I will use it to build a nb list, so that I am sure about the number of levels. — Jacek Kotowski, Jun 13 '19 at 16:26
@GavinSimpson the factors part disappeared, now I am dealing with error with one warning only: In if (all.equal(sort(a.name), sort(levels(k))) != TRUE) stop("mismatch between nb/polys supplied area names and data area names") — Jacek Kotowski, Jun 13 '19 at 16:35
That message is telling you that `all(sort(names(boroughs_nb)) == sort(levels(idx)))` must be `TRUE` and it isn't; the levels of `idx` don't match the names of the regions in the neighbour list. Basically you need to have a factor with levels that match the names on the neighbour list you pass to `xt`. — Gavin Simpson, Jun 13 '19 at 16:50
@GavinSimpson I am trying to make index as simple as possible... now the test is passed all(sort(names(boroughs_nb)) == sort(levels(elections$idx))) gives [1] TRUE but the "mismatch" error still remains — Jacek Kotowski, Jun 13 '19 at 16:56
What does `all.equal(sort(names(boroughs_nb)), sort(levels(elections$idx)))` say? It's failing this check so something isn't right. That's also a bug in mgcv (the warning), but that this is cropping up indicates that mgcv doesn't think these are equal — Gavin Simpson, Jun 13 '19 at 17:39
@GavinSimpson all.equal is ok too. I am not fluwbt enough in R to try other approach ie turn nb list items all from numeric to factors with equal levels. Maybe this is expected to be factors too? I was trying this but the sp_touches does not go through al data and throws a single “geometry error and does not return result. poly2nb returns result but the elements of a list are numeric not factor... :https://stackoverflow.com/a/49299799/3480717 — Jacek Kotowski, Jun 13 '19 at 21:11
Any chance you can make the data available so I can run this and look? Feel free to email me; you'll find my gmail address at the blog post you cited in your Q. — Gavin Simpson, Jun 13 '19 at 21:25
@GavinSimpson i sent you an e-mail with my original script and links to data. I am now also thinking maybe it is a problem with corrupt geometry. — Jacek Kotowski, Jun 14 '19 at 16:44
@GavinSimpson I have now also tried to turn all the neighbor lists to factors and tried a method st_touches instead of poly2nb it seems that now I have all I could as factor: boroughs_nb <- lwgeom::st_make_valid(elections) %>% st_touches(sparse = T) %>% map(function(x) factor(x, levels = elections$idx)) names(boroughs_nb) <- elections$idx and all.equal(sort(names(boroughs_nb)), sort(levels(elections$idx))) returns TRUE — Jacek Kotowski, Jun 17 '19 at 09:30

score 1 · Answer 1 · answered Jun 21 '19 at 13:14

It seems that I solved the issue, maybe not quite realizing how it did being stat beginner.

First, not a single NA should be present in modeled data. There was one. After that the mcgv seemed to run, but it took long time (quarter of an hour) and inexplicably for me, only when I limited no of knots to k=50, with poor results (less or more and it did not return any result) and with warning to be cautious about results. Then I tried to remove offset=log(liczba_wyborcow) ie offset number of voters and made number of void votes per 1000 my predicted variable.

elections <-
 boroughs_shp %>%  
 left_join(elections_xls, by = "teryt") %>% na.omit() %>% 
 arrange(teryt) %>% 
 mutate(idx = row_number() %>% as.factor()) %>% 
 mutate(void_ratio=round(glosy_niewazne/liczba_wyborcow,3)*1000)

Now that it is a count, why not try change family = betar() in gam formula to poisson() - still not a good result, and then to negative binomial family = nb() Now my formula looks like

m1 <-
gam(
 void_ratio ~ s(
 idx,
 bs = 'mrf',
 k =500,
 xt = list(nb = boroughs_nb),
 fx = TRUE),
 data = elections_df,
 method = 'REML', 
 control = gam.control(nthreads = 4),
 family = nb()
)

It seems now to be blazingly fast and return valid results with no warnings or errors. On a laptop with 4 cores Intel Core I7 6820HQ @ 2.70GHZ 16GB Win10 it takes now 1-2 minutest to build a model.

In brief, what I changed was: remove a single NA, remove offset from formula and use negative binomial distribution.

Here is the result of what I wanted to achieve, from left to right, real rate of void votes, a rate smoothed by a model and residuals indicating discrepancies. The mcgv code let me do that.

score 1 · Answer 2 · answered May 13 '23 at 00:44

To anyone who also finds this post years later: I don't know if I was facing the same problem as the original poster, but it presented in the same way.

I determined that the cause of my issue was that not all levels of the factor (ix in the above example) had rows associated for them. So in the above example, the equivalent would be boroughs for which there were no records in the dataset.

Because of this, when the model looked at the factor levels in the dataset, it saw only a subset of the levels defined in the neighborhood structure (boroughs_nb in the above example), which would cause the mismatch error.

Once I restructured my data so that every factor level had records associated with it, the error went away and the model fit correctly.

GAM with mrf smooth - errors (mismatch between nb/polys area names and data area names

2 Answers2