
I'm trying to crack an R workflow for parsing SVG paths, using this file on this webpage. I'm encountering artifacts in the positioning of resulting polygons:

enter image description here

Some of the countries do not align with their neighbours - e.g. US/Canada, US/Mexico, Russia/Asian neighbours. Since the effect hits the countries with more complex polygons it seems likely to be a problem to do with cumulative summing, but I'm unclear where the problem lies in my workflow, which is:

  1. parse raw SVG as XML, and extract all the SVG path strings
  2. parse individual path strings with nodejs's svg-path-parser module
  3. process the resulting data.frames (which combine absolute and relative coordinates) into all absolute coordinates

I reproduce the full workflow here using R (for US/Canada), with an external call to nodejs:

require(dplyr)
require(purrr)
require(stringr)
require(tidyr)
require(ggplot2)
require(rvest)
require(xml2)
require(jsonlite)

# Get and parse the SVG
doc = read_xml('https://visionscarto.net/public/fonds-de-cartes-en/visionscarto-bertin1953.svg')

countries = doc %>% html_nodes('.country')
names(countries) = html_attr(countries, 'id')
cdi = str_which(names(countries), 'CIV') # unicode in Cote d'Ivoire breaks the code
countries = countries[-cdi]

# Extract SVG paths and parse with node's svg-path-parser module.
# If you don't have node you can use this instead (note this step might be the problem):
# d = read_csv('https://gist.githubusercontent.com/geotheory/b7353a7a8a480209b31418c806cb1c9e/raw/6d3ba2a62f6e8667eef15e29a5893d9d795e8bb1/bertin_svg.csv')

d = imap_dfr(countries, ~{
  message(.y)
  svg_path = xml_find_all(.x, paste0("//*[@id='", .y, "']/d1:path")) %>% html_attr('d')
  node_call = paste0("node -e \"var parseSVG = require('svg-path-parser'); var d='", svg_path,
                     "'; console.log(JSON.stringify(parseSVG(d)));\"")
  system(node_call, intern = T) %>% fromJSON %>% mutate(country = .y)
}) %>% as_data_frame()


# some initial processing
d1 = d %>% filter(country %in% c('USA United States','CAN Canada')) %>%
  mutate(x = replace_na(x, 0), y = replace_na(y, 0), # NAs need replacing
         relative = replace_na(relative, FALSE),
         grp = (command == 'closepath') %>% cumsum)  # polygon grouping variable

# new object to loop through
d2 = d1 %>% mutate(x_adj = x, y_adj = y) %>% filter(command != 'closepath')

# loop through and change relative coords to absolute
for(i in 2:nrow(d2)){
  if(d2$relative[i]){ # cumulative sum where coords are relative
    d2$x_adj[i] = d2$x_adj[i-1] + d2$x_adj[i]
    d2$y_adj[i] = d2$y_adj[i-1] + d2$y_adj[i]
  } else{ # code M/L require no alteration
    if(d2$code[i] == 'V') d2$x_adj[i] = d2$x_adj[i-1] # absolute vertical transform inherits previous x
    if(d2$code[i] == 'H') d2$y_adj[i] = d2$y_adj[i-1] # absolute horizontal transform inherits previous y
  }
}

# plot result
d2 %>% ggplot(aes(x_adj, -y_adj, group = paste(country, grp))) +
  geom_polygon(fill='white', col='black', size=.3) +
  coord_equal() + guides(fill=F)

enter image description here

Any assistance appreciated. The SVG path syntax is specified at w3 and summarised more concisely here.


Edit (response to @ccprog)

Here is data returned from svg-path-parser for the H command sequence:

  code  command                 x      y relative country   
  <chr> <chr>               <dbl>  <dbl> <lgl>    <chr>     
1 l     lineto              -0.91  -0.6  TRUE     CAN Canada
2 l     lineto              -0.92  -0.59 TRUE     CAN Canada
3 H     horizontal lineto  189.    NA    NA       CAN Canada
4 l     lineto              -1.03   0.02 TRUE     CAN Canada
5 l     lineto              -0.74  -0.07 TRUE     CAN Canada

Here is what d2 looks like for same sequence after the loop:

  code  command                 x     y relative country      grp x_adj y_adj
  <chr> <chr>               <dbl> <dbl> <lgl>    <chr>      <int> <dbl> <dbl>
1 l     lineto              -0.91 -0.6  TRUE     CAN Canada    20  199.  143.
2 l     lineto              -0.92 -0.59 TRUE     CAN Canada    20  198.  143.
3 H     horizontal lineto  189.    0    FALSE    CAN Canada    20  189.  143.
4 l     lineto              -1.03  0.02 TRUE     CAN Canada    20  188.  143.
5 l     lineto              -0.74 -0.07 TRUE     CAN Canada    20  187.  143.

Does this not look OK? When I look at the raw values of y_adj for the H row and the preceding rows, they are identical: 142.56.


Edit 2: working solution, thanks to @ccprog

d = imap_dfr(countries, ~{
  message(.y)
  svg_path = xml_find_all(.x, paste0("//*[@id='", .y, "']/d1:path")) %>% html_attr('d')
  node_call = paste0("node -e \"var parseSVG = require('svg-path-parser'); var d='", svg_path,
                     "'; console.log(JSON.stringify(parseSVG.makeAbsolute(parseSVG(d))));\"")
  system(node_call, intern = T) %>% fromJSON %>% mutate(country = .y)
}) %>% as_data_frame() %>% 
  mutate(grp = (command == 'moveto') %>% cumsum)

d %>% ggplot(aes(x, -y, group = grp, fill=country)) +
  geom_polygon(col='black', size=.3, alpha=.5) +
  coord_equal() + guides(fill=F)
geotheory
  • I've also submitted this to the [svg-path-parser](https://github.com/hughsk/svg-path-parser/issues/16) module on github – geotheory Sep 09 '18 at 09:57
  • I'm not familiar with R, but to me it looks like you divide the path into groups by looking for `closepath` commands, and then take the first `moveto` in each group as the starting point from which to accumulate positions for the conversion to absolute. Two sources of error are: 1. `moveto` commands, apart from the first one, can also be relative (to the last coordinate of the previous group). 2. Groups need not be closed with a `closepath` command. Searching for the opening `moveto` would be more reliable. – ccprog Sep 09 '18 at 15:42
  • Hi @ccprog. I do use `closepath` to create variable `grp` (that identifies unique polygons), but it does not have any role in parsing the actual coordinates. In fact I just use the SVG `relative` field which as I understand specifies when coordinates are relative or absolute. With absolute codes you have to also account for `H`/`V` commands, which inherit the inactive coordinate from the previous point. – geotheory Sep 09 '18 at 15:54

1 Answer

Look at your rendering of Canada, especially the southern coast of Hudson Bay. There is a very obvious error. Sifting through the path data, I found the following sequence in the original data:

h-2.28l-.91-.6-.92-.59H188.65l-1.03.02-.74-.07-.75-.07-.74-.07-.74-.06.88 1.09

I've loaded your rendering result into Inkscape, and drawn the relevant part of the path on top, the arrow marking the segment drawn by the absolute H command. (The z command has been removed, that is the reason for the missing segment.) It is obvious that somewhere in there a segment is too long.

enter image description here

It turns out the absolute H corrects the previous (horizontal) error. Look at the preceding point: it is 198., 143., but it should be 191.76,146.07. The vertical error remains at about -3.6.

I've made a codepen that overlays the original path data with your rendering as precisely as possible. The path data have been divided into the (single-polygon) groups and converted to absolute by Inkscape. Unfortunately, the program cannot convert them to polygon primitives, so there are still V and H commands in there.

It shows this:

  • The starting point of the path matches.
  • The point described by the absolute H command has a matching horizontal value, but not vertical. (It is the only absolute command in the whole path.)
  • Every path group (polygon) seems to be internally consistent, but apart from group0 they are all displaced from their intended place.

I've made some visual measurements of that deviation (error ~0.05), and they ultimately give the clue:

group01: 0.44,-0.73
group02: 0.84,-1.12
group03: 2.04,-1.44
group04: 2.94,-1.73
group05: 2.60,-1.86
group06: 3.14,-2.38
group07: 3.68,-2.54
group08: 4.03,-3.35
group09: 4.87,-2.97
group10: 6.08,-3.50 (begin)
group10: 0.00,-3.53 (end)
group11: 1.08,-1.95
group12: 2.05,-2.45
group13: 2.89,-2.84
group14: 3.64,-3.67
group15: 4.48,-3.44
group16: 4.04,-3.99
group17: 4.32,-3.08
group18: 4.75,-2.75
group19: 5.72,-2.95
group20: 5.40,-3.11
group21: 6.02,-2.95
group22: 6.63,-4.14
group23: 6.85,-5.00
group24: 7.14,-4.86
group25: 7.72,-4.39
group26: 8.65,-4.75
group27: 9.49,-4.39
group28: 10.20,-4.44
group29: 11.13,-4.58

You are removing the closepath commands and then computing the first point of the next group relative to the last explicit point of the previous group. But closepath actually moves the current point: back to the position of the last moveto command. The two positions may, but need not, be identical.

I can't give you a ready script in R, but what you need to do is this: at the beginning of a new group, cache the position of the first point. At the beginning of the next group, compute the new first point relative to that cached point.
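
For illustration only, that bookkeeping could be sketched in base R roughly as follows. This is not a ready script for the map: it assumes the `code`, `x`, `y` and `relative` columns shown in the question, a path containing only M, L, H, V and Z commands (no curves or arcs), and `to_absolute` is a hypothetical helper name, not part of svg-path-parser.

```r
# Hypothetical helper: convert parsed path rows to absolute coordinates.
# Assumes columns code, x, y, relative and only M/L/H/V/Z commands.
to_absolute <- function(d) {
  n <- nrow(d)
  x_abs <- numeric(n)
  y_abs <- numeric(n)
  cur_x <- 0; cur_y <- 0       # current point
  start_x <- 0; start_y <- 0   # cached first point of the current group
  for (i in seq_len(n)) {
    code <- toupper(as.character(d$code[i]))
    rel <- isTRUE(d$relative[i])          # NA counts as absolute
    if (code == "Z") {
      # closepath moves the current point back to the last moveto position
      cur_x <- start_x
      cur_y <- start_y
    } else {
      dx <- if (is.na(d$x[i])) 0 else d$x[i]
      dy <- if (is.na(d$y[i])) 0 else d$y[i]
      # V leaves x untouched, H leaves y untouched
      if (code != "V") cur_x <- if (rel) cur_x + dx else dx
      if (code != "H") cur_y <- if (rel) cur_y + dy else dy
      if (code == "M") {
        # cache the first point of the new group
        start_x <- cur_x
        start_y <- cur_y
      }
    }
    x_abs[i] <- cur_x
    y_abs[i] <- cur_y
  }
  d$x_abs <- x_abs
  d$y_abs <- y_abs
  d
}

# Toy path "M10 10 l5 0 l0 5 z m20 0 l5 0" (made-up data, not from the map)
path <- data.frame(
  code     = c("M", "l", "l", "z", "m", "l"),
  x        = c(10, 5, 0, NA, 20, 5),
  y        = c(10, 0, 5, NA, 0, 0),
  relative = c(NA, TRUE, TRUE, NA, TRUE, TRUE)
)
res <- to_absolute(path)
print(res[, c("code", "x_abs", "y_abs")])
```

In this toy path the relative `m` lands at (30, 10), because it is measured from the cached start point (10, 10) that the `z` returned to. Measuring it from the last explicit point (15, 15) would give (35, 15), which is the kind of drift that accumulates from group to group.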

ccprog
  • Thanks for help ccprog. Good to focus on specifics like this. So it's true I set the NA values for `y` variable of `H` commands to zero. But later I override that with `if(d2$code[i] == 'H') d2$y_adj[i] = d2$y_adj[i-1]` - basically inherit previous `y` value. I include relevant data.frame sections for `d` and `d2` - added to question. – geotheory Sep 09 '18 at 17:14
  • I understand, but I am absolutely sure the culprit is that absolute H command. I've added a screenshot to prove my point. – ccprog Sep 09 '18 at 17:42
  • No, the absolute H _corrects_ the (horizontal) error. Look at the preceding point: it is `198., 143.`, but it should be `191.76,146.07`. The vertical error remains. If, in addition, I account for the vertical error and move your rendering up by dy=-3.6, the very first point of the path data matches. As far as I can judge, all other path groups are internally consistent, but the further down in the path data they are, the more they are off to the bottom left. – ccprog Sep 09 '18 at 19:08
  • This is going to take some thinking. Will come back shortly. – geotheory Sep 09 '18 at 20:07
  • I've made a [codepen](https://codepen.io/anon/pen/WgdMMz) that overlays the original path data with your rendering as precisely as possible. The path data have been divided into the (single-polygon) groups and converted to absolute by Inkscape. Unfortunately, the program cannot convert them to polygon primitives, so there are still V and H commands in there. – ccprog Sep 09 '18 at 20:11
  • I've run a higher resolution plot which should help to clarify the precise nature of the anomaly - https://imgur.com/a/nsyDOnU – geotheory Sep 09 '18 at 20:48
  • Now I've found the error. This is the definite answer. – ccprog Sep 09 '18 at 23:18
  • You. Little. Ripper. This has to be it! Going to complicate my code but will get on it. – geotheory Sep 09 '18 at 23:24
  • ...and now I am scrolling down [this page](https://github.com/hughsk/svg-path-parser#absolute-path-commands) and see that the parser has an option for converting paths to absolute commands. That should make your life easier. – ccprog Sep 09 '18 at 23:28
  • Ho ho, quite an oversight! OK that simplifies everything considerably. The node console.log call becomes `console.log(JSON.stringify(parseSVG.makeAbsolute(parseSVG(d))))` and I group with `mutate(grp = (command == 'moveto') %>% cumsum)`. – geotheory Sep 09 '18 at 23:53
  • You've been massively helpful ccprog - thank you. It's been fun. – geotheory Sep 09 '18 at 23:54
  • And this is useful to me too, because although I don't need to draw Canada, I do need to parse SVG paths in R. I've just learnt something. – Phil van Kleur Oct 29 '20 at 12:48