
I'm trying to parse some bizarre data structures I've encountered during my travels. Essentially they resemble Python or JavaScript dictionaries, but both the keys and the values are numeric:

weird <- "{47=4578.0005, 181=23456.7831, 216=7548.2367}"

In the hopes of converting this to a table, I tried transforming it into a more typical dictionary format and parsing it using jsonlite:

library(tidyverse)
library(jsonlite)
weird <- parse_json(str_replace_all(weird, "=", ":"))
#> Error: parse error: invalid object key (must be a string)

The parse_json function rightfully complains that the key isn't a string. I can certainly dissect all of this using str_split, but I was hoping a kind soul might have some insight into a more elegant/compact solution (hopefully avoiding regex altogether) that could parse this into a table like this:

tibble::tribble(
  ~ key,    ~ value,
  47,  4578.0005,
  181, 23456.7831,
  216,  7548.2367 
)
#> # A tibble: 3 x 2
#>     key  value
#>   <dbl>  <dbl>
#> 1    47  4578.
#> 2   181 23457.
#> 3   216  7548.

Thanks!

Xavier GB

3 Answers


JSON requires that its dictionary keys be quoted. While this solution also quotes the values, it should not be hard (if you are certain of the corpus of data) to apply as.numeric to the values.

library(dplyr)
library(tidyr)
library(tibble)   # enframe() lives in tibble
gsub("=", ":", gsub("(\\b|-?)([0-9.]+)\\b", '"\\1\\2"', weird)) %>%
  jsonlite::fromJSON(.) %>%
  enframe() %>%
  unnest(value)
# # A tibble: 3 x 2
#   name  value     
#   <chr> <chr>     
# 1 47    4578.0005 
# 2 181   23456.7831
# 3 216   7548.2367 

This assumes that all keys and values are completely numeric (possibly negative, possibly decimal), but no scientific notation in this pattern. It's not impossible to include that, so if you need to go a little further: https://www.regular-expressions.info/floatingpoint.html, Parsing scientific notation sensibly?, and Regex for numbers on scientific notation? might help.
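For instance, here is a sketch that also quotes numbers in scientific notation; the input is hypothetical, and the pattern follows the spirit of those links rather than covering every corner case:

library(dplyr)
library(tidyr)
library(tibble)
sci <- "{12=4.578e3, 34=-2.3e-4}"   # hypothetical input with exponents
# group 1 captures the whole number, including an optional signed exponent
gsub("=", ":", gsub("(-?[0-9][0-9.]*([eE][-+]?[0-9]+)?)", '"\\1"', sci)) %>%
  jsonlite::fromJSON(.) %>%
  enframe() %>%
  unnest(value) %>%
  mutate_all(as.numeric)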

The next step might be simply

gsub("=", ":", gsub("(\\b|-?)([0-9.]+)\\b", '"\\1\\2"', weird)) %>%
  jsonlite::fromJSON(.) %>%
  enframe() %>%
  unnest(value) %>%
  mutate_all(as.numeric)
# # A tibble: 3 x 2
#    name  value
#   <dbl>  <dbl>
# 1    47  4578.
# 2   181 23457.
# 3   216  7548.

(The apparent loss of decimal places is just a presentation thing, the raw data still has non-integer values.)
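Tibble printing is controlled by the pillar package, which defaults to 3 significant digits; raising that and re-printing shows the stored values in full:

# raise pillar's significant-digit default, then re-print the tibble
options(pillar.sigfig = 10)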

r2evans
  • This is super helpful; I am particularly fond of the `enframe` function, that is a little gem I will surely remember! When I initially posted the question, I immediately doubled back and edited my question to specify I was hoping to avoid regex. I think you posted your answer before those edits were complete, so you answered my unedited question perfectly. However, I suspect that like many users I tend to avoid regex when possible because its complexity is daunting. – Xavier GB Apr 20 '20 at 14:04
  • Okay, I understand. Brian's answer is the closest to being without regex, though `str_remove_all` will need to be replaced. To remove the `{}`, you need `substr`, but it will blindly do so without regard to what the first/last characters *are*. Curious, what's your rationale for *"avoiding regex"*? – r2evans Apr 20 '20 at 15:13
  • I find them difficult to read, and I often make super avoidable mistakes that take me so long to fix. They're obviously a powerful tool that I should get more comfortable with, but given the abundance of parsing functions designed for most common problems, I so rarely require them that the cost-benefit seems low for my day to day. – Xavier GB Apr 20 '20 at 15:53
  • I understand, and https://xkcd.com/1171/. If you check with `stopifnot(startsWith(weird, "{"), endsWith(weird, "}"))` (verify that the first/last chars really are what you expect), and then use `substr(weird, 2, nchar(weird)-1)`, you can use Brian's answer *after* the `str_remove_all` which is using a regex. The other `str_split` calls (which are, IMO, not better than base R's `strsplit`) are using fixed patterns (within a normally-regex argument); assembled below. – r2evans Apr 20 '20 at 16:59
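Assembled, that regex-free route might look like this (a sketch combining the checks from the comment above with the fixed-pattern splits from Brian's answer below):

library(stringr)
library(dplyr)

weird <- "{47=4578.0005, 181=23456.7831, 216=7548.2367}"

# verify the wrapper characters, then strip them positionally (no regex)
stopifnot(startsWith(weird, "{"), endsWith(weird, "}"))
inner <- substr(weird, 2, nchar(weird) - 1)

inner %>%
  str_split(fixed(", "), simplify = TRUE) %>%
  str_split(fixed("="), simplify = TRUE) %>%
  as_tibble() %>%
  mutate_all(as.numeric)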

While you may want a compact, non-regex version, this seems the most digestible and easy to follow to me:

library(stringr)
library(dplyr)

weird <- "{47=4578.0005, 181=23456.7831, 216=7548.2367}"

str_remove_all(weird, "[{}]") %>% 
  str_split(", ", simplify = TRUE) %>% 
  str_split("=", simplify = TRUE) %>% 
  as_tibble() %>% 
  mutate_all(as.numeric)

#> # A tibble: 3 x 2
#>      V1     V2
#>   <dbl>  <dbl>
#> 1    47  4578.
#> 2   181 23457.
#> 3   216  7548.

Created on 2020-04-20 by the reprex package (v0.3.0)

I don't mind making three passes over the data, accomplishing only one thing at each step. I find it much easier to read and reason about.
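To match the column names from the question, a `rename()` at the end of the same pipeline is all that's missing:

str_remove_all(weird, "[{}]") %>% 
  str_split(", ", simplify = TRUE) %>% 
  str_split("=", simplify = TRUE) %>% 
  as_tibble() %>% 
  mutate_all(as.numeric) %>% 
  rename(key = V1, value = V2)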


Update:

While I still think the version above is the R-idiomatic way to do it, it's not an R-idiomatic data structure, so here's a solution using Python via {reticulate}. This is a little more involved than using the JSON parser of {jsonlite} but it turns out you can use numeric values as keys in Python dicts.

We first have to convert the `=` to `:`, and then the string becomes a valid dict. Then we can start a Python context, run a line of Python that assigns the dict to a variable, and return the state to R. After that, it's just a matter of converting the resulting R list into a data frame.

library(reticulate)
library(stringr)
library(glue)
library(tibble)

weird <- '{47=4578.0005, 181=23456.7831, 216=7548.2367}'

weird <- str_replace_all(weird, "=", ": ")

x <- py_run_string(
  glue("weird_dict = {weird}"), 
  convert = TRUE
  )

x$weird_dict %>% 
  unlist() %>% 
  enframe()
#> # A tibble: 3 × 2
#>   name   value
#>   <chr>  <dbl>
#> 1 47     4578.
#> 2 181   23457.
#> 3 216    7548.

Created on 2021-10-21 by the reprex package (v2.0.1)
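A slightly leaner variant of the same idea, sketched with `py_eval()`, evaluates the dict literal as a single Python expression and converts the result back to R in one step:

library(reticulate)
library(stringr)
library(tibble)

weird <- '{47=4578.0005, 181=23456.7831, 216=7548.2367}'

# evaluate the dict literal directly rather than running an assignment
d <- py_eval(str_replace_all(weird, "=", ": "), convert = TRUE)
enframe(unlist(d))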

Brian
  • This might be more robust than the `jsonlite` method when data becomes not-always-numeric, assuming (of course) there are no embedded/escaped commas or equal signs. – r2evans Apr 20 '20 at 05:09
  • My dream approach was something like `parse(text = glue("list({weird})"))` but unfortunately list names in R can't be bare numbers either. – Brian Apr 20 '20 at 05:12

I don't think you can avoid regex altogether but with some minor substitutions it can be read easily by read.table().

wt <- c("{47=4578.0005, 181=23456.7831, 216=7548.2367}")

read.table(text = gsub("[{},]", "\n", wt), sep = "=")

   V1        V2
1  47  4578.001
2 181 23456.783
3 216  7548.237
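If the goal is to sidestep regex entirely, `chartr()` performs the same substitutions character-for-character, with no pattern matching involved (`col.names` here is only to match the headers in the question):

read.table(
  text = chartr("{},", "\n\n\n", wt),  # map each of { } , to a newline
  sep = "=",
  col.names = c("key", "value")
)
# same rows as above, with columns named key and value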
Ritchie Sacramento