I want to build a function that extracts JSON objects from strings in a generic way (for variable string formats) with R on Windows.

Thanks to #SO I am using:

allJSONS <- gregexpr(
  pattern = "\\{(?:[^{}]|(?R))*?\\}",
  perl = TRUE,
  text = jsonString
) %>%
  regmatches(x = jsonString)

This works very well for some strings. For others, the call fails with a warning.

Error:

For some strings I get the following warning:

Warning message: In gregexpr(pattern = "\{(?:[^{}]|(?R))*?\}", perl = TRUE, text = jsonString) : recursion limit reached in PCRE for element 1 consider increasing the C stack size for the R process

The question was answered for Linux here: Error: C stack usage is too close to the limit. In the comments it was advised to ask a new question with the Windows tag.

Reproducible example:

I uploaded sample data to GitHub: https://github.com/TyGu1/findJSON/raw/master/jsonString.RData. (Direct download via load(url(…)) fails for me somehow, but downloading the file manually and then calling load() works.)

(Note that this is only sample data. I am looking for a generic solution.)

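library(magrittr)  # provides %>%

# Path to the manually downloaded file; the file defines `jsonString`
load("jsonString.RData")
allJSONS <- gregexpr(
  pattern = "\\{(?:[^{}]|(?R))*?\\}",
  perl = TRUE,
  text = jsonString
) %>%
  regmatches(x = jsonString)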

Proof that the string actually contains JSON:

library(magrittr)  
library(jsonlite)

rp <- gsub(pattern = "memmCellmemm(", fixed = TRUE, replacement = "", x = jsonString)
rp <- substring(rp, first = 1, last = nchar(rp)-1) 
json <- rp %>% fromJSON

Goal:

Build a function that extracts JSON objects from strings in a generic way (for variable string formats) with R on Windows.

I am aware that I can extract the JSON with the code above:

rp <- gsub(pattern = "memmCellmemm(", fixed = TRUE, replacement = "", x = jsonString)
rp <- substring(rp, first = 1, last = nchar(rp)-1) 

but I need a more generic function, like the regex at the top, because the file formats might differ considerably across input data.

Tlatwork
  • In the string, do you expect one JSON substring or multiple (which are not nested within each other)? It's clear that a JSON substring of your non-JSON string will often contain multiple valid JSON sub-substrings itself. Do you want those as well? – Anders Ellern Bilgrau Jan 13 '20 at 22:11
  • Thanks for the question! If there are multiple, I would like to extract them all, yes. – Tlatwork Jan 15 '20 at 09:05

2 Answers

The issue arises from this part of your regex: (?:[^{}]|(?R))*. Simply changing it to (?:[^{}]+|(?R))* makes the problem disappear. I'm not sure exactly why, but I assume it simplifies the path the regex engine has to take through the string: with [^{}]+ a whole run of non-brace characters is consumed in one iteration of the group, instead of the engine deciding character by character whether to match a non-brace or to recurse into the pattern.

gregexpr(
  pattern = "\\{(?:[^{}]+|(?R))*?\\}",
  perl = TRUE,
  text = jsonString
)

#> [[1]]
#> [1] 14
#> attr(,"match.length")
#> [1] 134539
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE

If, however, you want a more complete solution, you can use something like this (based on this answer):

my_json_string = "adjkd({\"asdasd\": {\"asdasd\": 1234}}{\"asdasd\": 1234})"    
json_regexp = paste0(
    "(?(DEFINE)",
        "(?<number>-?(?=[1-9]|0(?!\\d))\\d+(\\.\\d+)?([eE][+-]?\\d+)?)",
        "(?<boolean>true|false|null)",
        "(?<string>\"([^\"\\\\]*|\\\\[\"\\\\bfnrt\\/]|\\\\u[0-9a-fA-F]{4})*\")",
        "(?<array>\\[(?:(?&json)(?:,(?&json))*)?\\s*\\])",
        "(?<pair>\\s*(?&string)\\s*:(?&json))",
        "(?<object>\\{(?:(?&pair)(?:,(?&pair))*)?\\s*\\})",
        "(?<json>\\s*(?:(?&object)|(?&array)|(?&number)|(?&boolean)|(?&string))\\s*)",
    ")",
    "(?&json)"
)

# The pipe must end the line; a line that starts with %>% is a syntax error in R
gregexpr(json_regexp, my_json_string, perl = TRUE) %>%
    regmatches(x = my_json_string)

#> [[1]]
#> [1] "{\"asdasd\": {\"asdasd\": 1234}}" "{\"asdasd\": 1234}"  
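
One thing to watch for with this pattern: because it ends in (?&json), it also matches bare numbers, strings, and true/false/null that appear outside of any object, possibly with surrounding whitespace. If only objects and arrays are wanted, the matches can be filtered before parsing. The following is a rough sketch, where `keep_containers` is a hypothetical helper name and the `matches` vector is made-up illustration data rather than output from the sample file:

```r
library(jsonlite)

# Hypothetical helper: trim whitespace and keep only the matches that
# start with "{" or "[", i.e. JSON objects and arrays.
keep_containers <- function(matches) {
  trimmed <- trimws(matches)
  trimmed[substr(trimmed, 1, 1) %in% c("{", "[")]
}

# Made-up matches, as the full JSON regex might return them from free text
matches <- c("42 ", " {\"asdasd\": 1234}", "true", "[1, 2]")

containers <- keep_containers(matches)
parsed <- lapply(containers, fromJSON)  # one parsed R object per match
```

Here `containers` holds the object and the array, and `parsed` is a list with one element per remaining match.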

It also worked with your JSON string:

gregexpr(json_regexp, jsonString, perl=T)
#> [[1]]
#> [1] 14
#> attr(,"match.length")
#> [1] 134539
#> attr(,"index.type")
#> [1] "chars"
#> attr(,"useBytes")
#> [1] TRUE
#> attr(,"capture.start")
#>      number     boolean string   array pair object json
#> [1,]      0 0 0       0      0 0     0    0      0    0
#> attr(,"capture.length")
#>      number     boolean string   array pair object json
#> [1,]      0 0 0       0      0 0     0    0      0    0
#> attr(,"capture.names")
#>  [1] "number"  ""        ""        "boolean" "string"  ""        "array"   "pair"    "object"  "json"   
MkWTF
/(([\x20\t\h\r\n]*)(?:(?:{(?:(?2)|,?((?2)"(?:[^\\"\x0\x1\x2\x3\x4\x5\x6\x7\x8\x9\xA\xB\xC\xD\xE\xF\x7F]|\\u[0-9A-F]{4}|\\t|\\r|\\n|\\f|\\b|\\\/|\\\\|\\"|\\")*"(?2)):(?1))+})|(?:\[(?:(?2)|,?(?1))+\])|(?3)|(?:-?(?:[1-9](?:[0-9]+)?|0)(?:\.[0-9]+)?(?:[Ee][+-]?[0-9]+)?)|true|false|null)(?2))/im

I wrote this regex for matching JSON in a string with PHP using PCRE, and it works. It's based on this page. You probably want code in another language, and I'm not familiar with that language. But because it's a regex, I think it can be converted somehow and probably won't differ much.

ezio4df
  • Thanks for your answer. I tried it but got some errors while trying to escape regex characters in R. Then the other answer came, which is reproducible, so I accepted it. Hope that's OK with you. Upvoted your answer! – Tlatwork Jan 16 '20 at 21:12