Using Go, how can I unmarshal a JSON string that contains unprintable ASCII characters?

For example:

testJsonString := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"
var dat map[string]interface{}
err := json.Unmarshal([]byte(testJsonString), &dat)
if err != nil {
    panic(err)
}

Yields:

panic: invalid character '\x10' in string literal

goroutine 1 [running]:
main.main()
    /tmp/sandbox903140350/main.go:14 +0x180

https://play.golang.org/p/mFGWzndDK8V

Unfortunately I do not have control over the source data, so I need a way to ignore or strip out the unprintable characters.

A related data issue is C escape sequences such as \0 and \a, which I also need to strip out. If I replace the string above with the one below, the program fails as well. It fails on essentially any C escape sequence that JSON does not define: https://en.wikipedia.org/wiki/Escape_sequences_in_C

testJsonString := "{\"test_one\" : \"123456789\\a123456\"}"

will error out with

panic: invalid character 'a' in string escape code

goroutine 1 [running]:
main.main()
    /tmp/sandbox322770276/main.go:12 +0x100

This string cannot be unmarshaled either, and it cannot be fixed by checking rune values or Unicode categories, since Go treats it as a backslash followed by the character 'a', both of which are legal characters.

Is there a good way to handle these edge cases?

Brian
  • For the first case, I am able to strip out all non-printable ascii by filtering out all runes with a value less than 32, however the C-style escape sequences are just the basic runes they represent (e.g. ... "55 56 57 92 97 49 50" in the above example) and throw the json decoder for a loop. – Brian Nov 04 '18 at 05:14
  • String literals in Source must be properly escaped. Just fix that. It is unrelated to JSON. – Volker Nov 04 '18 at 07:54
  • @Volker Since I am receiving these files from another group, I am unable to fix the data. It's unlikely they are going to modify their system to accommodate me either. – Brian Nov 04 '18 at 08:55
  • @Brian You misunderstand. The error you are seeing has to do with Go syntax. It’s not JSON-related, and wouldn’t ever happen when you read the JSON from somewhere other than your own code. – Biffen Nov 04 '18 at 10:58
  • @Biffen is right, you aren't really processing JSON in Go, as the input isn't compatible. Your question is really about "how can I do something with this messy data?" – Vorsprung Nov 04 '18 at 11:34
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Jonathan Hall Nov 05 '18 at 06:35
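
The filtering described in the first comment (dropping every rune with a value below 32) could look roughly like the sketch below. `stripControl` is a hypothetical helper name, not something from the thread, and the sketch only covers the first case; it does nothing about sequences such as `\a`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// stripControl (hypothetical helper) removes every rune below 0x20.
// JSON forbids these characters inside strings, and outside strings they
// are at most insignificant whitespace, so dropping them is safe here.
func stripControl(s string) string {
	return strings.Map(func(r rune) rune {
		if r < 0x20 {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, s)
}

func main() {
	raw := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"
	var dat map[string]interface{}
	if err := json.Unmarshal([]byte(stripControl(raw)), &dat); err != nil {
		panic(err)
	}
	fmt.Println(dat["test_one"]) // 123456789123456
}
```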

1 Answer

According to the JSON spec https://jsonapi.org/format/ non printable characters should be URI escaped (or converted to valid unicode escapes)

So here's a converter that turns non-printable characters into their URI-escaped forms. The result can then be fed into Unmarshal.

If this isn't exactly the behaviour you need, modify the converter to remove the characters (with continue), replace them with a question mark rune, or whatever suits your data.

BTW, the second problem with \\a does not "print out as expected" for me. Please give a better example that actually shows the problem you are experiencing

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/url"
        "unicode"
    )

    // safety drops backslashes and URI-escapes every non-printable rune so
    // that the result can be passed to json.Unmarshal.
    func safety(d string) []byte {
        var buffer bytes.Buffer
        for _, c := range d {
            if c == '\\' {
                // Drop the backslash so a sequence like \a becomes a plain 'a'.
                // This also removes legitimate JSON escapes such as \" and \n.
                continue
            }
            if unicode.IsPrint(c) {
                buffer.WriteRune(c)
            } else {
                // For example, "\x10" becomes "%10".
                buffer.WriteString(url.QueryEscape(string(c)))
            }
        }
        return buffer.Bytes()
    }

    func main() {
        testJsonString := "{\"test_one\" : \"123\x10456\x0B789\v123\a456\"}"
        var dat map[string]interface{}
        err := json.Unmarshal(safety(testJsonString), &dat)
        if err != nil {
            panic(err)
        }
        fmt.Printf("%v", dat)
    }

Vorsprung
  • Thanks for the tip! I'll try it out. For the second case, I neglected to mention that if I use `fmt.Println` to print out `testJsonString := "{\"test_one\" : \"123456789\\a123456\"}"` it appears to escape properly - but it fails to unmarshal – Brian Nov 04 '18 at 09:00
  • updated to specifically deal with other "C-type" escape sequences BUT it's not clear from the question if this is an acceptable way to do it. You need to think about how double escaped sequences should be dealt with. Maybe you need a parser – Vorsprung Nov 04 '18 at 10:52
  • ‘*According to the JSON spec https://jsonapi.org/format/ non printable characters should be URI escaped*’ No. According to [*the actual JSON specification*](https://tools.ietf.org/html/rfc8259#section-7) (JSON:API ≠ JSON), non-printable characters *may* be *`\u`-escaped*. – Biffen Nov 04 '18 at 10:56
  • @Biffen yes, I'd forgot about that! Seems the easiest thing in this case is to try uri escaping the dirty data – Vorsprung Nov 04 '18 at 11:15