0

The code below should accept a multiline string as input, in either JSON or YAML. It first attempts to read the input as JSON; if that fails, it makes a second attempt to read it as YAML; if both fail, it returns an error.

The problem is with yaml.Unmarshal(). As far as I can tell, it never returns an error when the input is a JSON string, whether that JSON is correct or incorrect. The main issue is that yaml.Unmarshal never returns an error.

Initially, I thought this was a bug in the yaml.Unmarshal implementation, but it looks to me like it makes a best effort when parsing the input: as long as the structure doesn't violate YAML, it never returns an error.

func SpecsFromString(str string) (*Spec, error) {
    r := strings.NewReader(str)
    return ReadSpec(r)
}

func ReadSpec(b io.Reader) (*Spec, error) {

    var spec Spec

    buffer, err := ioutil.ReadAll(b)
    if err != nil {
        return nil, err
    }

    err = json.Unmarshal(buffer, &spec)
    if err == nil {
        return &spec, nil
    }

    err = yaml.Unmarshal(buffer, &spec)
    if err == nil {
        return &spec, nil
    }

    return nil, &InvalidTenantsSpec{"unknown format"}
}

So my question is: how do I properly test whether the input is JSON or YAML? It looks to me like the only way is for the JSON unmarshaler to differentiate two error cases, since JSON is generally stricter in structure: the input is not JSON at all, or the input is JSON but has an error in its structure. Detecting that second case would let me avoid calling the YAML parser in the first place.

Maybe someone can come up with a neater solution?

Thank you.

Johnny Bonelli
  • Would a regex work to detect whether the input is JSON first? https://stackoverflow.com/a/3710506/13906951 – Clark McCauley Jun 28 '21 at 02:17
  • Not really, because the parser must be intelligent enough to report two different error cases. As far as I checked, Go's JSON package reports only one error: incorrect format. Imagine that the regex check passed and the JSON parse failed later down the line; I then attempt to call yaml and it never fails... I think the only way is to enforce a value in the YAML spec that must always be present. – Johnny Bonelli Jun 28 '21 at 02:25
  • 6
    [YAML is a superset of JSON](https://stackoverflow.com/q/1726802). If it’s an option for you, I would change the function params to require specifying the input format, instead of writing a format-agnostic implementation. – blackgreen Jun 28 '21 at 02:25
  • @blackgreen I agree, I think that's the only way to go: enforce specific mandatory values in the YAML. – Johnny Bonelli Jun 28 '21 at 02:26
  • Couldn't you write a middleware reader that scored input as it processed text, such that occurrences of features distinct to either would be counted, and whichever wins be assumed appropriate? I mean, example rules might be that any line that contains no quotes or braces would be a point for YAML, and that each balanced pair of square- or curly-braces is a point for JSON. It's not infallible, and @blackgreen is quite right to suggest that a flag be passed or separate keys be defined or something. – Sam Hughes Jun 28 '21 at 03:24
  • @SamHughes that's exactly my point. It would be nice to see this from the interface itself, i.e. have Unmarshal expose some state, for example the number of entries it parsed. Of course, I can go and scan all 100 fields in a struct and check what was initialized, what was not, and who wins. As far as I know, Go doesn't expose a struct's fields as a list (that is internal to reflect itself); otherwise, yes, it could be done in more generic code. – Johnny Bonelli Jun 28 '21 at 04:36
  • 1
    @JohnnyBonelli, I added a rudimentary example in an answer. In my example, I don't have any specific scoring criteria, but I demonstrate wrapping the reader and evaluating a portion of the processed input. I used the newline character as a delimiter, but it can be literally anything that you think would give an accurate frame, up to the entire contents of any source files. – Sam Hughes Jun 28 '21 at 09:15
  • @SamHughes there are no "features distinct to" JSON. [Per the YAML spec](http://yaml.org/spec/1.2/spec.html#id2759572), "every JSON file is also a valid YAML file". – Adrian Jun 28 '21 at 13:23
  • @Adrian, noted. I find BlackGreen's suggestion quite practical. Still, in the case where a mis-parse is unreasonable, I'm suggesting a profiler approach. A valid YAML file may have a line described as /\w+:\s\w+/, but that feature of the document, if structural and not inside a string, would be invalid in JSON. Profiling point, YAML. Braces can occur in YAML, but they're superfluous. If balanced braces are observed, profiler points, JSON. Again, though, the profiler suggestion is prefaced with the stipulation that requiring a flag is eminently more reliable. – Sam Hughes Jun 28 '21 at 14:05
  • Here's an alternative approach: Assume the client isn't lying, and let it error if they are. If it's a commandline application, use two different flags for YAML or JSON input. If it takes a filename as an argument, check its file extension. If it's a HTTP server, look at the HTTP request Content-Type header. Depending on what the user or client says it is, pass it naively to the parser and just let it crash - it's a user error if they don't pass what they say they pass. – cthulhu Jun 28 '21 at 14:56
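
The suggestion made in the comments above (let the caller declare the format instead of guessing it) could be sketched roughly as follows. The `Format` type and the `formatFromFilename` and `formatFromContentType` helpers are hypothetical names that do not appear in the question, and the accepted extensions and media types are assumptions about what callers would send.

```go
package main

import (
	"fmt"
	"mime"
	"path/filepath"
	"strings"
)

// Format is a hypothetical enum naming the parser the caller asked for.
type Format int

const (
	FormatUnknown Format = iota
	FormatJSON
	FormatYAML
)

// formatFromFilename picks a format from the file extension.
func formatFromFilename(name string) (Format, error) {
	switch strings.ToLower(filepath.Ext(name)) {
	case ".json":
		return FormatJSON, nil
	case ".yaml", ".yml":
		return FormatYAML, nil
	}
	return FormatUnknown, fmt.Errorf("cannot infer format from %q", name)
}

// formatFromContentType picks a format from an HTTP Content-Type header.
// The YAML media types listed here are assumptions, not an exhaustive list.
func formatFromContentType(header string) (Format, error) {
	mediaType, _, err := mime.ParseMediaType(header)
	if err != nil {
		return FormatUnknown, err
	}
	switch mediaType {
	case "application/json":
		return FormatJSON, nil
	case "application/yaml", "application/x-yaml", "text/yaml":
		return FormatYAML, nil
	}
	return FormatUnknown, fmt.Errorf("unsupported media type %q", mediaType)
}
```

With the format known up front, the reader can call exactly one parser and return that parser's error unchanged, so a malformed JSON document is never silently reinterpreted as YAML.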

3 Answers

1

json.Unmarshal does return SyntaxError on invalid JSON syntax and has other, different errors when the syntax is correct but unmarshaling fails, so you can use that to differentiate.

Concerning YAML, if you use yaml.v3, you can write a custom unmarshaler to access the Node representation of your input, and check whether the root node has the Style Flow set, which means JSON-like syntax. However, YAML is far more permissive even with this syntax (e.g. strings do not need to be quoted, trailing commas in sequences and mappings are allowed) and while you can check the quoting style of contained scalars, the information available will not be enough to ensure that the input is JSON-parseable (trailing commas cannot be detected via this interface).

So the proper way to check whether the input is syntactically valid JSON is to check the returned error of json.Unmarshal.
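
A minimal sketch of that error check, assuming a placeholder `Spec` type with a single field and a YAML decoder passed in as a function so the example only needs the standard library. `readSpec` and `unmarshalFunc` are hypothetical names; `yaml.Unmarshal` has the same signature and could be passed as the second argument.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Spec stands in for the question's type; its single field is an assumption.
type Spec struct {
	Name string `json:"name" yaml:"name"`
}

// unmarshalFunc has the signature shared by json.Unmarshal and yaml.Unmarshal.
type unmarshalFunc func(data []byte, v interface{}) error

// readSpec tries JSON first and falls back to YAML only when the JSON
// failure was a syntax error, i.e. the input is not JSON at all.
func readSpec(data []byte, yamlUnmarshal unmarshalFunc) (*Spec, error) {
	var spec Spec
	err := json.Unmarshal(data, &spec)
	if err == nil {
		return &spec, nil
	}

	var syntaxErr *json.SyntaxError
	if !errors.As(err, &syntaxErr) {
		// Well-formed JSON that does not fit Spec (for example a type
		// mismatch): report it instead of retrying as YAML.
		return nil, fmt.Errorf("wrong JSON structure: %w", err)
	}

	// Not syntactically valid JSON, so give the YAML parser a chance.
	spec = Spec{}
	if err := yamlUnmarshal(data, &spec); err != nil {
		return nil, fmt.Errorf("unknown format: %w", err)
	}
	return &spec, nil
}
```

Note that input which is almost JSON, such as an object with a trailing comma, still produces a syntax error and is therefore handed to the YAML parser, which may accept it.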

flyx
  • That was my initial attempt to fix it, but if you pass a YAML file, it gives the same SyntaxError. If you check the code I provided, err != nil in both cases, whether the input is JSON or not (i.e. a YAML file), so the code moves on to parse YAML; that takes the JSON input and returns nothing, and the last error is never returned. But I'll cross-check again. – Johnny Bonelli Jun 28 '21 at 10:07
  • @JohnnyBonelli Well yes you need to modify the error check, e.g. by adding `if _, ok := err.(*json.SyntaxError); !ok { return nil, &InvalidTenantsSpec{"wrong JSON structure"} }` – flyx Jun 28 '21 at 13:12
1

This is what I was referencing in my comment on the question. This is a simplistic example of a middleware reader.

  1. This pattern allows you to avoid having to fully parse the text body, in case it's unreasonably large
  2. It ideally has no effect on downstream operations, providing a transparent API.

From your example, you'd call something like:

d := WrapReader(b)
buffer, err := ioutil.ReadAll(d)
if err != nil {
    return nil, err
}
// A holds the points scored for JSON, B the points scored for YAML.
if d.A > d.B {
    err = json.Unmarshal(buffer, &spec)
}
// Try YAML if JSON failed or was never the favourite.
if err != nil || d.A <= d.B {
    err = yaml.Unmarshal(buffer, &spec)
}

Effectively, it doesn't change the interface you're dealing with, while giving you some control over how the process goes down. There's plenty of room for improvement, but the above API is offered by the code below:

type Line []byte

// Writable accumulates the current line and the running scores.
type Writable struct {
    Line
    A int // points for JSON
    B int // points for YAML
}

// Decision wraps a reader and scores everything read through it.
type Decision struct {
    io.Reader
    Writable
}

func (d *Decision) Read(b_rx []byte) (int, error) {
    n, err := d.Reader.Read(b_rx)
    // Score only the n bytes that were actually read.
    for _, b_tx := range b_rx[:n] {
        if werr := d.Writable.WriteByte(b_tx); werr != nil {
            return n, werr
        }
    }
    // Pass io.EOF through; otherwise ReadAll would never terminate.
    return n, err
}

func (w *Writable) WriteByte(b byte) error {
    if b == '\n' {
        pJSON, pYAML, err := w.Score()
        if err != nil {
            return err
        }
        w.A += pJSON
        w.B += pYAML
        w.Line = make(Line, 0)
    } else {
        w.Line = append(w.Line, b)
    }
    return nil
}

func (w *Writable) Score() (int, int, error) {
    // Whatever scoring heuristics you can think of.
    return 0, 0, nil
}

// WrapReader returns *Decision so the caller can inspect the scores.
func WrapReader(b io.Reader) *Decision {
    return &Decision{Reader: b}
}
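
As an illustration of what a scoring rule might look like, here is a rough sketch built only from the rules floated in the comments: a line with no quotes or brackets counts for YAML, each balanced bracket pair counts for JSON, and a comma directly before a closing bracket counts against JSON. `scoreLine` and `minInt` are hypothetical helpers, and the sketch deliberately ignores brackets that appear inside string values.

```go
package main

import (
	"bytes"
	"regexp"
)

// trailingComma matches a comma directly before a closing bracket,
// which is invalid in JSON but accepted by YAML flow syntax.
var trailingComma = regexp.MustCompile(`,\s*[\]}]`)

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// scoreLine returns (points for JSON, points for YAML) for one line.
// It is a naive heuristic: it does not know whether a bracket sits
// inside a quoted string.
func scoreLine(line []byte) (pJSON, pYAML int) {
	if len(bytes.TrimSpace(line)) == 0 {
		return 0, 0
	}
	// No quotes or brackets at all: a point for YAML.
	if !bytes.ContainsAny(line, "\"'{}[]") {
		pYAML++
	}
	// Each balanced pair of curly or square brackets: a point for JSON.
	pJSON += minInt(bytes.Count(line, []byte("{")), bytes.Count(line, []byte("}")))
	pJSON += minInt(bytes.Count(line, []byte("[")), bytes.Count(line, []byte("]")))
	// A trailing comma before a closing bracket penalizes JSON.
	if trailingComma.Match(line) {
		pJSON--
		pYAML++
	}
	return pJSON, pYAML
}
```

A `Score` method could delegate to a function like this with the accumulated line and return a nil error.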
Sam Hughes
  • I was just checking whether I can wrap the YAML and JSON parsers themselves and report the number of errors encountered, but I need to look into your idea. Thank you very much for taking the time and explaining your answer. – Johnny Bonelli Jun 28 '21 at 10:12
  • 1
    But a comment on the general approach: While this might work for some common types of input, where YAML and JSON are vastly different, it's probably a poor approach in a technical sense, since you could have a file that looks like 100% valid JSON, except for a single character that's not valid in JSON, but is valid in YAML. The only fool-proof way to know if something is valid YAML and not valid JSON, is to try to parse it as both. – Jonathan Hall Jun 28 '21 at 10:28
  • 1
    As a simple example, a 3-gigabyte array of objects might be valid JSON until a final trailing comma before the closing bracket: `,]`. This would render it invalid JSON, but valid YAML, and this would not likely trigger any "heuristics" other than a strict JSON syntax check. – Jonathan Hall Jun 28 '21 at 10:31
  • @Flimzy Hahaha! Yeah. You were right. This was hastily written, before bed, so I didn't do a lot of proof-reading. Meanwhile, a comma before a square- or curly- brace is absolutely something to penalize the "JSON" decision for. I agree with you that this method could fail, but if there is some meaningful cost to choosing the wrong parser, an approach like this is viable. You point out rightly that my approach was quite naive, and I agree. A hypothetically reliable, production-worthy profiler would be much more robust. – Sam Hughes Jun 28 '21 at 13:46
  • Using heuristics as a cheap "might work" approach could very well be a good approach. I'd probably fall back to an actual parser in case the heuristics are inconclusive. – Jonathan Hall Jun 28 '21 at 14:05
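
One way to combine the two ideas, sketched under assumptions: trust the heuristic scores only when they differ by a clear margin, and otherwise run a strict syntax check with `json.Valid` from the standard library. `chooseParser` and its `margin` parameter are hypothetical; `margin` is expected to be a positive number.

```go
package main

import "encoding/json"

// chooseParser returns "json" or "yaml". It trusts the heuristic scores
// when they differ by at least margin, and otherwise falls back to a
// strict JSON syntax check over the whole input.
func chooseParser(pJSON, pYAML, margin int, data []byte) string {
	switch {
	case pJSON-pYAML >= margin:
		return "json"
	case pYAML-pJSON >= margin:
		return "yaml"
	}
	// Inconclusive scores: json.Valid rejects, for example, a trailing
	// comma before a closing bracket.
	if json.Valid(data) {
		return "json"
	}
	return "yaml"
}
```

The fallback still reads the entire input, so it gives up the benefit of not parsing a very large body; that cost is only paid when the heuristics cannot decide.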
0

I came across the same problem some days ago in bash scripting: How can I detect if a file contains json, yaml or plain text?

My solution was:

process as json

  • can be parsed as json without errors

process as text

  • can be parsed as yaml, but type is just a yaml string

process as yaml

  • can be parsed as yaml, but is not just a yaml string
  • cannot be parsed as json

Bash scripting snippet

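# Succeeds if the file parses as JSON.
parse_as_json() {
  jq -e '.' > /dev/null 2>&1 < "$1"
}

# Succeeds only if the file is YAML that is neither JSON nor a bare string.
parse_as_yaml() {
  local FILE=$1
  parse_as_json "$FILE" && return 1
  parse_as_text "$FILE" && return 1
  yq -e > /dev/null 2>&1 < "$FILE" || return 1
}

# Succeeds if the file parses as YAML but is just a single string.
parse_as_text() {
  [[ $(yq 'type == "string"' 2>&1 < "$1") == true ]]
}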
jpseng