1

I have a JSON string that I scraped from a website. I only need the following data (the original string is much longer). Here is the retrieved JSON, which I plan to convert into a Ruby Hash:

{"day": 15, "month": 03, "year": 2012, "hour": 10, "min": 00, "sec": 00}

I retrieved the above JSON using this regex:

/targetDate:\s+(.*?)}\)/m

I cannot parse the JSON above because of the extra zeroes in the integers (00 and 03). When I manually changed 03 to 3 and 00 to 0, parsing worked.

So I guess the JSON parser cannot handle numbers written that way.

The question is: how do I clean the retrieved JSON above to remove the unnecessary zeroes? That is, I want:

{"day": 15, "month": 3, "year": 2012, "hour": 10, "min": 0, "sec": 0}

Thanks for all the help!

nmenego
  • This is the error message when I try to parse it: _710: unexpected token at '{"day": 15, "month": 03, "year": 2012, "hour": 10, "min": 00, "sec": 00}'_ – nmenego Mar 14 '12 at 07:22
  • You are correct. I forgot that JSON *forbids* numbers (except 0.xyz) from beginning with zero. –  Mar 14 '12 at 07:24
  • Minor correction: JSON forbids any number except *`0` itself* and `0.xyz` from beginning with zero. – Mad Physicist Jul 05 '17 at 19:11
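
The rule from the comments above can be checked directly. This is a minimal sketch, assuming Ruby's standard `json` library; `try_parse` is a hypothetical helper name introduced only for this illustration.

```ruby
require 'json'

raw   = '{"day": 15, "month": 03, "year": 2012, "hour": 10, "min": 00, "sec": 00}'
clean = '{"day": 15, "month": 3, "year": 2012, "hour": 10, "min": 0, "sec": 0}'

# Hypothetical helper: returns the parsed Hash, or nil when the text is not valid JSON.
def try_parse(text)
  JSON.parse(text)
rescue JSON::ParserError
  nil
end

rejected = try_parse(raw)    # nil: 03 and 00 are not valid JSON numbers
accepted = try_parse(clean)  # a Hash with integer values
```

Only the second string parses, which matches what the asker saw when editing the numbers by hand.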

3 Answers

1

Try this regexp:

json = '{"day": 15, "month": 03, "year": 2012, "hour": 10, "min": 00, "sec": 00}'
json.gsub(/\b0*(\d+)/, '\1')
#=> {"day": 15, "month": 3, "year": 2012, "hour": 10, "min": 0, "sec": 0}

EDIT:

Although not strictly necessary (see comments), the \b word boundary ensures that only zeros at the start of a number can be matched.
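
As a rough end-to-end sketch of the substitution above feeding `JSON.parse`: `strip_zeros_naive` is a hypothetical wrapper name, and the second call illustrates that the pattern cannot tell a number from digits inside a string value.

```ruby
require 'json'

# Hypothetical wrapper around the pattern from this answer.
def strip_zeros_naive(json)
  json.gsub(/\b0*(\d+)/, '\1')
end

raw  = '{"day": 15, "month": 03, "year": 2012, "hour": 10, "min": 00, "sec": 00}'
date = JSON.parse(strip_zeros_naive(raw))
date['month']  # => 3
date['year']   # => 2012

# Digits inside a string value are rewritten as well:
mangled = strip_zeros_naive('{"phone-number":"123-05-07"}')
# => {"phone-number":"123-5-7"}
```

For input that contains only numeric values, as in the question, this is enough; for arbitrary JSON-like text the string case needs separate handling.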

Tim Pietzcker
p0deje
  • Will change `"year": 2012` into `"year": 212` and remove all but last zeroes from all numbers – kirilloid Mar 14 '12 at 07:27
  • @kirilloid: I had thought so, too, but actually it doesn't do this. When matching `2012`, the `0*` part looks at the `2` and fails, so `\d+` matches `2012` and all is well. But I would make it explicit by adding a `\b` word boundary at the start regardless since this is somewhat counterintuitive. – Tim Pietzcker Mar 14 '12 at 07:30
  • Thanks for this. It actually worked. `test = JSON.parse('{"day": 15, "month": 03, "year": 2012, "hour": 10, "min": 00, "sec": 00}'.gsub(/0*(\d+)/, '\1'))`, and `test['year']` gives 2012. – nmenego Mar 14 '12 at 07:34
  • @TimPietzcker, I think you are right. I'll try to make my version more intuitive and post it back here. Thanks! – nmenego Mar 14 '12 at 07:36
  • @TimPietzcker Hmm. This really does work. I could still provide a *bad* example: `{"phone-number":"123-05-07"}` – kirilloid Mar 14 '12 at 07:50
  • @kirilloid: Word boundaries wouldn't help here either. – Tim Pietzcker Mar 14 '12 at 08:02
  • Guys, check out @pguardiario's answer. It does not involve regex. – nmenego Mar 14 '12 at 08:04
  • @nmenego http://stackoverflow.com/questions/1902744/when-is-eval-in-ruby-justified – kirilloid Mar 14 '12 at 08:24
1

Rather than bring in regex, maybe just eval it:

hash = eval '{"day": 15, "month": 03, "year": 2012, "hour": 10, "min": 00, "sec": 00}'.gsub(': ', ' => ')
pguardiario
  • Hey! It's you again! Your answer does work after testing it. Much simpler and cleaner! Thanks! – nmenego Mar 14 '12 at 08:00
  • You're welcome. Don't do it that way if there might be malicious code in there, though. – pguardiario Mar 14 '12 at 08:21
  • It's okay. I'll just be using this for practice's sake. But from what I have been reading, using eval is generally not a good idea. – nmenego Mar 14 '12 at 08:43
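
Apart from the security concern raised above, one more thing is worth knowing about the `eval` route: Ruby reads an integer literal with a leading zero as octal. The sketch below uses made-up sample values (not from the question) to show the effect; the values in the question happen to be unaffected because 03 and 00 mean the same in octal and decimal.

```ruby
# Same substitution as in the answer, applied to hypothetical sample values.
as_ruby = '{"a": 010, "b": 03}'.gsub(': ', ' => ')
octal   = eval(as_ruby)  # => {"a"=>8, "b"=>3}, because 010 is octal for 8

# 8 and 9 are not octal digits, so a value such as 08 cannot be evaluated.
begin
  eval('{"month": 08}'.gsub(': ', ' => '))
  failed = false
rescue SyntaxError
  failed = true
end
```

So a scraped date with `"month": 08` or `"day": 09` would raise rather than return a Hash.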
0
json.gsub(/(?<=[: ])0+(\d+,)/, "\\1")

Keep in mind that you may have JSON like { "someKey": "james bond: 007" }, which a zero-stripping regexp can turn into { "someKey": "james bond: 7" }.

json.gsub(/("\w+")\s*:\s*0+(\d+)\s*([,}])/, "\\1: \\2\\3")

looks better, but it is possible to "outsmart" this regexp too. Regexps aren't well suited to such problems.

OK, here is a non-regexp solution:

// jsonStr is the input string; undefined (start of input) counts as a non-digit
const isDigit = (c) => c !== undefined && c >= '0' && c <= '9';

let inString = false; // whether the current char is inside a string literal, i.e. would be highlighted as a string constant in an editor
const out = [];       // array/stack for output
let prevChar = null;  // previous char
for (const chr of jsonStr) { // iterate over the characters of the string
    if (chr === '"' && prevChar !== '\\') inString = !inString;
    if (!inString
    &&  !isDigit(out[out.length - 2])
    &&  prevChar === '0'
    &&  isDigit(chr)) { // i.e. last 3 chars match /(\D)0(\d)/ outside a string
        out[out.length - 1] = prevChar = chr; // make it \1\2
    } else {
        out.push(prevChar = chr); // just continue building the string
    }
}
const result = out.join('');

Treat it as a JavaScript-style sketch; it has not been tested.
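
Since the question is about Ruby, here is one way the same idea (leave string contents alone, strip zeros elsewhere) might look there. It is a sketch under assumptions, not a port of the loop above: it swaps the character-by-character scan for a single `gsub` whose pattern matches whole string literals first, and the names `STRING_LITERAL`, `PADDED_ZEROS`, and `strip_zeros_outside_strings` are made up for this example.

```ruby
# A JSON string literal: a quote, then escape pairs or ordinary chars, then a quote.
STRING_LITERAL = /"(?:\\.|[^"\\])*"/
# Zeros followed by another digit, and not preceded by a digit or a decimal point.
PADDED_ZEROS   = /(?<![\d.])0+(?=\d)/

# Hypothetical helper: removes leading zeros from numbers outside string literals.
def strip_zeros_outside_strings(json)
  json.gsub(/#{STRING_LITERAL}|#{PADDED_ZEROS}/) do |token|
    # String literals are matched first and returned unchanged;
    # anything else that matched is a run of leading zeros and is dropped.
    token.start_with?('"') ? token : ''
  end
end

raw     = '{"day": 15, "month": 03, "year": 2012, "hour": 10, "min": 00, "sec": 00}'
cleaned = strip_zeros_outside_strings(raw)
# => {"day": 15, "month": 3, "year": 2012, "hour": 10, "min": 0, "sec": 0}

bond = strip_zeros_outside_strings('{ "someKey": "james bond: 007" }')
# => { "someKey": "james bond: 007" }
```

Because strings are consumed as whole tokens, keys and values such as "james bond: 007" pass through untouched, while bare numbers like 03 and 00 are normalized.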

kirilloid