I am learning how to use regex lookahead and lookbehind.

I want to extract a JSON value from text like this:

{"html":"\n\t\t\t\t<table class=\"table\">"} 

I am using a regex in C# like this:

 Regex.Match(text, "\"html\":\"([^(?<!\\\\)\"]*)").Groups[1].Value

Or

 Regex.Match(text, "\"html\":\"((?<!\\\\$)[^\"]*)").Groups[1].Value

But it's not working at all. Can I get this value using a C# regex?

Barun
  • Why not use `JavaScriptSerializer`? Why regexp? – Amadan Nov 15 '13 at 19:08
  • @Amadan Because I am learning Regex lookahead and lookbehind. – Barun Nov 15 '13 at 19:10
  • Lookahead and lookbehind are called zero-length assertions. This means they report whether a match was found at a position but do not consume what they matched. http://www.regular-expressions.info/lookaround.html – Mike Cheel Nov 15 '13 at 19:16
  • @MikeCheel Exactly! I want to make sure that [^"]* does not end with a "\" character – Barun Nov 15 '13 at 19:18

2 Answers

A dedicated JSON parser is exactly the tool you need for parsing JSON objects.

That said, since you are learning regex, here is an example of retrieving JSON data:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // {"html":"\n\t\t\t\t<table class=\"table\">"}
        var s = "{\"html\":\"\n\t\t\t\t<table class=\\\"table\\\">\"}";
        Console.WriteLine("\"{0}\"", ParseJson("html", s).First());
        // You might want to Trim() the result because of the leading \n\t\t\t\t.
    }

    // Returns the string value of every {"key":"value"} object found in input.
    static private IEnumerable<string> ParseJson(string key, string input)
    {
        // Both pieces must be verbatim (@) strings; inside them "" is a literal quote.
        Regex r = new Regex(@"\{\""" + key + @"\""\:\""(.*?)(?<!\\)\""\}", RegexOptions.Singleline);
        return r.Matches(input).Cast<Match>().Select(T => T.Groups[1].Value);
    }
}

A few notes:

  1. Use (?<!\\) as a negative lookbehind so that the closing double quote is one that is not preceded by a backslash.
  2. Use RegexOptions.Singleline so that the dot (.) also matches the newline character (\n); without it, the dot matches every character except \n.
  3. Do not parse HTML with regex :)
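
To make notes 1 and 2 concrete, here is a minimal sketch. The class name LookbehindDemo and the helper ExtractHtml are hypothetical names chosen for illustration, and the pattern is a simplified variant of the one above (it does not require the surrounding braces).

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Hypothetical helper: returns the text between "html":" and the first
    // double quote that is not directly preceded by a backslash,
    // or null when nothing matches.
    public static string ExtractHtml(string input, RegexOptions options)
    {
        // Verbatim string; the regex itself is: "html":"(.*?)(?<!\\)"
        Match m = Regex.Match(input, @"""html"":""(.*?)(?<!\\)""", options);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Raw text: {"html":"<table class=\"table\">"}
        string oneLine = "{\"html\":\"<table class=\\\"table\\\">\"}";
        // The lookbehind skips both \" pairs and stops at the final quote.
        Console.WriteLine(ExtractHtml(oneLine, RegexOptions.None));

        // Same value, but with a real line break inside it.
        string twoLines = "{\"html\":\"<table\nclass=\\\"table\\\">\"}";
        // Without Singleline the dot cannot cross the \n, so there is no match.
        Console.WriteLine(ExtractHtml(twoLines, RegexOptions.None) == null);
        // With Singleline the dot matches \n and the value is found.
        Console.WriteLine(ExtractHtml(twoLines, RegexOptions.Singleline) != null);
    }
}
```

The lookbehind consumes nothing: it only checks the character just before the candidate closing quote, which is why the captured group still ends with the ">" and not with a backslash.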

AgentFire

/"html":"(?:[^"]*(\\"[^"]*(?<!\\)))*"/

        -                              opening quote
            -----    -----         -     then any number of non-quotes
                 ----              -     ... separated by an escaped quote
                          -------        ... where the non-quote string doesn't
                                              end in a backslash
                                    -    closing quote   

should be a good-enough approximation for this case.

(I've written it in the standard regexp way; remember to escape backslashes and quotes for the C# string literal.)
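
As a rough sketch of that escaping step, the same pattern can be written either as a verbatim string (double every quote) or as a regular string (escape every quote and every backslash). The class name EscapedPatternDemo and the method MatchHtmlPair are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedPatternDemo
{
    // The regex from above: "html":"(?:[^"]*(\\"[^"]*(?<!\\)))*"
    // Verbatim literal: each " is doubled, backslashes stay as they are.
    public const string Verbatim = @"""html"":""(?:[^""]*(\\""[^""]*(?<!\\)))*""";
    // Regular literal: each " becomes \" and each \ becomes \\.
    public const string Regular = "\"html\":\"(?:[^\"]*(\\\\\"[^\"]*(?<!\\\\)))*\"";

    // Hypothetical helper: returns the whole matched "html":"..." pair,
    // or null when the pattern does not match.
    public static string MatchHtmlPair(string input)
    {
        Match m = Regex.Match(input, Verbatim);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Both literals produce the same pattern text.
        Console.WriteLine(Verbatim == Regular);

        // Raw text: {"html":"\n\t\t\t\t<table class=\"table\">"}
        string json = "{\"html\":\"\\n\\t\\t\\t\\t<table class=\\\"table\\\">\"}";
        Console.WriteLine(MatchHtmlPair(json));
    }
}
```

Note that the pattern has no group around the value, so this sketch returns the whole pair including the key. As written, it also needs at least one escaped quote inside the value: an input such as {"html":"abc"} does not match.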

Amadan
  • Actually my json value is very big. And this code is not working there. I just posted here the beginning of the json. – Barun Nov 15 '13 at 19:40
  • As I said, it works for this case. I can imagine it wouldn't work if you had an escaped backslash before an unescaped quote, for example, or non-string values, or values whose keys aren't `"html"`. Which is why, for real data, you use a parser with more state than a regular automaton, and preferably something already written and tested (like `JavaScriptSerializer`). Or at least use a language with a powerful "regex", like [Perl](http://www.perlmonks.org/?node_id=995856), which implements recursion, and is thus not "regular" any more. – Amadan Nov 15 '13 at 20:05
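
The failure case named in the last comment (an escaped backslash right before the closing quote) can be reproduced with a short sketch. The second pattern below is not from either answer: it is a commonly used alternative that consumes escape sequences as pairs instead of using a lookbehind, shown here only for comparison. The class and method names are hypothetical, and it is still not a replacement for a real JSON parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapedBackslashDemo
{
    // Lookbehind approach: stop at the first quote not preceded by a backslash.
    public static string WithLookbehind(string input)
    {
        Match m = Regex.Match(input, @"""html"":""(.*?)(?<!\\)""", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Alternative (an assumption, not from the answers): the regex is
    // "html":"((?:[^"\\]|\\.)*)" so every backslash is consumed together
    // with the character it escapes.
    public static string WithEscapePairs(string input)
    {
        Match m = Regex.Match(input, @"""html"":""((?:[^""\\]|\\.)*)""", RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        // Raw text: {"html":"C:\\","next":"x"}
        // The value ends with an escaped backslash, then the real closing quote.
        string json = "{\"html\":\"C:\\\\\",\"next\":\"x\"}";

        // The lookbehind rejects the real closing quote and runs into the next key.
        Console.WriteLine(WithLookbehind(json));
        // The pair-based pattern stops at the correct quote.
        Console.WriteLine(WithEscapePairs(json));
    }
}
```

The lookbehind only looks at one character, so it cannot tell an escaping backslash from an escaped one; consuming escapes in pairs avoids that particular problem.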