4

I'm trying to use a PCRE regular expression to extract some JSON. I'm using a version of MariaDB which does not have JSON functions but does have REGEX functions.

My string is:

{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush"],"carriers":[],"exclude_carriers":[]}

I want to grab the contents of category. I'd like a matching group that contains 2 items, Jebb and Bush (or however many items are in the array).

I've tried this pattern but it only matches the first occurrence: /(?<=category":\[).([^"]*).*?(?=\])/g

GGGforce
  • 634
  • 1
  • 8
  • 19
  • 4
    One wonders why you're pushing JSON to the DB if you need access to some of the underlying contents within the DB itself? Why not push the data you actually need? – James Thorpe Mar 30 '16 at 11:48
  • One business intelligence type wonders the same of his developer friends. Any help with the regex pattern would be great! – GGGforce Mar 30 '16 at 11:54
  • 1
    You might want to tag this with mariadb as well - it may be more about how you're using the functions within that context than the regex itself – James Thorpe Mar 30 '16 at 11:57
  • 3
    Perl, PHP, JS, etc, etc, have routines for parsing JSON. Do it in application code. – Rick James Mar 30 '16 at 23:48
  • 4
    Yuck, regex for JSON – Ruan Mendes Mar 31 '16 at 12:56
  • @ GGGforce Check my answer below for a single regex solution to your problem, working perfectly fine for indefinite number of entries. – Redu Mar 31 '16 at 13:14
  • I will @Redu, just after lunch! ;) – GGGforce Mar 31 '16 at 13:16
  • I'm accepting ClasGs's answer because in MariaDB I need 1 match with capturing groups instead of multiple matches. ClasG: https://regex101.com/r/jD1rN6/1 Redu: https://regex101.com/r/rU6nK8/1 – GGGforce Mar 31 '16 at 14:46

7 Answers7

3

Does this match your needs? It should match the category array regardless of its size.

"category":(\[.*?\])

regex101 example

Lars de Bruijn
  • 1,430
  • 10
  • 15
2

JSON not a regular language. Since it allows arbitrary embedding of balanced
delimiters, it must be at least context-free.

For example, consider an array of arrays of arrays:

[ [ [ 1, 2], [2, 3] ] , [ [ 3, 4], [ 4, 5] ] ]
Clearly you couldn't parse that with true regular expressions.
See This Topic: Regex for parsing single key: values out of JSON in Javascript Maybe Helpful for you.

Community
  • 1
  • 1
Mahmoud.Eskandari
  • 1,460
  • 3
  • 20
  • 32
2

Using a set of non-capturing group you can extract a predefined json array

regex answer: (?:\"category\":)(?:\[)(.*)(?:\"\])

That expression extract "category":["Jebb","Bush"], so access the first group to extract the array, sample java code:

Pattern pattern = Pattern.compile("(?:\"category\":)(?:\\[)(.*)(?:\"\\])");        
String body = "{\"device_types\":[\"smartphone\"],\"isps\":[\"a\",\"B\"],\"network_types\":[],\"countries\":[],\"category\":[\"Jebb\",\"Bush\"],\"carriers\":[],\"exclude_carriers\":[]}";
Matcher matcher = pattern.matcher(body);
assertThat(matcher.find(), is(true));
String[] categories = matcher.group(1).replaceAll("\"","").split(",");

assertThat(categories.length, is(2));
assertThat(categories[0], is("Jebb"));
assertThat(categories[1], is("Bush"));
Alisson Gomes
  • 1,029
  • 14
  • 30
0

There are many ways. One sloppy way to do it is /([A-Z])\w+/g

Please try it on your console like

var data = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush"],"carriers":[],"exclude_carriers":[]}',
     res = [];
data.match(/([A-Z])\w+/g); // ["Jebb", "Bush"]

OK the above was pretty sloppy however a solid single regex solution to extract every single element regardless of the number, one by one and to place them in an array (res) is the following...

var rex = /[",]+(\w*)(?=[",\w]*"],"carriers)/g,
    str = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush","Donald","Trump"],"carriers":[],"exclude_carriers":[]}',
    arr = [],
    res = [];
while ((arr = rex.exec(str)) !== null) {
  res.push(arr[1]); // <- ["Jebb", "Bush", "Donald", "Trump"]
}

Check it out @ http://regexr.com/3d4ee

OK lets do it. I have come up with a devilish idea. If JS had look-behinds this could have been done simply by reversing the applied logic in the previous example where i had used a look-forward. Alas, there aren't... So i decided to turn the world the other way around. Check this out.

String.prototype.reverse = function(){
                             return this.split("").reverse().join("");
                           };
var rex = /[",]+(\w*)(?=[",\w]*"\[:"yrogetac)/g,
    str = '{"device_types":["smartphone"],"isps":["a","B"],"network_types":[],"countries":[],"category":["Jebb","Bush","Donald","Trump"],"carriers":[],"exclude_carriers":[]}',
    rev = str.reverse();
    arr = [],
    res = [];
    while ((arr = rex.exec(rev)) !== null) {
      res.push(arr[1].reverse()); // <- ["Trump", "Donald", "Bush", "Jebb"]
    }
res.reverse(); // <- ["Jebb", "Bush", "Donald", "Trump"]

Just use your console to confirm.

Redu
  • 25,060
  • 6
  • 56
  • 76
  • Array items within the other objects might start with a uppercase letter, such as Smartphone. This solution seems a bit brittle. – GGGforce Mar 30 '16 at 12:56
  • The sloppy one was pretty easy to figure out but then to come up with a single regex to sort out the array you needed was harder than i have imagined at first. In fact it was a real bitch... Introducing `/[",]+(\w*)(?=[",\w]*"],"carriers)/g` – Redu Mar 31 '16 at 12:24
  • unfortunately this pattern doesn't grab whatever is in the `categories` field but instead whatever is just before the `carriers` field. @ClasG's answer, while not as dynamic as yours, specifically grabs the `category` field. I can't be sure `category` is always right in front of `carriers`. – GGGforce Mar 31 '16 at 13:39
  • Well... i thought your JSON structure was fixed. So whatever... multiple matches are not that easy in this one. In JS there are no lookbehinds which could be handy. I have to think about it. I'll be back.. hasta la vista... baby..! – Redu Mar 31 '16 at 13:49
  • What I'm thinking is that using SQL I can strip everything in the string before `category`so my resulting string is `category":["Jebb","Bush","Donald","Trump"],"carriers":["a","B"],"exclude_carriers":["a","B"]}`. Now I need a way to stop capturing in the right place. Some variation of this: `/[",]+(\w*)(?=[",\w]*"])/g` – GGGforce Mar 31 '16 at 13:54
  • I'm accepting ClasG's answer due to the way MariaDB works, I'll post an explanation briefly. Thanks for your help, Redu. – GGGforce Mar 31 '16 at 14:41
  • @GGGforce You are welcome.. but how can you make sure how many categories will be listed.. What if you hit more than 5 categories..? This regex's got under my skin... i will work on it a little longer... – Redu Mar 31 '16 at 15:01
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/107870/discussion-between-gggforce-and-redu). – GGGforce Mar 31 '16 at 15:05
  • I am pretty sure this time you should be satisfied. It's done..! – Redu Mar 31 '16 at 19:52
0

In c++ you can do it like this

bool foundmatch = false;
try {
    std::regex re("\"([a-zA-Z]+)\"*.:*.\\[[^\\]\r\n]+\\]");
    foundmatch = std::regex_search(subject, re);
} catch (std::regex_error& e) {
    // Syntax error in the regular expression
}
Peter
  • 1,124
  • 14
  • 17
0
(?<=category":\[).[^\]]*
Yu Va
  • 1
-1

If the number of items in the array is limited (and manageable), you could define it with a finite number of optional items. Like this one with a maximum of 5 items:

"category":\["([^"]*)"(?:,"([^"]*)"(?:,"([^"]*)"(?:,"([^"]*)"(?:,"([^"]*)")?)?)?)?

regex101 example here.

Regards.

SamWhan
  • 8,296
  • 1
  • 18
  • 45
  • So far this is the only pattern that is working for me. I'll wait until tomorrow to see if some absolute genius can top it then mark as answer. – GGGforce Mar 30 '16 at 13:28