
I have the following command to fetch a JSON document on Unix:

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json

It gives me output in the following format (with different results each time, obviously):

{
 "kind": "...",
 "data": {
 "modhash": "",
 "whitelist_status": "...",
 "children": [
 e1,
 e2,
 e3,
 ...
 ],
 "after": "...",
 "before": "..."
 }
}

where each element of the array children is an object structured as follows:

{
 "kind": "...",
 "data": {
 ...
 }
}

Here is an example of a complete .json response (the body is too long to post directly): https://pastebin.com/20p4kk3u

I need to print the complete data object inside each element of the children array. I know I need to pipe at least twice, first to get children [...] and then to get data {...} from there, and this is what I have so far:

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)'

I'm new to regular expressions, so I'm not sure how to handle having brackets or curly braces within elements of what I'm grepping. The line above prints nothing to the shell and I'm not sure why. Any help is appreciated.

Anthony B
  • Are you open to using third-party utilities? I generally use the jq binary to parse JSON data easily. For your requirement, you just need to pass the JSON data to jq, which has its own query language: cat /tmp/data | jq '.data.children | .[]' (here /tmp/data contains the complete JSON). With such utilities you can get the work done with shorter queries and use extra features such as raw output. – akskap Oct 21 '17 at 18:50
  • Well, the end goal of acquiring the data{} isn't the sole objective; this time it just happens to be a .json format, but I'd like to know how to do this via regex for any files. – Anthony B Oct 21 '17 at 18:55
  • Is regex your only option? In my opinion regex is not the right tool for the job here. Would you consider something like python using the json package? – Perennial Feb 21 '19 at 03:13
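A minimal sketch of the Python json approach suggested in the comment above. The sample document and the function name child_data are made up for illustration; only the kind / data / children layout is taken from the question.

```python
import json

# Hypothetical sample shaped like the listing in the question.
# A real response has many more fields; these values are invented.
SAMPLE = """
{
  "kind": "sample-listing",
  "data": {
    "modhash": "",
    "children": [
      {"kind": "sample", "data": {"title": "first", "tags": ["a", "b"]}},
      {"kind": "sample", "data": {"title": "second", "media": {"type": null}}}
    ],
    "after": null,
    "before": null
  }
}
"""


def child_data(text):
    """Return the "data" object of every element of data.children."""
    listing = json.loads(text)
    return [child["data"] for child in listing["data"]["children"]]


if __name__ == "__main__":
    # One JSON object per line, nested braces and brackets included.
    for data in child_data(SAMPLE):
        print(json.dumps(data))
```

Because the parser tracks nesting itself, braces and brackets inside an element need no special handling. Reading the text from sys.stdin instead of SAMPLE would let a script like this sit at the end of the wget pipe.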

2 Answers

1

If you want to get the children array, try this, but I'm not sure it's what you are looking for.

wget -O - https://www.reddit.com/r/NetflixBestOf/.json | sed -n '/children/,/],/p'
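For readers unfamiliar with sed address ranges, here is a rough Python sketch of what '/children/,/],/p' does line by line. It is a simplification that uses plain substring tests instead of regular expressions, and the name sed_range and the sample lines are made up for illustration.

```python
def sed_range(lines, start, end):
    """Mimic sed -n '/start/,/end/p' using substring tests.

    Printing begins on a line containing `start` and continues through
    the next later line containing `end`.
    """
    out = []
    inside = False
    for line in lines:
        if inside:
            out.append(line)
            if end in line:
                inside = False  # range closed; a later start may reopen it
        elif start in line:
            out.append(line)
            inside = True  # the end test only applies to following lines
    return out


# Hypothetical pretty-printed input, one JSON token group per line.
PRETTY = [
    "{",
    '  "data": {',
    '    "children": [',
    '      "e1",',
    '      "e2"',
    "    ],",
    '    "after": "x"',
    "  }",
    "}",
]

# Hypothetical input where the whole document is on a single line.
ONE_LINE = ['{"data": {"children": ["e1", "e2"], "after": "x"}}']

if __name__ == "__main__":
    print("\n".join(sed_range(PRETTY, "children", "],")))
    print("\n".join(sed_range(ONE_LINE, "children", "],")))
```

The range is line based, so it only isolates the array when the JSON is spread over several lines. If the whole response arrives as one line, that entire line is printed.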
ctac_
1

Code

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)'

Some regex basics

* == zero or more times
+ == one or more times
? == zero or one time (placed after another quantifier, as in .+?, it makes that quantifier lazy)
\s == a whitespace character: space, tab, carriage return, newline, vertical tab, or form feed
\w == a word character: a letter from A to Z (upper or lower case), a digit from 0 to 9, or an underscore (_)
\d == a digit from 0 to 9
\r == carriage return
\n == newline character (line feed)
\ == escapes a special character so it is read as a literal character
[...] == a character class. Example: [abc] matches a or b or c
(?=...) == a positive lookahead, a type of zero-width assertion. The match must be followed by whatever is inside the parentheses, but that part is not included in the match.
\K == resets the start of the match at this position, so anything matched before it is left out of the output.
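The tokens above can be tried out in a small Python sketch. This uses Python's re module, not grep; re does not support \K, so the last entry uses a capturing group to get a similar effect. The function name demo and all sample strings are made up for illustration.

```python
import re


def demo():
    """Collect small examples of the regex tokens listed above."""
    return {
        # * accepts zero repetitions, + needs at least one, ? allows at most one
        "star": bool(re.fullmatch(r"ab*", "a")),
        "plus": bool(re.fullmatch(r"ab+", "a")),
        "optional": bool(re.fullmatch(r"ab?", "abb")),
        # \s matches spaces, tabs and newlines alike
        "whitespace": re.split(r"\s+", "a \t b\nc"),
        # \w covers letters, digits and underscore, but not the hyphen
        "word": re.findall(r"\w+", "foo_bar baz-1"),
        "digit": re.findall(r"\d+", "id 42, id 7"),
        # a backslash turns the special character [ into a literal one
        "escaped": re.findall(r"\[", "a[b]"),
        "char_class": re.findall(r"[abc]", "cabbage"),
        # the lookahead requires ':' after the word but leaves it out of the match
        "lookahead": re.findall(r"\w+(?=:)", "key: value"),
        # greedy takes as much as possible, lazy as little as possible
        "greedy": re.search(r"\{.+\}", "{a}{b}").group(),
        "lazy": re.search(r"\{.+?\}", "{a}{b}").group(),
        # stand-in for \K: only the captured group is returned
        "group_instead_of_K": re.findall(r'"data"\s*:\s*(\d+)', '"data" : 15'),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(name, "=>", value)
```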

Anyway you can read more about regex from here: Regex Tutorial

Now I can try to explain the code

wget downloads the source.
tr removes all line feeds and carriage returns, so the whole output is on one line and grep can handle it.
grep -o prints only the matching part of the line.
grep -P enables Perl-compatible regular expressions.

So here
grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])'
we have said:
match the literal text "children"
zero or more whitespace characters
:
zero or more whitespace characters
\[ escaped, so it is a literal [ and not the start of a character class
zero or more whitespace characters
\K force the reported match to start from here
( open submatch
{.+?} everything in braces (the braces are included because they come after the start of the submatch; .+? is lazy, so it matches as few characters as possible. See greedy vs. non-greedy in the regex tutorial to understand how .+? works)
) close submatch
(?=\s*\]) stop the submatch where zero or more whitespace characters followed by a literal ] are found, without including them in the submatch.
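To see how the lazy {.+?} interacts with nested brackets, here is a small Python sketch of the same pattern. It is an approximation: Python's re has no \K, so a capturing group stands in for it, and the braces are escaped. The sample strings are invented and far simpler than a real response.

```python
import re

# Same idea as the grep pattern; the group replaces \K.
CHILDREN = re.compile(r'"children"\s*:\s*\[\s*(\{.+?\})(?=\s*\])')


def first_children_match(text):
    """Return what the pattern captures after "children": [, or None."""
    match = CHILDREN.search(text)
    return match.group(1) if match else None


# Hypothetical input without any array inside the elements.
FLAT = (
    '{"children": [{"kind": "k", "data": {"id": 1}}, '
    '{"kind": "k", "data": {"id": 2}}], "after": null}'
)

# Hypothetical input where the first element contains its own array.
NESTED = (
    '{"children": [{"kind": "k", "data": {"id": 1, "tags": [{"t": "x"}]}}, '
    '{"kind": "k", "data": {"id": 2}}]}'
)

if __name__ == "__main__":
    print(first_children_match(FLAT))    # both elements, up to the closing ]
    print(first_children_match(NESTED))  # cut off at the inner }]
```

In the flat sample the capture runs to the end of the array. In the nested sample it stops at the first } that is followed by ], which belongs to the inner array, so the result is truncated. A regex of this kind cannot count matching braces, which may be one reason the pipeline gives unexpected results on real data.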
Darby_Crash
  • Thanks for the detailed explanation, was very helpful. Followup question, what would the difference be if one was to use egrep without perl regex syntax? – Anthony B Oct 21 '17 at 20:18
  • Take a look here: https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions – Darby_Crash Oct 21 '17 at 20:42