
Sadly, it was announced that Google Reader will be shut down in the middle of the year. Since I have a large number of starred items in Google Reader, I'd like to back them up. This is possible via Google Takeout, which produces a file in JSON format.

Now I would like to extract all of the article URLs from this several-megabyte file.

At first I thought it would be best to use a generic URL regex, but it seems better to use a regex that matches just the article URLs. That way I avoid extracting other URLs that are not needed.

Here is a short example of what parts of the JSON file look like:

"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],

I just need the URLs you can find here:

 "canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],

Is anyone in the mood to suggest what a regex would have to look like to extract all these URLs?

The benefit would be a quick-and-dirty way to extract starred-item URLs from Google Reader so they can be imported into services like Pocket or Evernote.

Robert
  • Don't do this with regular expressions. **Regular expressions are not a magic wand that you wave at every problem that happens to involve text.** Use an existing, written, tested and debugged JSON library. – Andy Lester Apr 05 '13 at 15:50
  • By the way, I too am sad about the closure of google reader. I have found that http://www.bloglines.com/ is an acceptable replacement. While not as responsive as reader, it has a pretty similar interface otherwise, and can import the OPML directly from the Google Takeout XML output. – PaulProgrammer Apr 05 '13 at 17:13
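Following Andy Lester's advice to use a real JSON parser, here is a minimal sketch, driven from the shell via a `python3` one-liner. The `items`/`canonical` layout is an assumption based on the snippet in the question, and `starred.json` is just an illustrative file name:

```shell
# A tiny sample in the assumed Takeout shape: a top-level "items" array,
# each item holding a "canonical" list of {"href": ...} objects.
cat > starred.json <<'EOF'
{"items": [{"canonical": [{"href": "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"}]}]}
EOF

# Let a real JSON parser pull out the canonical hrefs.
python3 - starred.json <<'PY'
import json, sys

with open(sys.argv[1]) as f:
    data = json.load(f)

for item in data.get("items", []):
    for link in item.get("canonical", []):
        print(link["href"])
PY
```

Unlike a line-oriented regex, this keeps working even if the export's whitespace or key ordering changes.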

1 Answer


I know you asked about regex, but I think there's a better way to handle this problem. Multi-line regular expressions are a PITA, and in this case there's no need for that kind of brain damage.

I would start with plain grep rather than one big regex. The -A1 option says "print the matching line, plus one line after it":

grep -A1 "canonical" <file>

This will return lines like this:

"canonical" : [ {
    "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Then, I'd grep again for the href:

grep -A1 "canonical" <file> | grep "href"

giving

"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Now I can use awk to get just the URL:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' 

Using `" : "` as the field separator consumes the opening quote, so only a trailing quote is left on the URL:

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Now I just need to get rid of the extra quote:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'

That's it!

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
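For a quick end-to-end check, the full pipeline can be run against a small sample built from the snippet in the question (`sample.json` is just an illustrative name):

```shell
# Build a tiny sample in the same shape as the Takeout export,
# then run the complete pipeline against it.
cat > sample.json <<'EOF'
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],
EOF

grep -A1 "canonical" sample.json | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'
# prints: http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
```

Note that the `alternate` block's href is correctly skipped, because only the line directly after `canonical` survives the first grep.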
PaulProgrammer
  • great, that works like a charm. Could you please give me a hint how I can additionally add some characters to every item? – user2249443 Apr 05 '13 at 15:36
  • there are about 400 ways to do that too... perl might be my favorite: `... | perl -ne 'chomp; print "$_\n"'` – PaulProgrammer Apr 05 '13 at 17:00