
Sadly, it was announced that Google Reader will be shut down in the middle of the year. Since I have a large number of starred items in Google Reader, I'd like to back them up. This is possible via Google Takeout, which produces a file in JSON format.

Now I would like to extract all of the article URLs from this several-megabyte file.

At first I thought it would be best to use a generic URL regex, but it seems better to use a regex that matches just the article URLs. That way I avoid extracting other URLs that are not needed.

Here is a short example of what parts of the JSON file look like:

"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],

I just need the URLs you can find here:

 "canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],

Is anyone in the mood to suggest what a regex would have to look like to extract all these URLs?

The benefit would be a quick-and-dirty way to extract starred-item URLs from Google Reader so they can be imported into services like Pocket or Evernote.

Robert
  • Don't do this with regular expressions. **Regular expressions are not a magic wand that you wave at every problem that happens to involve text.** Use an existing, written, tested and debugged JSON library. – Andy Lester Apr 05 '13 at 15:50
  • By the way, I too am sad about the closure of google reader. I have found that http://www.bloglines.com/ is an acceptable replacement. While not as responsive as reader, it has a pretty similar interface otherwise, and can import the OPML directly from the Google Takeout XML output. – PaulProgrammer Apr 05 '13 at 17:13
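Following Andy Lester's advice to use a real JSON parser, here is a minimal sketch, driven from the shell via a `python3` one-liner. The `items`/`canonical` layout is an assumption based on the snippet in the question, and `starred.json` is just an illustrative file name:

```shell
# A tiny sample in the assumed Takeout shape: a top-level "items" array,
# each item holding a "canonical" list of {"href": ...} objects.
cat > starred.json <<'EOF'
{"items": [{"canonical": [{"href": "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"}]}]}
EOF

# Let a real JSON parser pull out the canonical hrefs.
python3 - starred.json <<'PY'
import json, sys

with open(sys.argv[1]) as f:
    data = json.load(f)

for item in data.get("items", []):
    for link in item.get("canonical", []):
        print(link["href"])
PY
```

Unlike a line-oriented regex, this keeps working even if the export's whitespace or key ordering changes.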

1 Answer


I know you asked about regex, but I think there's a better way to handle this problem. Multi-line regular expressions are a PITA, and in this case there's no need for that kind of brain damage.

I would start with plain grep rather than one big regex. The -A1 option says "print the matching line, plus one line after it":

grep -A1 "canonical" <file>

This will return lines like this:

"canonical" : [ {
    "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Then, I'd grep again for the href:

grep -A1 "canonical" <file> | grep "href"

giving

"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Now I can use awk to get just the URL:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' 

Using `" : "` as the field separator consumes the opening quote, so only a trailing quote is left on the URL:

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Now I just need to get rid of the extra quote:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'

That's it!

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
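For a quick end-to-end check, the full pipeline can be run against a small sample built from the snippet in the question (`sample.json` is just an illustrative name):

```shell
# Build a tiny sample in the same shape as the Takeout export,
# then run the complete pipeline against it.
cat > sample.json <<'EOF'
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],
EOF

grep -A1 "canonical" sample.json | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'
# prints: http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
```

Note that the `alternate` block's href is correctly skipped, because only the line directly after `canonical` survives the first grep.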
PaulProgrammer
  • great, that works like a charm. Could you please give me a hint how I can additionally add some characters to every item? – user2249443 Apr 05 '13 at 15:36
  • there are about 400 ways to do that too... perl might be my favorite: `... | perl -ne 'chomp; print "$_\n"'` – PaulProgrammer Apr 05 '13 at 17:00