The HTML I'm dealing with looks a bit like this:

<a class="title may-blank" data-event-action="title" href="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" tabindex="1" data-href-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/" data-inbound-url="/r/gaming/comments/6t8dj0/we_can_play_singleplayer_games_off_the_internet/?utm_content=title&amp;utm_medium=hot&amp;utm_source=reddit&amp;utm_name=frontpage" rel="">We can play singleplayer games OFF THE INTERNET? Are they seriously that out of touch to advertise this?</a>

There are multiple lines like that.

I only want the value between the quotes in href="http://xxxxxxxx" and the link text that follows rel="">yyyyyyyyyy; the rest is unnecessary.

I'd like the output to look like this, with one line for every block above:

<a href="http://xxxxxxxx" rel="">yyyyyyyyyy</a>

Any idea how I would go about doing this?

Fruchtzwerg
pxssy

2 Answers

Here is a ten-second solution. It may be a little brittle, but it should work, assuming the HTML is in a file called html.txt:

# Anchor on " href=" so the greedy match does not end inside data-href-url.
sed -e 's/class=.* href=/href=/' -e 's/ tabindex=.* rel=/ rel=/' html.txt
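
Since the sed approach depends on the attributes appearing in a fixed order, a less brittle alternative might be to let an HTML parser read them. The following is only a minimal sketch using Python's standard html.parser module; the names LinkExtractor and simplify_links are made up for illustration, and the href is kept exactly as it appears in the input (a relative path in the example from the question).

```python
from html import escape
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def simplify_links(html_text):
    """Return one reduced <a href="..." rel="">text</a> string per link."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    # The parser unescapes entities, so escape them again on output.
    return [
        f'<a href="{escape(href)}" rel="">{escape(text, quote=False)}</a>'
        for href, text in parser.links
    ]


if __name__ == "__main__":
    with open("html.txt", encoding="utf-8") as handle:
        print("\n".join(simplify_links(handle.read())))
```

Run as a script, it reads html.txt and prints one reduced link per line, regardless of which other attributes the original tags carry.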

James

Your HTML example leads me to the following pattern for capturing the required values:

<a class=\"(.*) href=\"/(.*)\" tabindex=(.*) rel=\"\">(.*)</a>

Replace the matches by using the following pattern:

<a href="http://$2" rel="">$4</a>

You can try it out at regexe; for me it works as expected.
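
As a rough illustration of how the pattern and replacement above could be applied line by line, here is a small Python sketch. It assumes Python's re module, where backreferences in the replacement are written \2 and \4 rather than $2 and $4; the function name simplify and the shortened sample line are made up for the example. Note that, exactly as in the replacement above, the leading slash of the path is consumed and http:// is prepended directly, so a real host name would still have to be added by hand.

```python
import re

# Pattern from the answer, with the unnecessary quote escapes dropped.
# Group 2 is the href path without its leading slash; group 4 is the link text.
PATTERN = re.compile(r'<a class="(.*) href="/(.*)" tabindex=(.*) rel="">(.*)</a>')


def simplify(line):
    """Reduce one <a ...> line to <a href="http://..." rel="">text</a>."""
    # Lines that do not match the pattern are returned unchanged.
    return PATTERN.sub(r'<a href="http://\2" rel="">\4</a>', line)


# Hypothetical, shortened version of the line from the question.
sample = (
    '<a class="title may-blank" data-event-action="title" '
    'href="/r/gaming/comments/6t8dj0/example/" tabindex="1" '
    'data-href-url="/r/gaming/comments/6t8dj0/example/" '
    'data-inbound-url="/r/gaming/comments/6t8dj0/example/?utm_content=title" '
    'rel="">Example title</a>'
)

print(simplify(sample))
```

Because every (.*) is greedy and the pattern relies on the exact attribute order, this only works when each link sits on its own line and looks like the example.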

Fruchtzwerg