I have a TSV file with a lot of HTML inside.
I need to replace %20 after the last forward slash in the href attributes of links that do not point to .jpg files.
I'm trying to do this with Perl on the command line, and I need help with the regex.
I have tried a few regexes; this is the one in the live test (linked below):
<a\ [^>]*href="([^"]+(%20)+)[^\.jpg][^\/]"[^>]?>
It matches only one <a> tag and captures only the last occurrence of %20.
Here is a live test with a sample of the TSV.
I could have:
<a href="http://example.com/path/to-some-folder/another%20folder/one%20more520folder/34%20-%20kv34%20-%20tomba%20di%20thumtmose%20iii">text</a>
I must match every %20 after the last forward slash and replace each one with a hyphen (-).
or:
<a href="http://example.com/path/to-some-folder/another%20folder/one%20more520folder/uploads/2012/02/some%20folder/another%20folder/09%20antichi%20egizi%20-%20Tomba%20di%20Tutankhamen.jpg"> <img border="0" src="http://example.com/path/to-some-folder/another%20folder/one%20more520folder/uploads/2012/02/some%20folder/another%20folder/09%20antichi%20egizi%20-%20Tomba%20di%20Tutankhamen%20ante.jpg" width="80" height="92" alt="09 antichi egizi - Tomba di Tutankhamen" /></a>
I must not match href attributes that end in .jpg, so the last example above needs to remain untouched.
I have also tried this one, which matches all the expected <a> tags, but I don't know how to capture only the %20 occurrences after the last slash so that I can then apply the replacement:
<a [^>]*href="([^"]+)[^\.jpg][^\/]"[^>]?>