I have a tsv file with a lot of HTML inside.

I need to replace every %20 that occurs after the last forward slash in the href attribute of links that do not point to .jpg files. I'm using Perl on the command line and need help with the regex.

I have tried several regexes; this is the one in the live test (linked below):

<a\ [^>]*href="([^"]+(%20)+)[^\.jpg][^\/]"[^>]?>

It matches only one <a> tag and captures only the last occurrence of %20.

Here is a live test with a sample of the TSV.

I could have:

<a href="http://example.com/path/to-some-folder/another%20folder/one%20more520folder/34%20-%20kv34%20-%20tomba%20di%20thumtmose%20iii">text</a>

Here I must match every %20 after the last forward slash and replace each one with -.

or:

<a href="http://example.com/path/to-some-folder/another%20folder/one%20more520folder/uploads/2012/02/some%20folder/another%20folder/09%20antichi%20egizi%20-%20Tomba%20di%20Tutankhamen.jpg"> <img border="0" src="http://example.com/path/to-some-folder/another%20folder/one%20more520folder/uploads/2012/02/some%20folder/another%20folder/09%20antichi%20egizi%20-%20Tomba%20di%20Tutankhamen%20ante.jpg" width="80" height="92" alt="09 antichi egizi - Tomba di Tutankhamen" /></a>

I must not match href attributes that point to .jpg files, so the last example above needs to remain untouched.

I have also tried this one, which matches all the expected <a> tags, but I don't know how to capture only the %20 occurrences after the last slash so that I can then apply the replacement:

<a [^>]*href="([^"]+)[^\.jpg][^\/]"[^>]?>

https://regex101.com/r/cS3iB6/2

lizardhr

2 Answers

replace %20 after last forward slash of href attributes of non .jpg links

You can use the following to match:

%20(?=(?:(?!\.jpg">)[^>\/])*>)

And replace with -

See DEMO

karthik manchala

To match %20 only inside certain delimiters, you can also use the \G assertion (see "Where You Left Off: The \G Assertion"):

You can use \G to specify the position just after the previous match.

The regex you can use is

(<a\b[^<]*?|(?<!^)\G)([^\/]*?)%20(?=(?![^\/]*\.jpg">)[^\/"]*">)

Replace with

\1\2-

Here is my demo

In Perl notation, where capture groups in the replacement are written $1 and $2, that looks like

s/(<a\b[^<]*?|(?<!^)\G)([^\/]*?)%20(?=(?![^\/]*\.jpg">)[^\/"]*">)/$1$2-/g
Wiktor Stribiżew