Regex capture and replace %20 after last forward slash

Question

I have a tsv file with a lot of HTML inside.

I need to replace every %20 that comes after the last forward slash in the href attribute of links that do not point to a .jpg. I'm trying to do this with Perl on the command line and need help with the regex.

I have tried a few regexes; this is the one in the live test (linked below):

<a\ [^>]*href="([^"]+(%20)+)[^\.jpg][^\/]"[^>]?>

It matches only one <a> tag and captures only the last occurrence of %20.

Here is a live test with a sample of the TSV.

I could have:

<a href="http://example.com/path/to-some-folder/another%20folder/one%20more520folder/34%20-%20kv34%20-%20tomba%20di%20thumtmose%20iii">text</a>

I must match every %20 after the last forward slash and replace each one with -.

or:

<a href="http://example.com/path/to-some-folder/another%20folder/one%20more520folder/uploads/2012/02/some%20folder/another%20folder/09%20antichi%20egizi%20-%20Tomba%20di%20Tutankhamen.jpg"> <img border="0" src="http://example.com/path/to-some-folder/another%20folder/one%20more520folder/uploads/2012/02/some%20folder/another%20folder/09%20antichi%20egizi%20-%20Tomba%20di%20Tutankhamen%20ante.jpg" width="80" height="92" alt="09 antichi egizi - Tomba di Tutankhamen" /></a>

I must not match href attributes that point to a .jpg, so the last example above needs to remain untouched.

I have also tried this one, which matches all the expected <a> tags, but I don't know how to capture only the %20 occurrences after the last slash so that I can then apply the replacement:

<a [^>]*href="([^"]+)[^\.jpg][^\/]"[^>]?>

https://regex101.com/r/cS3iB6/2

You need to use an HTML parser to extract only the href attribute for the a tag. [Obligatory link](https://stackoverflow.com/a/1732454/7552) — glenn jackman, Jun 02 '15 at 20:06
Hey karthik - If i learned correctly from your regex yesterday - It could be (?!\.jpg) instead. — Falt4rm, Jun 02 '15 at 20:20
@glenn jackman, thank you for the link, I'm in the case of Kaitlin Duck Sherwood who explain exactly what I need now. — lizardhr, Jun 02 '15 at 21:24

score 2 · Accepted Answer · answered Jun 02 '15 at 20:20

replace %20 after last forward slash of href attributes of non .jpg links

You can use the following to match:

%20(?=(?:(?!\.jpg">)[^>\/])*>)

And replace with -

See DEMO

karthik manchala


The basic flaw of this approach is that you do not check if you are in an `<a>` tag, you just check if there is no `.jpg` after the match. What if the extension is `png`? You need to extend the alternative list. The real way to get all the matches inside some markers is the `\G` operator. – Wiktor Stribiżew Jun 02 '15 at 21:10

@stribizhev You're right.. I did so because we are not validating the pattern.. just replacing within a pre-existing pattern.. so there is no issue with the assumption I made.. also.. OP wants this for `non .jpg` links.. for which I think my solution is good enough.. – karthik manchala Jun 02 '15 at 21:16
@stribizhev In this case I have only jpg, but it would be interesting to see how you would do it with the \G operator. – lizardhr Jun 02 '15 at 21:17

Wiktor Stribiżew · Answer 2 · 2015-06-02T21:24:51.243

To match %20 only inside some delimiters, you can also make use of the \G operator (see "Where You Left Off: The \G Assertion"):

You can use \G to specify the position just after the previous match.

The regex you can use is

(<a\b[^<]*?|(?<!^)\G)([^\/]*?)%20(?=(?![^\/]*\.jpg">)[^\/"]*">)

Replace with

\1\2-

Here is my demo

In Perl-like notation, that will look like

s/(<a\b[^<]*?|(?<!^)\G)([^\/]*?)%20(?=(?![^\/]*\.jpg">)[^\/"]*">)/$1$2-/g

edited Jun 02 '15 at 21:24

answered Jun 02 '15 at 21:08

Wiktor Stribiżew


I do not think this solution is good with large texts, but it is precise. – Wiktor Stribiżew Jun 02 '15 at 21:28
