
I want to use a regex to handle the CITE tag in two different ways, depending on whether it is inside a URL or not:

  • Remove the tag if it is inside a URL (starting with www or http(s));
  • Leave the tag intact if it is not inside a URL.

This is the string I want to operate on:

www.os<cite>map</cite>s.ordnancesurvey.co.uk/os<cite>map</cite>s/
and the normal text <cite>map</cite> here and again <cite>map</cite> here
http://os<cite>map</cite>s.ordnancesurvey.co.uk/os<cite>map</cite>s/
and the normal text <cite>map</cite> here and again <cite>map</cite> here

I have been using this expression:

$this_record = preg_replace('/((www.)|(https?:\/\/))([^\s]*?)(<cite>([^\s]*?)<\/cite>)([^\s]*)/', '$2$3$4$6$7', $this_record);

This works, but only for the FIRST set of tags in each URL, and results in:

www.osmaps.ordnancesurvey.co.uk/os<cite>map</cite>s/
and the normal text <cite>map</cite> here and again <cite>map</cite> here
http://osmaps.ordnancesurvey.co.uk/os<cite>map</cite>s/
and the normal text <cite>map</cite> here and again <cite>map</cite> here

Only the first set of tags is removed in each URL. How would I remove the subsequent ones?

Many thanks

user884899
  • Your regular expression asks for `http` or `www` to start the string, but the second match in the same URL starts after the end of the first match, where it does not find that starting token. You might just want to run it multiple times, until the string is no longer modified, to catch all instances within a given URL. – joanis Aug 20 '20 at 18:11
  • Thanks for that! Yes I was beginning to think that was the only way to do it – user884899 Aug 20 '20 at 19:22
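
The "run it until nothing changes" idea from the comments could be sketched roughly as below. This is only an illustration, not a tested production solution: the function names `strip_cite_in_urls_loop` and `strip_cite_in_urls_callback` are made up for this example, and the pattern is the one from the question with the dot in `www.` escaped. The second function shows a different technique (not the one discussed above): match each whole URL once with `preg_replace_callback` and strip the tags inside it. It assumes, as the question's pattern does, that a URL runs up to the next whitespace character.

```php
<?php
// Hypothetical helper: repeat the question's replacement until no match is left.
// Each pass removes one <cite>...</cite> pair per URL, so the loop always ends.
function strip_cite_in_urls_loop(string $text): string
{
    // Pattern from the question; only change is the escaped dot in "www\.".
    $pattern = '/((www\.)|(https?:\/\/))([^\s]*?)(<cite>([^\s]*?)<\/cite>)([^\s]*)/';
    do {
        $result = preg_replace($pattern, '$2$3$4$6$7', $text, -1, $count);
        if ($result === null) {
            // preg_replace returns null on a regex error; keep the last good text.
            return $text;
        }
        $text = $result;
    } while ($count > 0);
    return $text;
}

// Hypothetical alternative: match every URL once and remove the tags inside it.
// Unlike the loop above, this also removes unbalanced <cite> or </cite> tags.
function strip_cite_in_urls_callback(string $text): string
{
    $result = preg_replace_callback(
        '/(?:www\.|https?:\/\/)\S+/',
        function (array $m): string {
            return str_replace(['<cite>', '</cite>'], '', $m[0]);
        },
        $text
    );
    return $result ?? $text; // fall back to the input on a regex error
}

// Sample input taken from the question.
$this_record = "www.os<cite>map</cite>s.ordnancesurvey.co.uk/os<cite>map</cite>s/\n"
    . "and the normal text <cite>map</cite> here and again <cite>map</cite> here\n"
    . "http://os<cite>map</cite>s.ordnancesurvey.co.uk/os<cite>map</cite>s/\n"
    . "and the normal text <cite>map</cite> here and again <cite>map</cite> here";

echo strip_cite_in_urls_loop($this_record), "\n";
```

With the sample string, both functions should leave the two normal-text lines untouched and turn the URLs into `www.osmaps.ordnancesurvey.co.uk/osmaps/` and `http://osmaps.ordnancesurvey.co.uk/osmaps/`. The loop stays closest to the original expression; the callback version makes a single pass over the string and may be easier to read.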
