1

I have text fragments like this:

<strong>Media Event &quot;New Treatment Options on November 4&ndash;5, 2010, in Paris, France<br /></strong><a href="/news/electronic_press_kits/company_media_event_trap_eye.php">&gt;&gt; more</a>  

I would like to replace all underscores with dashes, but only inside the href attribute. As there are hundreds of files, the best approach is to process them with sed or a small shell script.

I started with

\shref=\"([^_].+?)([_].+?)\" 

but this matches only one underscore. I don't know how many underscores there will be, and I'm stuck on how to replace the underscores dynamically across an unknown number of back-references.

Bart Kiers
Boris

2 Answers

1

Regular expressions are fundamentally the wrong tool for this job: there is too much context that must be matched.

Instead, you'll need to write something that goes character by character, with two modes: one in which it just copies all input, and one in which it replaces each underscore with a dash. On finding the start of an href it enters the second mode; on leaving the href it returns to the first. This is essentially a limited form of a tokenizer.
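A minimal sketch of such a two-mode scanner, written as a shell function around awk. The function name href_dashes is made up for illustration, and the sketch assumes that every href value is double-quoted and does not span lines. It also treats any occurrence of href=" as the start of an href, so an attribute such as data-href would be rewritten too.

```shell
#!/bin/sh
# Sketch only: copies input unchanged, except that underscores inside
# href="..." values become dashes. href_dashes is a hypothetical name.
# Assumes double-quoted href values that start and end on the same line.
href_dashes() {
    awk '
    {
        out = ""
        in_href = 0                  # mode flag: 1 while inside an href value
        n = length($0)
        i = 1
        while (i <= n) {
            if (!in_href && substr($0, i, 6) == "href=\"") {
                out = out "href=\""  # copy the opener, then switch to replace mode
                in_href = 1
                i += 6
                continue
            }
            c = substr($0, i, 1)
            if (in_href && c == "\"") {
                in_href = 0          # closing quote: back to plain copy mode
            } else if (in_href && c == "_") {
                c = "-"              # the only rewrite this sketch performs
            }
            out = out c
            i++
        }
        print out
    }' "$@"
}
```

It reads standard input or the files named as arguments, for example href_dashes page.html > page.new, where page.html is a placeholder file name.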

wnoise
1

A tool that's specifically geared toward working with HTML is by far preferable since trying to work with it using regexes can lead to madness.

However, assuming that there's only one href per line, you might be able to use this divide-and-conquer technique:

sed 's/\(.*href="\)\([^"]*\)\(".*\)/\1\n\2\n\3/;:a;s/\(\n.*\)_\(.*\n\)/\1-\2/;ta;s/\n//g' inputfile

Explanation:

  • s/\(.*href="\)\([^"]*\)\(".*\)/\1\n\2\n\3/ - put newlines around the contents of the href
  • :a;s/\(\n.*\)_\(.*\n\)/\1-\2/;ta - replace the underscores one by one in the text between the newlines; t branches back to label :a whenever a substitution was made
  • s/\n//g - remove the newlines added in the first step
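Since the question mentions hundreds of files, here is a rough sketch of how the sed command above might be applied in bulk. The names prog, fix_hrefs and fix_files are invented for this example. It assumes GNU sed, because \n in the replacement text and a semicolon directly after a label are not portable to every sed, and it keeps the one-href-per-line assumption.

```shell
#!/bin/sh
# The sed program from the answer, stored once for reuse (GNU sed assumed).
prog='s/\(.*href="\)\([^"]*\)\(".*\)/\1\n\2\n\3/;:a;s/\(\n.*\)_\(.*\n\)/\1-\2/;ta;s/\n//g'

# Hypothetical helper: filter stdin or the named files to stdout.
fix_hrefs() {
    sed "$prog" "$@"
}

# Hypothetical helper: rewrite each named file via a temporary copy,
# replacing the original only if sed succeeded.
fix_files() {
    for f in "$@"; do
        fix_hrefs "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

A call such as fix_files ./*.php would then rewrite every matching file in the current directory; trying it on copies first is advisable, since the originals are overwritten.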
Dennis Williamson