
I have the following string I am trying to search for:

<td></td>
<td>)</td>

There can be any number of spaces between the </td> and the <td>, in addition to the newline. There will always be exactly one newline, but an expression that simply ignores all whitespace (including newlines) is fine.

I'm trying to figure out how to perform a string replacement using this information without collapsing all of the whitespace in the file. I found many solutions with an expression that handles whitespace but nothing that I have been able to make work with a newline as well.

My regex experience is limited. How should I approach this problem from a bash shell environment?

devnull
Zhro
  • use dom [ and xpath ] to manipulate html documents – hek2mgl Oct 23 '13 at 10:41
  • You haven't said exactly what you're trying to replace with what. If you read up on regex in, for example, `sed` or `awk`, you'll find options for recognizing newlines. – lurker Oct 23 '13 at 10:46
  • Please show example output (after replacement) – Bohemian Oct 23 '13 at 10:51
  • Then there's this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – lurker Oct 23 '13 at 10:59
  • What do you want to replace in the outfile? Something within the tags or those two lines? – beroe Oct 23 '13 at 23:31

2 Answers


If I understand you correctly, you're trying to match an empty <td> tag pair followed by a newline and a <td> tag pair containing a single closing parenthesis (with any number of spaces after the first </td> and/or before the second <td>). If that's correct, try the following expression:

<td></td> *\n *<td>)</td>

Beware that sed processes its input one line at a time, so a pattern normally cannot match across a newline. You need to use a label and a loop to append the following lines to the pattern space before doing the substitution (see here for a full explanation):

sed ':a;N;$!ba;s|<td></td> *\n *<td>)</td>|...|g' infile >outfile

Replace the ellipsis (...) with your actual replacement text.
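To see the command above in action, here is a minimal sketch run against a small made-up sample. It assumes GNU sed (the one-line `:a;N;$!ba` form may need to be split across lines on other implementations), and both the sample rows and the replacement text `<td>REPLACED</td>` are placeholders chosen only for illustration:

```shell
# Hypothetical sample input: trailing spaces after the first cell,
# leading spaces before the second one.
input=$(printf '%s\n' '<tr>' '<td></td>   ' '   <td>)</td>' '<td>keep</td>' '</tr>')

# :a;N;$!ba  -> loop that appends every following line to the pattern space
# s|...|...|g -> substitute across the embedded newline (\n)
# "<td>REPLACED</td>" is a placeholder, not part of the original question.
result=$(printf '%s\n' "$input" |
  sed ':a;N;$!ba;s|<td></td> *\n *<td>)</td>|<td>REPLACED</td>|g')

printf '%s\n' "$result"
```

Only the two matched lines are merged and replaced; the other lines and their whitespace are left as they were.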

Community
Ansgar Wiechers
  • This works perfectly. I had to edit a typo (!$ should be $!); still awaiting peer review. – Zhro Oct 24 '13 at 09:49
<td>\s*?\)?\s*?</td>

This will match a <td> element containing an optional ) and any amount of whitespace between the tags. I'm not sure, though, whether that is the string you're actually looking for.

However, the gist of it is using \s as the character class for whitespace, including newlines.
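As a rough sketch of how the `\s` idea could be applied from a shell, the example below uses it between the two cells from the question. It assumes GNU sed, where `\s` is available as an extension, and it still slurps the file first because sed otherwise never sees the newline. The pattern is adapted to sed's basic regex syntax, in which an unescaped `)` is a literal character. The sample input and the replacement `<td>REPLACED</td>` are made up for illustration:

```shell
# Hypothetical sample: two spaces, a newline and a tab separate the cells.
result=$(printf '<td></td>  \n\t<td>)</td>\n<td>other</td>\n' |
  sed ':a;N;$!ba;s|<td></td>\s*<td>)</td>|<td>REPLACED</td>|g')

printf '%s\n' "$result"
```

Unlike a pattern that only allows spaces around the newline, `\s*` here also absorbs the tab, so any mix of whitespace between the two cells is accepted.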

Allan S. Hansen
  • No need to use reluctant quantifiers IMO. – RokL Oct 23 '13 at 10:49
  • Possible, but as far as I can see there's no harm in having them in this expression either. – Allan S. Hansen Oct 23 '13 at 10:51
  • It depends on what you're matching, but they can be slower. If a shorter match is not preferred, or not even possible, it's best to stick with greedy quantifiers. In this case, `\s*?\)`, the shortest and the longest possible match are always the same, and the reluctant quantifier only results in a lot of backtracking. – RokL Oct 23 '13 at 11:11