I have a bunch of HTML that I am parsing, and I need to remove certain <a> tags if they contain certain text. Normally I'd use goquery, but the text I am searching for often falls outside the <a> element itself. For instance, in this HTML:

<html><body>
This is the start.            
<a href="http://example.com/path">We don't want to match this text.</a>
<a href="http://www.example.com/another/path" style="font-family:Arial, Helvetica, 'sans-serif'; color:#838383;font-size:12px; line-height:14px"></a> match this text.<a href="blah">We also don't want to match this text</a>
</body></html>

I am using this regexp, but it fails and matches text I don't want to match:

(?is)<a[^>]+href=["'](?P<link>.*?)["']*.?> match this text\.

https://regex101.com/r/iEXpqc/1

Wiktor Stribiżew
user_78361084
  • `.` matches any char. Actually, you still should consider some HTML parser. If you want to use a regex, you should think of some workarounds with negated character classes, see [an example](https://regex101.com/r/iEXpqc/2). – Wiktor Stribiżew Dec 14 '19 at 22:07
  • Yeah, I'm thinking the same but couldn't figure it out with Goquery. The example posted matches the wrong text, btw. – user_78361084 Dec 14 '19 at 22:11
  • Yeah, it is not quite clear anyway what the criteria for a match are. – Wiktor Stribiżew Dec 14 '19 at 22:12
  • Have you considered an XPath package? XPath can be a little horrific but it does support looking inside text nodes. – mu is too short Dec 14 '19 at 22:22
  • See: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – leaf bebop Dec 15 '19 at 03:36

1 Answer

Like this, using xmlstarlet and an XPath expression (not Go, but the logic can be re-implemented):

xmlstarlet ed -d '//a[contains(text(), "want to match")]' file.html

 Output

<?xml version="1.0"?>
<html>
  <body>
This is the start.  

<a href="http://www.example.com/another/path" style="font-family:Arial, Helvetica, 'sans-serif'; color:#838383;font-size:12px; line-height:14px"/> match this text.
</body>
</html>

 Note

  • Add the -L switch (placed right after `ed`) if you want to edit the file in place instead of printing the result.
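
For a Go-only route, here is a minimal sketch that uses just the standard library `regexp` package. It is not a port of the XPath expression above. Instead it follows the negated-character-class workaround suggested in the comments, and it assumes the goal is to find the empty <a> element that is immediately followed by the literal text " match this text.". The names `sampleHTML`, `emptyAnchorThenText`, `findLink` and `removeAnchor` are made up for illustration. A regex like this is brittle on real-world HTML, so treat it as a starting point rather than a robust parser.

```go
package main

import (
	"fmt"
	"regexp"
)

// sampleHTML is the document from the question.
const sampleHTML = `<html><body>
This is the start.
<a href="http://example.com/path">We don't want to match this text.</a>
<a href="http://www.example.com/another/path" style="font-family:Arial, Helvetica, 'sans-serif'; color:#838383;font-size:12px; line-height:14px"></a> match this text.<a href="blah">We also don't want to match this text</a>
</body></html>`

// emptyAnchorThenText matches an empty <a> element followed by the target text.
// [^>]* and [^"']* cannot run past the end of a tag or an attribute value,
// so the match cannot start in an earlier <a> tag the way `.*?` with (?s) can.
var emptyAnchorThenText = regexp.MustCompile(
	`(?i)<a\s[^>]*?href=["'](?P<link>[^"']*)["'][^>]*>\s*</a>(?P<text>\s*match this text\.)`)

// findLink returns the href of the first matching anchor, if any.
func findLink(html string) (string, bool) {
	m := emptyAnchorThenText.FindStringSubmatch(html)
	if m == nil {
		return "", false
	}
	return m[1], true // group 1 is the named group "link"
}

// removeAnchor deletes every matching <a>...</a> but keeps the text after it.
func removeAnchor(html string) string {
	return emptyAnchorThenText.ReplaceAllString(html, "${text}")
}

func main() {
	if link, ok := findLink(sampleHTML); ok {
		fmt.Println("link:", link)
	}
	fmt.Println(removeAnchor(sampleHTML))
}
```

The two anchors that wrap their own text are left alone because the pattern requires `</a>` to follow the opening tag with nothing but whitespace in between.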
Gilles Quénot