Hope someone can help with this...

I'm using regex in XSLT to parse an HTML document. I'm looking for a regex that will return text that is NOT inside a valid p tag.

e.g.

I want to find this text
<p>I don't want to find this text</p>
I want to find this text
Jay
  • Is there a reason you can't just use an xpath expression of `text()`? Perhaps it would be worth including a little more context of the document you're transforming, and the template that you've tried – Rowland Shaw Jun 09 '14 at 11:23
  • Are you parsing HTML with RegEx? – Adriano Repetti Jun 09 '14 at 11:24
  • FYI added C# demo. :) – zx81 Jun 09 '14 at 11:41
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) –  May 14 '15 at 15:30

1 Answer

Parsing HTML with regex is risky business, even more so in your case because of the potential for nested tags. You probably don't want to do it.

That being said, with all the disclaimers, and given the simple samples you provided, you could examine this skeleton solution to see what would and wouldn't work with regex. I hope someone else will give you a DOM parser solution.

This skeleton solution uses this regex:

(?i)<(\w+).*?<\/\1[^>]*>|([a-z][a-z ]+)

Note that this is only a skeleton of a solution: the [a-z][a-z ]+, which matches the text you want, will have to be refined to incorporate the characters you wish to allow, such as digits, dashes and so forth. It cannot be a plain dot-star; otherwise it will eat up parts of the string that are meant to be consumed by the regex fragment to the left of the |.
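As a rough illustration of that refinement, here is a minimal sketch with a wider character class. The allowed set (letters, digits, space, apostrophe, comma, period, dash) is only an assumption for the example, and the names RefinedPatternSketch and TextOutsideTags are hypothetical. The key point is that the class still excludes < and >, so it cannot run into a tag.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RefinedPatternSketch
{
    // Left of |: a complete <tag>...</tag> element, matched only so it can be skipped.
    // Right of | (Group 2): a run of allowed characters, never '<' or '>'.
    // The allowed set below is an illustrative assumption; adjust it to your data.
    // Unlike the original [a-z][a-z ]+, the trailing * also accepts one-character runs.
    private static readonly Regex Pattern = new Regex(
        @"(?i)<(\w+).*?<\/\1[^>]*>|([a-z0-9][a-z0-9 ',.\-]*)");

    // Returns the Group 2 captures, with trailing spaces trimmed.
    public static List<string> TextOutsideTags(string input)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            if (m.Groups[2].Success)
                results.Add(m.Groups[2].Value.TrimEnd());
        }
        return results;
    }

    public static void Main()
    {
        string sample = "Order 66, re-sent\n<p>skip me</p>\nIt's done.";
        foreach (string s in TextOutsideTags(sample))
            Console.WriteLine(s);
    }
}
```

With this sample, the two lines outside the p element are printed and "skip me" is not.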

How does it work?

This is a situation where you want to exclude some content from being matched—in this case, tags. It is similar to this question about regex-matching a pattern unless...

The left side of the alternation | matches complete <something>...</something> elements. We will ignore these matches. The right side matches and captures "content" to Group 2, and we know (or hope) it is the right stuff because it was not matched by the expression on the left.
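To make the two sides of the alternation visible, this small sketch labels every match by the side that produced it. The names AlternationTrace and Trace and the KEEP/SKIP labels are made up for the illustration; the regex is the one above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AlternationTrace
{
    private static readonly Regex Pattern = new Regex(
        @"(?i)<(\w+).*?<\/\1[^>]*>|([a-z][a-z ]+)");

    // Returns one line per match, labelled by the side of the alternation that matched:
    // KEEP when Group 2 (right side) participated, SKIP when the left side matched.
    public static List<string> Trace(string input)
    {
        var lines = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            string label = m.Groups[2].Success ? "KEEP" : "SKIP";
            lines.Add(label + ": " + m.Value);
        }
        return lines;
    }

    public static void Main()
    {
        string sample = "I want to find this text\n"
                      + "<p>I don't want to find this text</p>\n"
                      + "I want to find this text";
        foreach (string line in Trace(sample))
            Console.WriteLine(line);
    }
}
```

For the sample from the question this prints a KEEP line for each of the two outer sentences and a SKIP line for the whole p element.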

This program shows how to use the regex (see the results at the bottom of the online demo):

using System;
using System.Collections.Specialized;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s1 = @"I want to find this text
<p>I don't want to find this text</p>
I want to find this text";
        // Left of |: a complete <tag>...</tag> element, matched only to be skipped.
        // Right of |: the text we want, captured in Group 2.
        var myRegex = new Regex(@"(?i)<(\w+).*?<\/\1[^>]*>|([a-z][a-z ]+)");
        var group2Caps = new StringCollection();

        // Put Group 2 captures in a list
        Match matchResult = myRegex.Match(s1);
        while (matchResult.Success)
        {
            if (matchResult.Groups[2].Value != "")
            {
                group2Caps.Add(matchResult.Groups[2].Value);
            }
            matchResult = matchResult.NextMatch();
        }

        Console.WriteLine("\n" + "*** Matches ***");
        foreach (string match in group2Caps) Console.WriteLine(match);

        Console.WriteLine("\nPress Any Key to Exit.");
        Console.ReadKey();
    } // END Main
} // END Program
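
The nested-tag caveat from the start of this answer is easy to reproduce. In this sketch (NestedTagCaveat and Kept are hypothetical names, and the input is a made-up sample), the lazy .*? stops at the first closing </p>, so text that is still inside the outer p element leaks into Group 2.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedTagCaveat
{
    private static readonly Regex Pattern = new Regex(
        @"(?i)<(\w+).*?<\/\1[^>]*>|([a-z][a-z ]+)");

    // Returns the Group 2 captures, i.e. what the regex believes is outside any tag.
    public static List<string> Kept(string input)
    {
        var kept = new List<string>();
        foreach (Match m in Pattern.Matches(input))
        {
            if (m.Groups[2].Success)
                kept.Add(m.Groups[2].Value);
        }
        return kept;
    }

    public static void Main()
    {
        // "leaked" sits inside the outer <p>, but the left side of the alternation
        // ends its match at the inner </p>, so "leaked" is captured anyway.
        string sample = "<p>outer <p>inner</p> leaked</p> tail";
        foreach (string s in Kept(sample))
            Console.WriteLine(s);
    }
}
```

Here both "leaked" and "tail" are reported, although only "tail" is really outside the p element. This is the kind of case where a DOM parser is the safer tool.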

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

zx81