Expert

I swear I have read a lot of regular expression material, but I still find the topic too difficult to understand. Any good suggestions?

Can anyone explain why <p[^>]*> matches, and so can be used to remove, an opening <p> or <p attr="">? And what should I do if I want to go from

<div style="float: left; width: 350px; border: 1px solid #000000;" class="star1">abcdk</div>

to this

<div class="star1">abcdk</div>

Thanks in advance.

Deduplicator
user610983
  • Read some tutorials, experiment, and don't parse HTML with regex. ;-) I'd suggest http://perldoc.perl.org/perlretut.html even if you are not into Perl. – Qtax Jul 14 '11 at 02:54
  • Using regex to parse HTML/XML is a losing battle. It's like using a screwdriver to hammer nails. You can do it with a lot of work but it's not a good solution. Please read the first answer to http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Jim Garrison Jul 14 '11 at 06:03
  • [The <center> cannot hold it is too late](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Maxpm Jul 14 '11 at 11:41

2 Answers

Suggestion: Play around with a regex tester to get the hang of what matches what.

Jason's explanation was good, but maybe not in-depth enough for you if you're just starting out with regexes. Let's take <p[^>]*> a piece at a time:

  • < has no special meaning to regex engines, so it simply matches a single literal <
  • p is the same: match a single p
  • [ is a special character. Square brackets in regex mean "any one of these". For instance, the regex [abc] would match any single character that is 'a', 'b', or 'c'.
  • ^ is a special character with two possible meanings. When inside of square brackets, like it is here, it's a simple negation. Whereas [abc] would match any single character mentioned, [^abc] would match any single character not mentioned.
  • > has no special meaning. Here it is the one character listed inside the brackets, so [^>] matches any single character that is not a >
  • ] should be obvious to you by now: it closes the opening [
  • * is a special character that says match 0 or more of what came before it; that means it could match nothing, or it could match a huge amount of stuff, as long as it all matches what comes before the *.
  • > still no special meaning, like earlier
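
To see the individual pieces in action, here is a minimal sketch in Python (the language and the sample strings are my own choices, not part of the question), using the standard `re` module:

```python
import re

# [abc] matches one character that is 'a', 'b', or 'c'.
in_set = re.findall(r"[abc]", "cabbage")
print(in_set)      # ['c', 'a', 'b', 'b', 'a']

# [^abc] matches one character that is NOT 'a', 'b', or 'c'.
not_in_set = re.findall(r"[^abc]", "cabbage")
print(not_in_set)  # ['g', 'e']

# * repeats the preceding item zero or more times, so [^>]* consumes
# everything up to (but not including) the next '>'.
run = re.match(r"[^>]*", 'attr="x">rest').group()
print(run)         # attr="x"

# Zero repetitions is also a valid match: here the result is an empty string.
empty = re.match(r"[^>]*", ">rest").group()
print(repr(empty)) # ''
```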

So we can break <p[^>]*> into three pieces, and we can say that it matches any series of characters:

  • <p: that starts with a literal <p,
  • [^>]*: which is followed by 0 or more characters that are not a >,
  • >: and which ends with a literal >.
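
As a rough illustration of the whole pattern, the sketch below uses Python's `re.sub` to delete every match. The sample HTML and the `\b` variant at the end are my own additions, not something from the question; the variant is only one possible way to stop the pattern from also matching tags such as <pre>.

```python
import re

OPEN_P = re.compile(r"<p[^>]*>")

html = '<p>one</p><p class="x" id="y">two</p>'

# Replacing each match with an empty string removes the opening tags.
# The closing </p> tags stay, because "</" is not "<p".
stripped = OPEN_P.sub("", html)
print(stripped)  # one</p>two</p>

# Caveat: the pattern only requires "<p" at the start, so <pre> matches too.
loose = OPEN_P.findall("<pre>code</pre><p>text</p>")
print(loose)     # ['<pre>', '<p>']

# A word boundary after the p (an assumption-level tweak) narrows it down.
strict = re.findall(r"<p\b[^>]*>", "<pre>code</pre><p>text</p>")
print(strict)    # ['<p>']
```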

Oh, and http://www.regular-expressions.info is one of the best regex guides I've ever found online.

Ryan Stewart
<p[^>]*>

Matches <p, followed by zero or more characters other than >, followed by a closing >. In this case, that is an opening <p> tag with or without attributes.

For a <div> tag, you could simply modify the regex above. A simple example being:

<div[^>]* class="star1">

For something more flexible (i.e. class attribute doesn't have to be at the end):

<div[^>]*class="star1"[^>]*>
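
The patterns above only find the tag; to actually turn the first <div> from the question into the second, one option is to capture the class attribute and rebuild the tag around it. The following is a minimal sketch in Python, not a general HTML rewriter. The function name `keep_only_class` is hypothetical, and it assumes attribute values are double-quoted and never contain a > character.

```python
import re

# Assumption: attribute values are double-quoted and contain no '>' character.
# The parentheses capture the class attribute so it can be reused as \1.
DIV_WITH_CLASS = re.compile(r'<div[^>]*\s(class="[^"]*")[^>]*>')


def keep_only_class(html: str) -> str:
    """Rewrite each opening <div ...> that has a class attribute as <div class="...">."""
    return DIV_WITH_CLASS.sub(r"<div \1>", html)


source = ('<div style="float: left; width: 350px; border: 1px solid #000000;" '
          'class="star1">abcdk</div>')
result = keep_only_class(source)
print(result)  # <div class="star1">abcdk</div>
```

A <div> without a class attribute does not match and is left untouched. As the comments on the question point out, anything more involved than this is usually better handled with a real HTML parser.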

I'd encourage you to learn more about regular expressions. They are a very powerful tool.

Jason McCreary