I need to use a regex instead of a parser to lift attributes from an HTML/XML page, but I can't make the regex <span class='street-address'> (?<Street>.*) lift 2346 21st Ave NE from the following text (spaced exactly like that) in Rubular.

<span class='street-address'>
2346 21st Ave NE
</span>

Also, the regex I have only works if I condense the text onto one line and there are spaces after the first HTML tag and before the last HTML tag. If I change the regex to eliminate those spaces, then spaced HTML tags are skipped. I want to make the regex as flexible as possible.

How can I construct a regex that works regardless of whether there are spaces or line breaks after/before the HTML tags?

exlo
  • You shouldn't change your question after it has already been answered, since that creates confusion and makes existing answers obsolete. You should create a new question for your newer needs. – Federico Piazza May 11 '15 at 02:09
  • I see, I'll create a new question. Thank you @Fede. – exlo May 11 '15 at 02:10
  • no problem, I'll help too :) – Federico Piazza May 11 '15 at 02:12
  • @Fede, you're a freakin champion. I'll delete this comment later since it's a non-question/non-info comment, but is there any way I can add to your rep? I'm new to StackOverflow and coding, so I'm as dumb as a brick. – exlo May 11 '15 at 02:20
  • lol, I'm glad to help. What do you mean by add to my rep? – Federico Piazza May 11 '15 at 02:22
  • Hm, in other forums I can either comment, upvote your profile, or give you points. Is the best way to select best answer in StackOverflow? – exlo May 11 '15 at 02:25
  • you can offer bounties or upvote answers if you consider them right. – Federico Piazza May 11 '15 at 02:27

1 Answer

As almost every answer about (X)HTML and regex will tell you, you should not use regex to parse HTML unless you know exactly what HTML content is involved. I would use an HTML parser instead.
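To illustrate the parser route, here is a minimal sketch using Python's standard-library html.parser module. The class name StreetAddressParser and the sample page string are made up for this example; the original post does not say which parser or language the asker would use.

```python
from html.parser import HTMLParser


class StreetAddressParser(HTMLParser):
    """Collects the text inside <span class="street-address"> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while we are inside a matching span
        self._chunks = []    # text pieces seen inside the current span
        self.addresses = []  # finished, whitespace-trimmed results

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            # A nested span inside the target: track it so that its
            # closing tag does not end the capture early.
            self._depth += 1
        elif "street-address" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.addresses.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Hypothetical input, spaced like the snippet in the question.
page = "<span class='street-address'>\n2346 21st Ave NE\n</span>"

parser = StreetAddressParser()
parser.feed(page)
parser.close()
print(parser.addresses)
```

Because the parser deals with quoting, attribute order, and line breaks itself, the surrounding whitespace only has to be stripped once at the end.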

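That said, you just need to enable the s flag (single-line mode, which lets . match line breaks) and use a lazy quantifier. In Ruby, which Rubular uses, the same behavior is enabled with the m option instead.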

<span class='street-address'>(?<Street>.*?)<\/span>

You can also use the inline s flag like this:

(?s)<span class='street-address'>(?<Street>.*?)<\/span>
 ^--- here
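As a rough check of the flag's effect, the sketch below runs the same pattern with and without it in Python. Two things differ from the regex above and are assumptions of this example: Python's re module writes named groups as (?P<Street>...), and the flag is called re.DOTALL (the inline (?s) form also works). The html string is a hypothetical copy of the snippet from the question.

```python
import re

# Hypothetical input, spaced like the snippet in the question.
html = "<span class='street-address'>\n2346 21st Ave NE\n</span>"

# Same pattern as above, with Python's (?P<name>...) named-group syntax.
pattern = r"<span class='street-address'>(?P<Street>.*?)</span>"

# Without the flag, "." stops at the line break, so nothing matches.
no_flag = re.search(pattern, html)

# With the flag passed as an argument, or inline as (?s), "." crosses lines.
with_flag = re.search(pattern, html, flags=re.DOTALL)
inline = re.search("(?s)" + pattern, html)

# The capture still contains the surrounding line breaks; strip them.
street = with_flag.group("Street").strip()
print(street)
```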

On the other hand, if you don't want to use regex flags, you can use a well-known trick: a character class that combines two complementary sets, such as [\s\S]:

<span class='street-address'>(?<Street>[\s\S]*?)<\/span>
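To cover the "with or without spaces" part of the question, one possible variation is to put \s* on both sides of the group so the surrounding whitespace is consumed outside the capture. This is a sketch in Python (named groups written as (?P<Street>...)); the name STREET_RE and the three sample strings are invented for the example.

```python
import re

# [\s\S]*? crosses line breaks without any flag; the \s* on either side
# keeps leading and trailing whitespace out of the captured group.
STREET_RE = re.compile(
    r"<span class='street-address'>\s*(?P<Street>[\s\S]*?)\s*</span>"
)

# Hypothetical inputs: line breaks, single spaces, and no padding at all.
samples = [
    "<span class='street-address'>\n2346 21st Ave NE\n</span>",
    "<span class='street-address'> 2346 21st Ave NE </span>",
    "<span class='street-address'>2346 21st Ave NE</span>",
]

streets = [STREET_RE.search(s).group("Street") for s in samples]
print(streets)
```

The lazy quantifier matters here: it lets the trailing \s* claim the final whitespace instead of leaving it inside the capture.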

For reference, here is what the pieces of [\s\S] mean:

\s     --> matches whitespace (spaces, tabs, line breaks)
\S     --> matches non whitespace (same as: [^\s])
[\s\S] --> matches whitespace or non whitespace (so any character, including line breaks)

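You can use this trick with any shorthand class and its complement, like:

[\s\S] whitespace or non whitespace
[\w\W] word or non word
[\d\D] digit or non digit

It does not extend to [\b\B]: \b and \B are zero-width assertions rather than character sets, and inside a character class many engines read \b as a backspace character.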
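A quick way to convince yourself that a shorthand class combined with its complement matches every character is to compare it with a plain dot on a string that contains a line break. The sketch below does this in Python; the text value is an arbitrary sample.

```python
import re

# Arbitrary sample: a letter, a space, a digit, a line break, an underscore.
text = "a 1\n_"

# Each class pairs a shorthand with its complement, so it matches any character.
classes = [r"[\s\S]", r"[\w\W]", r"[\d\D]"]
matched = {cls: re.findall(cls, text) for cls in classes}

# A plain dot (no flags) skips the line break.
dot_only = re.findall(r".", text)

for cls, chars in matched.items():
    print(cls, chars)
print(".", dot_only)
```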
Federico Piazza
  • I see, but why are the backslashes before `s` not needed? The new regex is not solving the issue of line breaks in Rubular though. According to http://qntm.org/files/re/re.html, line breaks can be addressed with `(^?.*?$)<\/span>`, using `^` and `$`, but it's not returning any matches. – exlo May 11 '15 at 01:44
  • @exlo The anchors you meant, `^` and `$`, are used to match the start and end of a line respectively. They are commonly used with `m` (multiline flag). Btw, I didn't understand `why are the backslashes before s not needed`... – Federico Piazza May 11 '15 at 01:47
  • If you don't mind me asking, how does including an opposite set of non-whitespace and whitespace characters work? (Ah, I misunderstood the s line flag, so I was asking about the backslashes, the backslashes were in the context of the set you mentioned afterward since I'd come across it in my research). Also, I couldn't find the single-line flag in your first example. – exlo May 11 '15 at 01:50
  • @exlo The `s` flag is in the second text box. About the trick, I updated the answer with the explanation. Keep in mind that `\s` and `\S` are in a character class `[\s\S]`, so that's how you can use the trick. Think of a character class as a set, for instance `[123]` will match 1, 2 or 3... so if you have `[\s\S]` it will match whitespace or non whitespace... the same happens for `[\w\W]` (word or non word) or for `[\d\D]` (digit or non digit) – Federico Piazza May 11 '15 at 01:56
  • @Fede is giving you excellent advice - use a parser. Must read on this topic (and for life in general): http://stackoverflow.com/a/1732454/505191 – thekbb May 11 '15 at 02:11