text:

<span id="p_code_">WHATIWANT</span>

code:

objRegExp.IgnoreCase = True
objRegExp.Global = True
objRegExp.Pattern = "\<(span\s+id=""(p_code_.*)[^\>]+)</span>"

I'm trying to extract the string WHATIWANT from between the span tags.

user566029
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – oluies May 13 '11 at 22:16

2 Answers

Don't parse (x)html with regex! That's what the DOM is for.

http://www.uv.tietgen.dk/staff/mlha/pc/web/script/vbscript/object/index.htm
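
As a rough illustration of the DOM route, here is a minimal sketch that assumes the script runs under Windows Script Host (cscript/wscript) and that the `htmlfile` COM object is available. The sample markup and the variable names (`strHtml`, `objDoc`, `objSpan`) are made up for this example. Because the id is checked by prefix, a dynamic suffix does not matter.

```vbscript
Option Explicit

Dim strHtml, objDoc, objSpan

' Hypothetical sample markup; in practice this would be the page source.
strHtml = "<html><body>" & _
          "<span id=""p_code_d567a356"" class=""blaf"">WHATIWANT</span>" & _
          "<span id=""other"">ignored</span>" & _
          "</body></html>"

' Let the HTML parser build a DOM instead of matching tags with a regex.
Set objDoc = CreateObject("htmlfile")
objDoc.write strHtml
objDoc.close

' Walk every <span> and keep those whose id starts with "p_code_".
For Each objSpan In objDoc.getElementsByTagName("span")
    If Left(objSpan.id, 7) = "p_code_" Then
        WScript.Echo objSpan.innerText
    End If
Next
```

With the sample markup above, this should echo only the text of the first span.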

bluepnume
  • I agree this would make sense. Unfortunately, the id for the tag is dynamic, hence the need for a regular expression. – user566029 May 14 '11 at 04:29

I think what you're looking for is the following:

' In a VBScript string literal, a double quote is written as "" (not \").
objRegExp.Pattern = "\<span id=""p_code_""\>(.*?)\<\/span\>"

It often helps to test a regex against sample strings before putting it in a script. I mostly just use TextMate's find function for this purpose, but here's a great web resource: http://rubular.com/

EDIT: based on the comment below, it looks like you need something more like:

objRegExp.Pattern = "\<span id=""p_code_d\d{3,}a\d{3,}""\>(.*?)\<\/span\>"

to match the "d567a356" part of the span's id. This assumes that the id will always end with something of the form: d(followed by three or more digits)a(followed by three or more digits).

EDIT 2:

Actually, this is more general:

objRegExp.Pattern = "\<span id=""p_code_.+?\b""\>(.*?)\<\/span\>"

This will match both of the following:

<span id="p_code_d567a356" class="blaf">WHATIWANT</span>

and

<span id="p_code_d567a3dsfasfdsaf56">WHATIWANT</span>
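
To show how the captured text is actually read back in VBScript, here is a minimal end-to-end sketch, assuming Windows Script Host. The sample string and the name `strHtml` are invented for the example, and the pattern is a variation (not the exact one above) that uses `[^""]*` for the dynamic part of the id and `[^>]*` to skip any further attributes.

```vbscript
Option Explicit

Dim objRegExp, objMatches, objMatch, strHtml

' Hypothetical input; quotes inside a VBScript string are doubled.
strHtml = "<span id=""p_code_d567a356"" class=""blaf"">WHATIWANT</span>"

Set objRegExp = New RegExp
objRegExp.IgnoreCase = True
objRegExp.Global = True
' Group 1 (SubMatches index 0) holds the text between the tags.
objRegExp.Pattern = "<span\s+id=""p_code_[^""]*""[^>]*>(.*?)</span>"

Set objMatches = objRegExp.Execute(strHtml)
For Each objMatch In objMatches
    ' objMatch.Value is the whole tag; SubMatches(0) is just the content.
    WScript.Echo objMatch.SubMatches(0)
Next
```

Note that `.` in the VBScript RegExp object does not match line breaks, so content that spans several lines would need something like `([\s\S]*?)` in place of `(.*?)`.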
Gavin Anderegg
  • Your `(.*)` is greedy. That means the closing tag will match the last `</span>` it can find, and everything before it will be swallowed by the `.*`, so group 1 can contain much more than expected. – stema May 13 '11 at 22:24
  • Sorry, you're right. I was assuming that because it was an id, it would appear exactly once in a document... but that doesn't mean it would be the only span in the document as well. – Gavin Anderegg May 13 '11 at 22:27
  • I've added a `?` to ungreedify things. – Gavin Anderegg May 13 '11 at 22:32
  • WHATISTHIS is the full tag and the d567a356 is a dynamic value. The regex doesn't seem to be working with vbscript – user566029 May 14 '11 at 04:34
  • I've added an edit that addresses the "d567a356" part of the id. You should add more information about the form of that id, or else it's really hard to accurately answer the question. Is it always d(followed by three or more numbers)a(followed by three or more numbers)? – Gavin Anderegg May 14 '11 at 14:53