
I have a previously matched pattern such as:

<a href="somelink here something">

Now I wish to extract only the value of one or more specific attributes in the tag, but the attribute may be anything and may occur anywhere in the tag.

regex_pattern=re.compile('href=\"(.*?)\"') 

I can use the above to match the attribute together with its value, but I need to extract only the (.*?) part (the value).

I can of course strip href=" and " afterwards, but I'm sure regex can be used properly to extract only the required part.

In simple words I want to match

abcdef=\"______________________\"

in the pattern but want only the

____________________

part.

How do I do this?

ffledgling
  • I'll post the obligatory link to [this](http://stackoverflow.com/a/1732454/566644) StackOverflow classic. Are you sure you don't want to use a proper XML parser? – Lauritz V. Thaulow Jul 27 '12 at 09:09
  • Though that link is fun, I see no harm in matching specific attributes with a regular expression. Do be aware of the limitations of regular expressions when it comes to dealing with XML and HTML (which the linked post so entertainingly expands upon); use `lxml` or `BeautifulSoup` instead. – Martijn Pieters Jul 27 '12 at 09:22
  • I'll agree HTML is a bit of a hassle if you want to parse any kind of (abstract) HTML in general, i.e. if you try to create an all-purpose parser or something. For fixed-format pages (scraping purposes :P) it works, though I still need a little help from the outside. – ffledgling Jul 27 '12 at 23:24

2 Answers


Just use re.search('href=\"(.*?)\"', yourtext).group(1) on the matched string yourtext; .group(1) returns only the text captured by the parentheses, without the surrounding href=" and ".
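One thing to watch for: re.search returns None when the pattern is not found, so calling .group(1) directly raises AttributeError on a tag without an href. A minimal sketch of a guarded version follows; the helper name extract_href is made up for illustration and is not part of the re module.

```python
import re

def extract_href(tag_text):
    """Return the value of the first href="..." in tag_text, or None."""
    match = re.search(r'href="(.*?)"', tag_text)
    # re.search returns None when nothing matches, so check before .group()
    return match.group(1) if match else None

print(extract_href('<a href="somelink here something">'))
print(extract_href('<a name="top">'))
```

The first call prints the captured value only; the second prints None instead of raising.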

Ben Ruijl

Take a look at the .group() method on regular expression MatchObject results.

Your regular expression has an explicit capturing group (the part in () parentheses), and the .group() method gives you direct access to the string that was matched within that group. MatchObject instances are produced by several re functions and methods: .search() returns one (or None if there is no match), and .finditer() yields one for every match.

Demonstration:

>>> import re
>>> example = '<a href="somelink here something">'
>>> regex_pattern=re.compile('href=\"(.*?)\"') 
>>> regex_pattern.search(example)
<_sre.SRE_Match object at 0x1098a2b70>
>>> regex_pattern.search(example).group(1)
'somelink here something'
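Since the question asks about arbitrary attributes that may appear more than once, here is a rough sketch of the same idea with .finditer(). The function name attribute_values and the sample markup are assumptions made up for this example, and it only handles double-quoted values.

```python
import re

def attribute_values(text, attribute):
    """Yield every double-quoted value of `attribute` found in text."""
    # re.escape guards against attribute names containing regex metacharacters
    pattern = re.compile(r'\b' + re.escape(attribute) + r'="(.*?)"')
    for match in pattern.finditer(text):
        # group(1) is only the captured value, without name or quotes
        yield match.group(1)

html = '<a href="first.html" title="One"><a href="second.html">'
print(list(attribute_values(html, 'href')))
print(list(attribute_values(html, 'title')))
```

Each match object yielded by .finditer() supports .group(1) exactly like the one returned by .search().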

From the Regular Expression syntax documentation on the (...) parenthesis syntax:

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].
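The \number sequence mentioned in that passage can be useful here too. As a small illustrative sketch (the pattern name quoted_href is made up), group 1 captures the opening quote and \1 requires the same character to close the value, so both quoting styles are handled and the value ends up in group 2:

```python
import re

# Group 1 captures the opening quote; \1 requires the same character to close.
quoted_href = re.compile(r'''href=(["'])(.*?)\1''')

for tag in ('<a href="double.html">', "<a href='single.html'>"):
    match = quoted_href.search(tag)
    # The attribute value is now group 2, because group 1 holds the quote
    print(match.group(2))
```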

Martijn Pieters