response = 'li><a href="/beautifulsoup/" title="BeautifulSoup"><......'

What I intend to capture is /beautifulsoup/

This is the correct code:

link = re.findall(r'href=\"?([^\" ]+)', response)

This is my code:

link = re.findall(r'><a\b href=\"? .\"\b', response)

I have three questions:

1) Why are square brackets used? I thought they were only for a sequence of characters.

2) Why is there no '.' after the question mark in the correct code?

3) Why are parentheses used? I thought they were only for grouping, and no grouping seems to be required here.

  • You are getting downvotes from others because your question is trivially answered by google. However, the good news is you can check out https://regex101.com/ and put in any string and regex you want, and it will return the result plus color-coded explanations of each part. But also, please don't do too much [parsing of html with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – HFBrowning Dec 19 '17 at 02:52
  • BTW, [don't use regular expressions to parse html/xml](https://stackoverflow.com/a/1732454/1477064). Learn how to use DOM and an xml parser. Personally I recommend lxml – xvan Dec 19 '17 at 03:18

1 Answer

1) Square brackets are not for sequences of characters. They match any one character from inside the brackets. [abc] matches either a, b or c. If you use [^...], it will match any one character that is not inside the brackets. [^abc] matches anything that is not a, b or c.

2) The [^\" ] part essentially takes the place of your '.'. It matches any single character except a double quote or a space. With the + quantifier, it keeps matching (greedily) until it reaches a quote or a space.

3) Since you want to get /beautifulsoup/, grouping is required. With (), /beautifulsoup/ is captured in group 1, and re.findall returns the contents of that group. Without a group, you get the whole match, which is href="/beautifulsoup/
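
The three points above can be checked with a short sketch. It reuses the `response` string from the question; the variable names and the `'cab!'` test string are made up for illustration, and the pattern is written without the backslashes before the quotes, since they are not needed inside a single-quoted raw string.

```python
import re

response = 'li><a href="/beautifulsoup/" title="BeautifulSoup"><......'

# 1) A character class matches exactly ONE character from the set;
#    a leading ^ negates the set.
in_set = re.findall(r'[abc]', 'cab!')       # each of c, a, b matches on its own
not_in_set = re.findall(r'[^abc]', 'cab!')  # only '!' is outside the set

# 2) and 3) [^" ]+ consumes characters until it reaches a quote or a space.
#    With parentheses, findall returns only the captured group.
with_group = re.findall(r'href="?([^" ]+)', response)

#    Without parentheses, findall returns the whole match instead.
without_group = re.findall(r'href="?[^" ]+', response)

print(in_set)         # ['c', 'a', 'b']
print(not_in_set)     # ['!']
print(with_group)     # ['/beautifulsoup/']
print(without_group)  # ['href="/beautifulsoup/']
```

The only difference between the last two patterns is the pair of parentheses, which is what turns the result from the full match into just the path.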

Sweeper