Python regular expression confusion

Question

While reading the book Web Scraping with Python, I was confused by this regular expression:

webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

And a link usually looks like:

<a href="/view/Afghanistan-1">

My confusion is this:

Since `[^>]` means "not `>`", why is it followed by a `+`? The `+` seems useless.
I am also confused by `(.*?)`: since `*` already means "repeat 0 or more times", why does it need a `?` after it?

`[^>]+` means "1 or more characters that are not `>`". `*?` is a non-greedy quantifier - see http://stackoverflow.com/questions/3075130/difference-between-and-for-regex/3075532#3075532 — khelwood, Oct 12 '16 at 08:15
@khelwood Thanks! `*?` is clear to me now. But is there any difference between "1 character that is not `>`" and "1 or more characters that are not `>`"? — insomnia, Oct 12 '16 at 08:25
Yes, there is a difference between one and more than one. `"x"` is one character that is not `>`. `"xyz"` is more than one character, none of which is `>`. — khelwood, Oct 12 '16 at 08:27
[Regex101](https://www.regex101.com). Check your regex explanation here. — xssChauhan, Oct 12 '16 at 08:30

score 0 · Answer 1 · answered Oct 12 '16 at 08:38

`[^>]+` matches one or more characters that are not `>`, i.e. the whitespace after `<a` plus any other attributes and their values that come before `href`.
`*?` matches between zero and unlimited times, as few times as possible, expanding as needed, so the group captures only the text before the next `["\']`.
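
A quick way to see both points is to run the pattern next to two altered versions of it. This is only a sketch: the sample tags, the two variant patterns (one with a greedy `.*`, one without the `+`) and the variable names are made up for illustration and are not from the book.

```python
import re

# Hypothetical sample tags: one with extra attributes, one like the question's.
tag_with_attrs = '<a class="nav" href="/view/Afghanistan-1" title="Afghanistan">'
simple_tag = '<a href="/view/Afghanistan-1">'

# The pattern from the question: [^>]+ and a non-greedy (.*?).
webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

# Variant with a greedy (.*): it runs on to the LAST quote on the line.
greedy_regex = re.compile('<a[^>]+href=["\'](.*)["\']', re.IGNORECASE)

# Variant without the +: exactly one non-'>' character between <a and href.
no_plus_regex = re.compile('<a[^>]href=["\'](.*?)["\']', re.IGNORECASE)

lazy_result = webpage_regex.findall(tag_with_attrs)
greedy_result = greedy_regex.findall(tag_with_attrs)
no_plus_attrs = no_plus_regex.findall(tag_with_attrs)
no_plus_simple = no_plus_regex.findall(simple_tag)

print(lazy_result)     # ['/view/Afghanistan-1']
print(greedy_result)   # ['/view/Afghanistan-1" title="Afghanistan']
print(no_plus_attrs)   # []
print(no_plus_simple)  # ['/view/Afghanistan-1']
```

Without the `+`, the pattern still matches the simple tag, because a single space sits between `<a` and `href`, but it fails as soon as another attribute comes first. With a greedy `.*`, the capture swallows everything up to the last quote in the tag instead of stopping at the end of the URL.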

xssChauhan
