
While reading the book Web Scraping with Python, this regular expression confused me:

webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

A link usually looks like:

<a href="/view/Afghanistan-1">

My confusion is:

  1. Since [^>] means "any character except >", why is it followed by a +? The + seems useless.

  2. In (.*?), since * already means "repeat 0 or more times", why does it need a ? after it?

tripleee
insomnia
  • `[^>]+` means "1 or more characters that are not `>`". `*?` is a non-greedy quantifier - see http://stackoverflow.com/questions/3075130/difference-between-and-for-regex/3075532#3075532 – khelwood Oct 12 '16 at 08:15
  • @khelwood Thanks! `*?` is clear for me now. But is there any difference between "1 characters that are not `>`" and " 1 or more characters that are not `>`" ? – insomnia Oct 12 '16 at 08:25
  • Yes, there is a difference between one and more than one. `"x"` is one character that is not `>`. `"xyz"` is more than one character, none of which is `>`. – khelwood Oct 12 '16 at 08:27
  • [Regex101](https://www.regex101.com). Check your regex explanation here. – xssChauhan Oct 12 '16 at 08:30
  • @khelwood You are right! Got it – insomnia Oct 12 '16 at 08:32
  • @Regex101 Thanks! Nice tool – insomnia Oct 12 '16 at 08:33

1 Answer

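  1. [^>]+ matches one or more characters that are not >. In this pattern it consumes everything between <a and href: at minimum the space after <a, plus any other attributes and their values that come before href. Without the +, only a single character could sit between <a and href.

  2. *? matches between zero and unlimited times, as few times as possible, expanding only as needed. So (.*?) captures only the text that comes before the NEXT ["\'], instead of running on to the last quote on the line.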
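Both points can be checked with a small experiment. This is only a sketch: the sample `html` string and the two comparison patterns (`greedy_regex`, `single_char_regex`) are made up for illustration and are not from the book.

```python
import re

# The pattern from the question.
webpage_regex = re.compile('<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)

# Hypothetical variants, for comparison only.
greedy_regex = re.compile('<a[^>]+href=["\'](.*)["\']', re.IGNORECASE)       # .* instead of .*?
single_char_regex = re.compile('<a[^>]href=["\'](.*?)["\']', re.IGNORECASE)  # [^>] without +

# Made-up sample link with an attribute before href and one after it.
html = '<a class="nav" href="/view/Afghanistan-1" title="Afghanistan">Afghanistan</a>'
simple = '<a href="/view/Afghanistan-1">'

# Lazy .*? stops at the first closing quote after href.
lazy_links = webpage_regex.findall(html)

# Greedy .* runs to the last quote on the line, swallowing the title attribute.
greedy_links = greedy_regex.findall(html)

# Without +, only one character may sit between "<a" and "href".
single_on_simple = single_char_regex.findall(simple)  # the single space is enough here
single_on_html = single_char_regex.findall(html)      # ' class="nav" ' is too long

print(lazy_links)        # ['/view/Afghanistan-1']
print(greedy_links)      # ['/view/Afghanistan-1" title="Afghanistan']
print(single_on_simple)  # ['/view/Afghanistan-1']
print(single_on_html)    # []
```

So the + is what lets the pattern tolerate extra attributes before href, and the ? after * is what keeps the captured group from spilling past the end of the URL.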

xssChauhan