Code:

text2 = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)

Output:

['https://m.facebook.com/people/Vick-Arcadia/100009629167118/', 'https://m.facebook.com<span', 'https://m.facebook.com<span',
MonkeyZeus

2 Answers

In general, regexes aren't powerful enough to handle HTML, which is tree-structured and has matching openers and closers.

The preferred technique is to use a parser designed for HTML. In the Python world, lxml and BeautifulSoup are popular choices.
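As a rough illustration of the parser approach, here is a minimal sketch. It swaps in the standard library's `html.parser` instead of lxml or BeautifulSoup so that it runs without third-party packages. The `LinkCollector` class, the `extract_links` helper, and the sample markup are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every <a> tag in the given HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Made-up markup, loosely modeled on the output shown in the question.
sample = (
    '<div><a href="https://m.facebook.com/people/Vick-Arcadia/100009629167118/">'
    '<span class="name">Vick Arcadia</span></a></div>'
)
print(extract_links(sample))
```

Because the parser only looks at attribute values, URLs that merely appear next to markup in the page text never pick up a trailing `<span`.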

Raymond Hettinger

This regex should work better (the group is non-capturing so that re.findall returns the whole URL rather than just the path):

r'https?://[\w.]+(?:/[/\w-]+)?'
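To see the difference, here is a small sketch comparing the question's pattern with a non-capturing variant of the pattern above. The sample text is made up. The likely cause of the stray `<span` is the character class `[$-_@.&+]`: inside a class, `$-_` is a range covering ASCII 0x24 through 0x5F, which includes `<` and `>`.

```python
import re

# Pattern from the question. The range $-_ (0x24-0x5F) includes '<' and '>'.
ORIGINAL = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# The answer's pattern with a capturing group, and a non-capturing variant.
CAPTURING = r'https?://[\w.]+(/[/\w-]+)?'
SUGGESTED = r'https?://[\w.]+(?:/[/\w-]+)?'

# Made-up sample text for illustration.
text = (
    '<a href="https://m.facebook.com/people/Vick-Arcadia/100009629167118/">'
    'https://m.facebook.com<span class="name">Vick</span></a>'
)

original_matches = re.findall(ORIGINAL, text)
# With a single capturing group, findall returns only that group's text.
capturing_matches = re.findall(CAPTURING, text)
suggested_matches = re.findall(SUGGESTED, text)

print(original_matches)
print(capturing_matches)
print(suggested_matches)
```

On this sample the question's pattern runs on into `<span`, the capturing version returns only the path part (or an empty string), and the non-capturing version returns the bare URLs.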

I recommend testing regexes on https://regex101.com/

For working with HTML, though, it is better to use the BeautifulSoup library. If you add more details, I can help you with this.

Moonar