I am trying to extract every domain name from a string that starts with http or https and ends in .com.

The string is:

string="jssbhshhahttps://www.one.comsbshhshshttp://www.another.comhehsbwkwkwjhttp://www.again.co.uksbsbs"

I have created the pattern as follows:

pattern=re.compile("https?://")

I am not sure how to finish it off.

I would like to return a list of all domains that start with http or https and end in .com only, so there should be no .co.uk domains in the output.

I have tried using (.*) in the middle to represent any combination of characters, but I am not sure how to finish it off.

Any help would be much appreciated and it would be great if all parts of the expression could be explained.

Wiktor Stribiżew
Fred Smith
  • Similar questions have been asked here before, and a pattern like `pattern=re.compile(r"https?://\S*?(?=https?://|$|\s)")` is usually suggested. The only problem here is if you need to stop right after TLD. Then, you either need a list of TLDs you need to support (e.g. `https?://\S*?\.(?:com|co\.uk)` for the ones in question), or you are stuck. – Wiktor Stribiżew Feb 15 '21 at 12:16
  • That is what I have not been able to resolve in the other questions: how to stop after the .com and continue looking for others. Would be great if the expression could be explained also – Fred Smith Feb 15 '21 at 12:22
  • Does it mean you only want to stop after the first `.com` or `.co.uk` only? Do you only want to support these two TLDs? – Wiktor Stribiżew Feb 15 '21 at 12:25
  • If you want to be able to handle any URL in the known universe, you can look at https://stackoverflow.com/a/190405/5987669 for the technically correct solution (Assuming it is compatible with python). – Locke Feb 15 '21 at 12:28
  • Hi, thanks for all the responses. I would like to return a list of all domains that start with http or Https and end in .com only. So no .co.uk domains. – Fred Smith Feb 15 '21 at 12:31
  • Thanks @Locke but it needs to be a little more specific – Fred Smith Feb 15 '21 at 12:31
  • So, you are after `https?://(?:(?!https?://)\S)*?\.com`? https://regex101.com/r/LVt2OX/1? – Wiktor Stribiżew Feb 15 '21 at 12:44
  • Thank you wiktor that does what I need however I need to explain it and I am a bit confused what the bits between the Https?:// And the .com do. Would you be able to explain it for me? – Fred Smith Feb 15 '21 at 12:47
  • Please check the answer and you may also study the explanation at the regex101 link. – Wiktor Stribiżew Feb 15 '21 at 12:48

1 Answer

You can use

https?://(?:(?!https?://)\S)*?\.com

See the regex demo (https://regex101.com/r/LVt2OX/1). To make the regex case insensitive, pass the re.I flag or add the (?i) inline flag at the start of the pattern.
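As a rough sketch of the two case-insensitive options, the snippet below runs the pattern against a made-up sample string (the `text` value is an assumption for illustration, not taken from the question):

```python
import re

# Hypothetical sample with mixed-case schemes and TLDs (not from the question).
text = "xxHttps://www.One.COMyyHTTP://www.two.comzz"

base = r"https?://(?:(?!https?://)\S)*?\.com"

# Option 1: pass the re.I (re.IGNORECASE) flag.
with_flag = re.findall(base, text, flags=re.I)

# Option 2: put the (?i) inline flag at the very start of the pattern.
with_inline = re.findall(r"(?i)" + base, text)

# Without either option, the lowercase pattern finds nothing in this sample.
case_sensitive = re.findall(base, text)

print(with_flag)
print(with_inline)
print(case_sensitive)
```

Both options should behave the same way here; the inline form is handy when the flags argument is not under your control.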

Details

  • https?:// - http:// or https://
  • (?:(?!https?://)\S)*? - zero or more non-whitespace chars, as few as possible, each of which must not be the start of an http:// or https:// char sequence (this construct is known as a "tempered greedy token")
  • \.com - a .com string.
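
The parts above can be put together in Python roughly as follows. This is a minimal sketch using `re.findall` on the string from the question. The second string, `tricky`, is a made-up example (an assumption, not from the question) that shows why the tempered greedy token is there: a plain `\S*?` would run across the next `http://` to reach a later `.com`.

```python
import re

string = ("jssbhshhahttps://www.one.comsbshhshshttp://www.another.com"
          "hehsbwkwkwjhttp://www.again.co.uksbsbs")

# https?://            - the scheme, http:// or https://
# (?:(?!https?://)\S)*? - as few non-whitespace chars as possible,
#                         never stepping onto the start of another URL
# \.com                - a literal ".com"
pattern = re.compile(r"https?://(?:(?!https?://)\S)*?\.com")

domains = pattern.findall(string)
print(domains)  # ['https://www.one.com', 'http://www.another.com']

# Hypothetical input where a .co.uk URL is directly followed by a .com URL.
tricky = "http://www.again.co.ukhttp://www.one.com"

# Without the lookahead, the match swallows both URLs as one.
naive = re.findall(r"https?://\S*?\.com", tricky)

# With the lookahead, only the real .com URL is returned.
tempered = pattern.findall(tricky)

print(naive)     # ['http://www.again.co.ukhttp://www.one.com']
print(tempered)  # ['http://www.one.com']
```

Note that the pattern stops at the first `.com` after the scheme, so it cannot tell a `.com` TLD from a `.com` that happens to appear elsewhere in a longer run of text.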
Wiktor Stribiżew