import regex
frase = "text https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one other text https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr"
x = regex.findall(r"/((http[s]?:\/\/)?(www\.)?(gamivo\.com\S*){1})", frase) 
print(x)

Result:

[('www.gamivo.com/product/sea-of-thieves-pc-xbox-one', '', 'www.', 'gamivo.com/product/sea-of-thieves-pc-xbox-one'), ('www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr', '', 'www.', 'gamivo.com/product/fifa-21-origin-eng-pl-cz-tr')]

I want something like:

[('https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one', 'https://gamivo.com/product/fifa-21-origin-eng-pl-cz-tr')]

How can I do this?

mkrieger1
Diego
  • Remove the first `/` and use non-capturing groups. `r'(?:https?://)?(?:www\.)?gamivo\.com\S*'`, see [this demo](https://regex101.com/r/phCIEr/1). – Wiktor Stribiżew Jul 23 '21 at 09:16
  • Do you really need a regex for this? Split on spaces and take the items that contain https from the resulting list. – leoOrion Jul 23 '21 at 09:17
  • @leoOrion Yes, it's for a bigger project that needs a regex. In the final project I will use str.replace() to swap each URL for a shortened link. – Diego Jul 23 '21 at 09:22

2 Answers


You need to

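  1. Remove the leading / from the pattern. It forces every match to begin at a / character, so the optional http:// or https:// prefix can never be part of the match: the match starts at the second / of https:// and the scheme is lost.
  2. Remove the unnecessary outer capturing group and the redundant {1} quantifier.
  3. Convert the optional capturing groups into non-capturing ones, (?:...). When a pattern contains capturing groups, findall returns tuples of the group values instead of the whole match.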
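
The effect of the capturing groups described in the steps above can be seen in a minimal sketch. The variable names are only illustrative, and `re.finditer` is shown as an alternative for cases where the groups have to stay in the pattern:

```python
import re

frase = "text https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one other text https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr"

# With capturing groups, findall returns one tuple of group values per match.
with_groups = re.findall(r"(https?://)?(www\.)?(gamivo\.com\S*)", frase)
print(with_groups[0])

# With non-capturing groups, findall returns each whole match as a string.
without_groups = re.findall(r"(?:https?://)?(?:www\.)?gamivo\.com\S*", frase)
print(without_groups[0])

# finditer exposes the whole match through group(0),
# even when the pattern still contains capturing groups.
whole_matches = [
    m.group(0)
    for m in re.finditer(r"(https?://)?(www\.)?(gamivo\.com\S*)", frase)
]
print(whole_matches == without_groups)
```
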

See this Python demo:

import re
frase = "text https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one other text https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr"
print( re.findall(r"(?:https?://)?(?:www\.)?gamivo\.com\S*", frase) )
# => ['https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one', 'https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr']

See the regex demo, too. Also, see the related post "re.findall behaves weird".
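
Since the comments mention that the matched URLs will later be replaced with shortened links, the same pattern could also be passed to `re.sub` with a callback instead of collecting matches first and calling str.replace(). This is only a sketch: `shorten` and `replace_links` are hypothetical names, and `shorten` is a stand-in for whatever link shortener the real project uses.

```python
import re

URL_PATTERN = re.compile(r"(?:https?://)?(?:www\.)?gamivo\.com\S*")


def shorten(url: str) -> str:
    # Hypothetical placeholder: a real project would call its own
    # link shortener here. This one just keeps the last path segment.
    return "<short:" + url.rsplit("/", 1)[-1] + ">"


def replace_links(text: str) -> str:
    # The callback receives each match object; group(0) is the whole URL.
    return URL_PATTERN.sub(lambda m: shorten(m.group(0)), text)


frase = "text https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one other text"
result = replace_links(frase)
print(result)
```

Doing the replacement in one pass avoids the case where str.replace() rewrites a shorter URL that happens to be a prefix of a longer one.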

Wiktor Stribiżew

Try this. It matches everything from http:// or https:// up to the next whitespace character (space or newline). Note that it matches any URL, not only gamivo.com links.

import re
frase = "text https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one other text https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr"
x = re.findall(r'https?://\S*', frase)  # raw string, so \s and \S are not treated as string escapes
print(x)
# ['https://www.gamivo.com/product/sea-of-thieves-pc-xbox-one', 'https://www.gamivo.com/product/fifa-21-origin-eng-pl-cz-tr']
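
The difference between this generic pattern and a domain-specific one only shows up when the text contains other links. A small sketch, assuming such input (the example.com URL and the shortened product path are made up for illustration):

```python
import re

text = "see https://www.gamivo.com/product/fifa-21 and https://example.com/page for details"

# Generic pattern: every http(s) URL up to the next whitespace.
any_url = re.findall(r"https?://\S*", text)

# Domain-specific pattern: only gamivo.com links.
gamivo_only = re.findall(r"(?:https?://)?(?:www\.)?gamivo\.com\S*", text)

print(any_url)
print(gamivo_only)
```
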
Pausi