I am new to Python and am working on a fake news detection algorithm. I have a problem extracting the name of the site from a URL.

Question

text = 'https://www.nytimes.com/2017/10/09/us/politics/corkers-blast-at-trump-has-other-republicans-nodding-in-agreement.html?rref=collection%2Fsectioncollection%2Fpolitics\r\n'

test = re.findall(r"^http* com$",text)

The result I get:

test = [ ]

The output I am expecting would be like this:

www.nytimes.com

Welcome to Stack Overflow. Don't use the question title for your question description; put that in the description section. Your question title should be a summary of your problem (ideally) stated as a question (refer to [How to Ask](https://stackoverflow.com/help/how-to-ask) for examples and details). Also be sure to include all relevant question tags; as your question is about a non-matching regular expression; including the `regex` tag would have been a good choice to categorize your question further. — Ivo Mori, Jul 28 '20 at 05:27
If you need a more reliable way to split URL to parts it is better to use `urllib.parse` module instead of regular expressions. Check this [answer](https://stackoverflow.com/a/56476496/6682517). — Sergey Shubin, Jul 28 '20 at 08:43
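
As a minimal sketch of the `urllib.parse` approach suggested in the comment above, the host can be read from the parsed URL instead of being matched by hand. The helper name `site_name` is made up for illustration; `urlsplit` and its `hostname` attribute are part of the standard library.

```python
from urllib.parse import urlsplit


def site_name(url):
    """Return the host part of a URL, e.g. 'www.nytimes.com'."""
    # strip() drops surrounding whitespace such as a trailing CR/LF
    return urlsplit(url.strip()).hostname


text = 'https://www.nytimes.com/2017/10/09/us/politics/corkers-blast-at-trump-has-other-republicans-nodding-in-agreement.html?rref=collection%2Fsectioncollection%2Fpolitics\r\n'

print(site_name(text))  # www.nytimes.com
```

Note that `hostname` is lowercased and excludes any port or credentials, which is usually what you want when comparing sites.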

score 1 · Answer 1 · answered Jul 28 '20 at 04:39


Your regex pattern is wrong. The pattern must not contain a space, `*` should be `.*` (on its own, `p*` only means "zero or more p"), and the `$` anchor at the end has to go because the URL does not end in `com`. Try this:

>>> re.findall(r"^http.*?com", text)
['https://www.nytimes.com']
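
To make the three fixes concrete, here is a small sketch that runs the original pattern and the corrected one side by side on the URL from the question. The third pattern, with a capture group to drop the scheme, is an added variation and not part of the answer itself.

```python
import re

text = 'https://www.nytimes.com/2017/10/09/us/politics/corkers-blast-at-trump-has-other-republicans-nodding-in-agreement.html?rref=collection%2Fsectioncollection%2Fpolitics\r\n'

# Original pattern: needs a literal space and "com" at the very end, so it never matches
original = re.findall(r"^http* com$", text)

# Corrected pattern: lazy ".*?" stops at the first "com"
fixed = re.findall(r"^http.*?com", text)

# Variation (assumption): a capture group returns only the host, without the scheme
host_only = re.findall(r"^https?://(.*?com)", text)

print(original)   # []
print(fixed)      # ['https://www.nytimes.com']
print(host_only)  # ['www.nytimes.com']
```

All three patterns still hard-code `com`, so they will not work for hosts under other top-level domains.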


Prem Anand


score 0 · Answer 2 · answered Jul 28 '20 at 05:02


You can also try this:

test = re.findall(r"www.+com",text)

Output:

['www.nytimes.com']
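
One caveat, shown as a rough sketch: in the pattern above the dot is unescaped and `.+` is greedy, so the match runs up to the last `com` in the string. The URL below is hypothetical and chosen so that its path contains a second `com`; the stricter pattern is one possible alternative, assuming the host starts with `www.`.

```python
import re

# Hypothetical URL whose path contains a second "com"
url = "https://www.example.com/news/comments.html"

# Greedy pattern from the answer: overruns into the path
greedy = re.findall(r"www.+com", url)

# Stricter sketch: escaped dot, then everything up to the first "/" or whitespace
strict = re.findall(r"www\.[^/\s]+", url)

print(greedy)  # ['www.example.com/news/com']
print(strict)  # ['www.example.com']
```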


Bhargav Desai


score 0 · Answer 3 · answered Jul 28 '20 at 13:24


This will match http or https and also any type of domain (.gov.us, .com.de, .edu...)

test = re.findall(r"^http.*\:\/\/(.*?)\/",text)

^http = beginning with http

^http.* = matches http or https (strictly, http followed by any characters)

\:\/\/ = escaped ://

(.*?) = the capture group, i.e. the part you want (without the ?, it would match up to the last /)

\/ = first occurrence of "/"
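
The breakdown above can be turned into a small helper. This is only a sketch: the name `host_from_url` and the example URLs are made up, and the pattern is a slightly tightened variant (`https?` instead of `http.*`, and `[^/\s]+` instead of `(.*?)\/`) so that a URL with no slash after the host still matches.

```python
import re

# Variant of the pattern above: scheme, "://", then everything up to the next "/" or whitespace
HOST_RE = re.compile(r"^https?://([^/\s]+)")


def host_from_url(url):
    """Return the host of an http(s) URL, or None if the pattern does not match."""
    match = HOST_RE.match(url)
    return match.group(1) if match else None


print(host_from_url("https://www.nytimes.com/2017/10/09/us/politics/"))  # www.nytimes.com
print(host_from_url("http://example.gov.us/page"))                       # example.gov.us
print(host_from_url("https://example.com"))                              # example.com
print(host_from_url("ftp://example.com/file"))                           # None
```

Unlike `urllib.parse`, this keeps any port or credentials that appear in the host part, so it is best suited to simple URLs like the one in the question.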


Joao Vitorino


thanks for your answers – Sarhane Mourad Jul 31 '20 at 21:43
