Detect presence of a URL in pandas

Question

I am analysing a dataset of tweets for a project and want to create a new feature that gives either a binary value indicating the presence of a URL or an integer count of the URLs in each tweet. I am not very experienced with pandas, and so far I have only used simple features such as length and whether the tweet contains certain words (see below):

df['LEN']=df.TWEET.str.len()
df['HAS_WORD_CHECK_OUT']=df.TWEET.str.contains('check out')

Since URLs can come in so many different formats (www.website.com, https://www.website.com, website.com, etc.), I can't work out how to create this feature. If anyone knows a way, please let me know.

score 0 · Answer 1 · Niv Dudovitch · 2021-09-19T12:07:44.787

This is the regular expression I used in my project. It works for both of these cases: "www.website.com" and "https://www.website.com".

import re

# Returns True when the text contains at least one URL-like substring.
def contains_url(s):
    return re.search(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', s) is not None

# Apply to the tweet text, not to the boolean HAS_WORD_CHECK_OUT column.
df['HAS_URL'] = df['TWEET'].apply(contains_url)

edited Sep 19 '21 at 12:07

answered Sep 19 '21 at 11:56

Niv Dudovitch


I think this one only works for links in the format https://www.website.com(/subdir). I'm trying to make it work for all formats – doelie247 Sep 19 '21 at 12:06
Which format doesn't this work for? – Niv Dudovitch Sep 19 '21 at 12:08
There are a lot of questions and answers on this subject; try to find one that fits your case. – Niv Dudovitch Sep 19 '21 at 12:15
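
To settle the question raised in these comments, here is a minimal sketch that runs the regex from Answer 1 against the formats listed in the question. The sample strings (including the extra one with a path) and the names URL_PATTERN, samples and results are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern copied verbatim from Answer 1
URL_PATTERN = re.compile(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')

# Formats from the question, plus one bare domain followed by a path (hypothetical samples)
samples = ['www.website.com', 'https://www.website.com', 'website.com', 'website.com/page']

# Map each sample to whether the pattern finds a URL in it
results = {s: URL_PATTERN.search(s) is not None for s in samples}

for s, found in results.items():
    print(f"{s!r}: {found}")
```

With these samples, only the bare website.com goes undetected: the pattern accepts a domain without a scheme or www prefix only when a slash follows the top-level domain, as in website.com/page.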

score 0 · Answer 2 · answered Sep 19 '21 at 12:16

You can use requests to check whether a URL is valid, and there are plenty of ready-made regexes online for finding URLs. Combining the two, you can first find the URLs and then check whether they actually resolve.

import re
import requests

tweet = 'everyone, head to my stores at http://www.buystuff.com/shirts and at www.codeman.com'
urls = re.findall(r"\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b", tweet)

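for url in urls:
    # Prepend a scheme so requests can handle bare domains such as www.codeman.com
    if not url.startswith('http'):
        url = 'https://' + url
    try:
        # A timeout keeps an unresponsive host from blocking the loop
        requests.get(url, timeout=5)
        print(f"{url} is valid and exists on the internet")
    except (requests.ConnectionError, requests.Timeout):
        print(f"{url} does not exist on the internet")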

score 0 · Answer 3 · answered Sep 19 '21 at 12:16

You can use regular expressions combined with .str.contains to detect patterns in strings as well as specific words.

import pandas as pd
example = pd.DataFrame({'TWEET':['hello',
                             'I visited www.google.com', 
                             'goodbye', 
                             'I like https://www.bbc.co.uk', 
                             'I like wwwhales','checkout imdb.edu']})

I took the URL regex from here. It works for the test cases above, but if you need a slightly different classification, search for another pattern. You can also test the regexes here.

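# The trailing group is non-capturing, (?:...), so str.contains does not warn about match groups
url_regex = r'[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

example.TWEET.str.contains(url_regex, regex=True)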
# [False, True, False, True, False, True]
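
The question also asks for the number of URLs per tweet. A minimal sketch of that, assuming the same regex with its trailing group made non-capturing: Series.str.count gives the count, and comparing it with zero gives the binary flag. The sample tweets and the column names N_URLS and HAS_URL are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, similar to the test cases above
example = pd.DataFrame({'TWEET': ['hello',
                                  'I visited www.google.com',
                                  'goodbye',
                                  'I like https://www.bbc.co.uk and imdb.edu',
                                  'I like wwwhales']})

# Same pattern as above, with the trailing group written as non-capturing
url_regex = r'[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

# Number of non-overlapping regex matches in each tweet
example['N_URLS'] = example['TWEET'].str.count(url_regex)

# Binary feature derived from the count
example['HAS_URL'] = example['N_URLS'] > 0

print(example[['TWEET', 'N_URLS', 'HAS_URL']])
```

Keep in mind that this pattern is permissive: it also matches strings such as 3.5, so the counts may include false positives on real tweets.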
