0

I am analysing a dataset of Tweets for a project and want to create a new feature that gives either a binary value indicating the presence of a URL or an integer count of the URLs in each Tweet. I am not very experienced with pandas, and so far I have only used simple features such as length and whether the text contains certain words (see below).

df['LEN']=df.TWEET.str.len()
df['HAS_WORD_CHECK_OUT']=df.TWEET.str.contains('check out')

Since URLs can be in so many different formats (www.website.com, https://www.website.com, website.com, etc.), I can't find a solution on how to create this feature. If anyone knows a way please let me know.

doelie247

3 Answers

0

This is the regular expression I used in my project. It handles both of these cases: "www.website.com" and "https://www.website.com".

import re
def has_url(s):
    # True if the regex finds at least one URL-like substring in s
    return re.search(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', s) is not None
# Apply to the raw tweet text column
df['HAS_URL'] = df['TWEET'].apply(has_url)
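
A quick way to see which of the URL formats from the question this pattern catches is to run it on one sample per format. This is only a sketch: the `samples` list and the `has_url` helper are illustrative names, and the check uses `re.search` on the same pattern.

```python
import re

# Same pattern as in this answer, compiled once for reuse
URL_PATTERN = re.compile(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')

def has_url(text):
    # True when the pattern matches anywhere in the text
    return URL_PATTERN.search(text) is not None

# One hypothetical sample per URL format mentioned in the question
samples = [
    'see www.website.com',
    'see https://www.website.com',
    'see website.com',
    'see website.com/page',
]

results = {s: has_url(s) for s in samples}
for text, found in results.items():
    print(f"{text!r}: {found}")
```

On these samples, the bare `website.com` is the only one not flagged: the pattern's third alternative requires a slash after the top-level domain, so a domain without a scheme, a `www` prefix, or a path is skipped. If bare domains matter for your feature, you would need a looser pattern.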
Niv Dudovitch
0

You can use requests to check whether a URL is reachable, and online you can find excellent regexes that people have already written to find URLs. Using those, you can first find the URLs and then check whether they're valid.

import re
import requests

tweet = 'everyone, head to my stores at http://www.buystuff.com/shirts and at www.codeman.com'
urls = re.findall(r"\b((?:https?://)?(?:(?:www\.)?(?:[\da-z\.-]+)\.(?:[a-z]{2,6})|(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)|(?:(?:[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,7}:|(?:[0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,6})|:(?:(?::[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(?::[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(?:ffff(?::0{1,4}){0,1}:){0,1}(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])|(?:[0-9a-fA-F]{1,4}:){1,4}:(?:(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(?:25[0-5]|(?:2[0-4]|1{0,1}[0-9]){0,1}[0-9])))(?::[0-9]{1,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?:/[\w\.-]*)*/?)\b", tweet)

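for url in urls:
    # Add a scheme when the match has none, so requests can parse it
    if not url.startswith('http'):
        url = 'https://' + url
    try:
        # A timeout keeps an unresponsive host from hanging the loop
        response = requests.get(url, timeout=5)
        print(f"{url} is valid and exists on the internet")
    except requests.RequestException:
        # Covers connection errors, timeouts, and malformed URLs
        print(f"{url} could not be reached")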
0

You can use regular expressions combined with .str.contains to detect patterns in strings as well as specific words.

import pandas as pd
example = pd.DataFrame({'TWEET':['hello',
                             'I visited www.google.com', 
                             'goodbye', 
                             'I like https://www.bbc.co.uk', 
                             'I like wwwhales','checkout imdb.edu']})

I took the URL regex from here. It works for my test cases above, but you might want slightly different classifications, in which case search for a different pattern. You can also test the regexes here.

# The trailing group is non-capturing so str.contains does not warn about match groups
url_regex = r'[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

example.TWEET.str.contains(url_regex,regex=True)
# [False, True, False, True, False, True]
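
The question also asks for the number of URLs per Tweet. One way to get that, sketched here under the assumption that the same pattern is good enough for counting, is `Series.str.count`, which counts the non-overlapping matches in each string. The `tweets` frame and the `URL_COUNT` and `HAS_URL` column names are made up for illustration, and the pattern is written with a non-capturing group, which matches the same text.

```python
import pandas as pd

# URL pattern from this answer, with a non-capturing trailing group
url_regex = r'[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b(?:[-a-zA-Z0-9()@:%_\+.~#?&//=]*)'

# Hypothetical sample data; the last tweet contains two URLs
tweets = pd.DataFrame({'TWEET': ['hello',
                                 'I visited www.google.com',
                                 'see www.a.com and http://b.org/x']})

# Number of non-overlapping pattern matches in each tweet
tweets['URL_COUNT'] = tweets['TWEET'].str.count(url_regex)

# Binary 0/1 flag derived from the count
tweets['HAS_URL'] = (tweets['URL_COUNT'] > 0).astype(int)

print(tweets[['URL_COUNT', 'HAS_URL']])
```

Deriving the flag from the count keeps the two features consistent with each other, since both come from the same set of matches.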
oli5679