I am trying to extract social media links from websites for my research. Unfortunately, I cannot extract them because they are located in the footer of the page.
I tried the requests, urllib.request, and pattern.web APIs to download the HTML document of a webpage. All of them download the same content, and all of them fail to download the content in the footer.
import requests
from bs4 import BeautifulSoup as soup
url = 'https://cloudsight.ai/'
headers = {'User-Agent':'Mozilla/5.0'}
# Social media URL prefixes (http and https, with and without www)
sm_sites = ['https://www.twitter.com', 'https://www.facebook.com',
            'https://www.youtube.com', 'https://www.linkedin.com',
            'https://www.linkedin.com/company', 'https://twitter.com',
            'https://facebook.com', 'https://youtube.com', 'https://linkedin.com',
            'http://www.twitter.com', 'http://www.facebook.com',
            'http://www.youtube.com', 'http://www.linkedin.com',
            'http://www.linkedin.com/company', 'http://twitter.com',
            'http://facebook.com', 'http://youtube.com', 'http://linkedin.com']
# Substrings that mark share/embed URLs rather than profile links
blocked = ['embed', 'search', 'sharer', 'intent', 'share', 'watch']
sm_sites_present = []
r = requests.get(url, headers=headers)
content = soup(r.content, 'html.parser')
text = r.text
# Collect every anchor that carries an href attribute
links = content.find_all('a', href=True)
for link in links:
    a = link.attrs['href'].strip('/')
    try:
        # Keep links that match a social media prefix and contain no blocked word
        if any(site in a for site in sm_sites) and not any(block in a for block in blocked):
            sm_sites_present.append(a)
    except Exception:
        sm_sites_present.append(None)
Output:
>>> sm_sites_present
[]
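As a side note on the filtering step, the long list of URL prefixes could probably be replaced by a comparison on the parsed host name. This is only a minimal sketch using the standard library: `SOCIAL_HOSTS`, `BLOCKED`, and `is_social_link` are hypothetical names that are not part of the script above, and the behaviour is not identical to the substring test (it only looks at the host of the link itself). It does not change what gets downloaded, so it would not fix the empty result on its own.

```python
from urllib.parse import urlparse

# Hypothetical names for illustration; not part of the original script.
SOCIAL_HOSTS = {'twitter.com', 'facebook.com', 'youtube.com', 'linkedin.com'}
BLOCKED = ('embed', 'search', 'sharer', 'intent', 'share', 'watch')


def is_social_link(href):
    """Return True if href points at a listed host and has no blocked word."""
    try:
        parsed = urlparse(href.strip())
    except ValueError:
        # Malformed URLs (for example an unclosed IPv6 bracket) are skipped
        return False
    if parsed.scheme not in ('http', 'https'):
        return False
    host = (parsed.hostname or '').lower()
    # Treat www.example.com and example.com as the same host
    if host.startswith('www.'):
        host = host[4:]
    if host not in SOCIAL_HOSTS:
        return False
    return not any(word in href for word in BLOCKED)
```

With this helper, the loop body would reduce to something like `if is_social_link(a): sm_sites_present.append(a)`.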
If you inspect the website with the browser's Inspect Element tool, the social media information is in the footer div of the DOM.
Even a simple text.find('footer') on the downloaded text returns -1.
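The check above is a plain substring search. A slightly more structured version, sketched here with the standard library only, looks for a `<footer>` tag or any element whose `id` or `class` mentions "footer". `FooterFinder` and `has_footer_markup` are hypothetical names introduced for this illustration.

```python
from html.parser import HTMLParser


class FooterFinder(HTMLParser):
    """Records whether any tag is <footer> or has 'footer' in its id/class."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == 'footer':
            self.found = True
            return
        for name, value in attrs:
            if name in ('id', 'class') and value and 'footer' in value.lower():
                self.found = True


def has_footer_markup(html):
    """Return True if the HTML string contains footer-like markup."""
    finder = FooterFinder()
    finder.feed(html)
    return finder.found
```

Calling `has_footer_markup(text)` on the downloaded page would show whether any footer markup arrived at all. If it returns False while the browser inspector shows a footer, one possible explanation (an assumption, not verified for this site) is that the footer is added to the DOM after the initial HTML is delivered, which a plain HTTP download would not see.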
I have spent many hours trying to figure out how to extract this footer information, without success.
I would appreciate any help in solving this.
Note: I also tried regex. The problem is that when the page is downloaded, the footer information is not part of the downloaded content.