
I am trying to extract social media links from websites for my research. Unfortunately, I am not able to extract them because they are located in the footer of the website.

I tried the requests, urllib.request, and pattern.web APIs to download the HTML document of a webpage. All of them download the same content and fail to capture the content in the footer of the websites.

import requests
from bs4 import BeautifulSoup as soup 
url = 'https://cloudsight.ai/'
headers = {'User-Agent':'Mozilla/5.0'}
sm_sites = ['https://www.twitter.com', 'https://www.facebook.com',
            'https://www.youtube.com', 'https://www.linkedin.com',
            'https://www.linkedin.com/company', 'https://twitter.com',
            'https://facebook.com', 'https://youtube.com', 'https://linkedin.com',
            'http://www.twitter.com', 'http://www.facebook.com',
            'http://www.youtube.com', 'http://www.linkedin.com',
            'http://www.linkedin.com/company', 'http://twitter.com',
            'http://facebook.com', 'http://youtube.com', 'http://linkedin.com']

blocked = ['embed','search','sharer','intent','share','watch']

sm_sites_present = []

r = requests.get(url,headers=headers)
content = soup(r.content,'html.parser')
text = r.text

links = content.find_all('a',href=True)
for link in links:
    a = link.attrs['href'].strip('/')
    try:
        if any(site in a for site in sm_sites) and not any(block in a for block in blocked): 
            sm_sites_present.append(a)
    except:
        sm_sites_present.append(None)

output:

>>> sm_sites_present
[]

If you inspect the website with the browser's developer tools, the social media information is present inside the footer div of the DOM.

Even text.find('footer') returns -1.

I tried for many hours to figure out how to extract this footer information, and I failed.

So, I would appreciate it if anyone could help me solve this.

Note: I even tried regex; the problem is that when we download the page, the footer information is not downloaded at all.

  • This is not working because `'https://www.twitter.com' in 'https://twitter.com/CloudSightAPI'` returns `False`. So, remove the `if` statement inside your `try`, capture everything, and then refine whatever you want. – Rahul Agarwal May 17 '19 at 10:21
  • 2
    The content is added dynamically via JavaScript, so you won't find it in the HTML source you get. – NineBerry May 17 '19 at 10:22
  • Possible duplicate of [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – NineBerry May 17 '19 at 10:26
  • As NineBerry stated, it's generated by JS. You'll need to use Selenium or [requests-HTML](https://html.python-requests.org/) (which supports JavaScript) to have the page render first. – chitown88 May 17 '19 at 12:19
  • Thank you guys for your answers. I shall try and update you soon – Uday Simha May 18 '19 at 07:56
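To illustrate Rahul Agarwal's point in the comments above: `'https://www.twitter.com'` is not a substring of `'https://twitter.com/CloudSightAPI'`, so the `in` test misses it. A minimal sketch of a domain-based check instead (the domain set and helper name below are illustrative, not from the original post):

from urllib.parse import urlparse

# Illustrative set of social media domains (not from the original post)
SOCIAL_DOMAINS = {'twitter.com', 'facebook.com', 'youtube.com', 'linkedin.com'}

def is_social_link(href):
    """Return True if the link's host matches a known social media domain."""
    host = urlparse(href).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]  # treat www.twitter.com and twitter.com the same
    return host in SOCIAL_DOMAINS

print(is_social_link('https://twitter.com/CloudSightAPI'))  # True
print(is_social_link('https://example.com/contact'))        # False

This only fixes the matching, though; it does not solve the underlying problem that the footer is built by JavaScript and never appears in the downloaded HTML.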

1 Answer


As suggested by @chitown88, you can use Selenium to get the content.

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://cloudsight.ai/'

# Load the page in a real browser so the JavaScript that builds the footer runs
driver = webdriver.Firefox()
driver.get(url)

# Grab the fully rendered HTML
html = driver.page_source

driver.quit()

soup = BeautifulSoup(html, 'html.parser')
[i.a['href'] for i in soup.footer.find_all('li', {'class': 'social-list__item'})]

output:

['https://www.linkedin.com/company/cloudsight-inc',
 'https://www.facebook.com/CloudSight',
 'https://twitter.com/CloudSightAPI']
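If launching a browser window is not desirable, the requests-HTML library that chitown88 mentioned in the comments can also render the JavaScript. A rough sketch, assuming requests_html and its headless Chromium backend are installed (untested against this site; the selectors mirror the Selenium answer above):

from requests_html import HTMLSession

url = 'https://cloudsight.ai/'

session = HTMLSession()
r = session.get(url)
r.html.render()  # downloads a headless Chromium on first use and executes the page's JavaScript

# Pull the social links out of the rendered footer
footer = r.html.find('footer', first=True)
social_links = [a.attrs['href'] for a in footer.find('li.social-list__item a')]
print(social_links)

Either way, the key point is the same: the links only exist after the page's JavaScript has run, so a plain requests download will never contain them.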