
I am trying to extract social media links from websites for my research. Unfortunately, I am not able to extract them because they are located in the footer of the website.

I tried the requests, urllib.request, and pattern.web APIs to download the HTML document of a webpage. All of them download the same content and fail to capture the content in the footer of the websites.

import requests
from bs4 import BeautifulSoup as soup 
url = 'https://cloudsight.ai/'
headers = {'User-Agent':'Mozilla/5.0'}
sm_sites = ['https://www.twitter.com', 'https://www.facebook.com',
            'https://www.youtube.com', 'https://www.linkedin.com',
            'https://www.linkedin.com/company', 'https://twitter.com',
            'https://facebook.com', 'https://youtube.com', 'https://linkedin.com',
            'http://www.twitter.com', 'http://www.facebook.com',
            'http://www.youtube.com', 'http://www.linkedin.com',
            'http://www.linkedin.com/company', 'http://twitter.com',
            'http://facebook.com', 'http://youtube.com', 'http://linkedin.com']

blocked = ['embed','search','sharer','intent','share','watch']

sm_sites_present = []

r = requests.get(url,headers=headers)
content = soup(r.content,'html.parser')
text = r.text

links = content.find_all('a',href=True)
for link in links:
    a = link.attrs['href'].strip('/')
    try:
        if any(site in a for site in sm_sites) and not any(block in a for block in blocked): 
            sm_sites_present.append(a)
    except:
        sm_sites_present.append(None)

output:

>>> sm_sites_present
[]

If you inspect the website with the browser's developer tools, the social media information is present inside the footer div of the DOM.

Even text.find('footer') returns -1.

I tried for many hours to figure out how to extract this footer information, and I failed.

So, I would appreciate it if anyone could help me solve this.

Note: I even tried regex; the problem is that when we download the page, the footer information is not downloaded at all.

  • This is not working because `'https://www.twitter.com' in 'https://twitter.com/CloudSightAPI'` returns `False`. So, remove the `if` statement inside your `try`, capture everything, and then refine whatever you want. – Rahul Agarwal May 17 '19 at 10:21
  • 2
    The content is added dynamically via JavaScript, so you won't find it in the HTML source you get. – NineBerry May 17 '19 at 10:22
  • Possible duplicate of [Web-scraping JavaScript page with Python](https://stackoverflow.com/questions/8049520/web-scraping-javascript-page-with-python) – NineBerry May 17 '19 at 10:26
  • As NineBerry stated, it's generated by JS. You'll need to use Selenium or [requests-HTML](https://html.python-requests.org/) (which supports JavaScript) to have the page render first. – chitown88 May 17 '19 at 12:19
  • Thank you guys for your answers. I shall try and update you soon – Uday Simha May 18 '19 at 07:56
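To illustrate Rahul Agarwal's point in the comments above: `'https://www.twitter.com'` is not a substring of `'https://twitter.com/CloudSightAPI'`, so the `in` test misses it. A minimal sketch of a domain-based check instead (the domain set and helper name below are illustrative, not from the original post):

from urllib.parse import urlparse

# Illustrative set of social media domains (not from the original post)
SOCIAL_DOMAINS = {'twitter.com', 'facebook.com', 'youtube.com', 'linkedin.com'}

def is_social_link(href):
    """Return True if the link's host matches a known social media domain."""
    host = urlparse(href).netloc.lower()
    if host.startswith('www.'):
        host = host[4:]  # treat www.twitter.com and twitter.com the same
    return host in SOCIAL_DOMAINS

print(is_social_link('https://twitter.com/CloudSightAPI'))  # True
print(is_social_link('https://example.com/contact'))        # False

This only fixes the matching, though; it does not solve the underlying problem that the footer is built by JavaScript and never appears in the downloaded HTML.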

1 Answer


As suggested by @chitown88, you can use Selenium to get the content.

from selenium import webdriver
from bs4 import BeautifulSoup

url = 'https://cloudsight.ai/'

# Load the page in a real browser so the JavaScript that builds the footer runs
driver = webdriver.Firefox()
driver.get(url)

# Grab the fully rendered HTML
html = driver.page_source

driver.quit()

soup = BeautifulSoup(html, 'html.parser')
[i.a['href'] for i in soup.footer.find_all('li', {'class': 'social-list__item'})]

output:

['https://www.linkedin.com/company/cloudsight-inc',
 'https://www.facebook.com/CloudSight',
 'https://twitter.com/CloudSightAPI']
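If launching a browser window is not desirable, the requests-HTML library that chitown88 mentioned in the comments can also render the JavaScript. A rough sketch, assuming requests_html and its headless Chromium backend are installed (untested against this site; the selectors mirror the Selenium answer above):

from requests_html import HTMLSession

url = 'https://cloudsight.ai/'

session = HTMLSession()
r = session.get(url)
r.html.render()  # downloads a headless Chromium on first use and executes the page's JavaScript

# Pull the social links out of the rendered footer
footer = r.html.find('footer', first=True)
social_links = [a.attrs['href'] for a in footer.find('li.social-list__item a')]
print(social_links)

Either way, the key point is the same: the links only exist after the page's JavaScript has run, so a plain requests download will never contain them.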