
I need to parse multiple HTML pages fetched through requests.get(). I just need to keep the text content of each page and get rid of the embedded JavaScript and CSS. I saw the following posts, but no solution works for me: http://stackoverflow.com/questions/14344476/how-to-strip-entire-html-css-and-js-code-or-tags-from-html-page-in-python, http://stackoverflow.com/questions/1936466/beautifulsoup-grab-visible-webpage-text, and http://stackoverflow.com/questions/2081586/web-scraping-with-python

I have working code, but it strips neither the JS nor the CSS. Here is my code...

count = 1
for link in clean_urls[:2]:
    try:
        page = requests.get(link, timeout=5)
        clean_page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(clean_page)
        webpage_out.close()
        count += 1
    except:
        pass

I tried to include the solutions from the links mentioned above, but none of the code works for me. What line of code can get rid of the embedded JS and embedded CSS?

Question Update 4 OCT 2016

The file passed to read.csv is something like this...

trump,clinton
data science, operating system
windows,linux
diabetes,cancer

I hit gigablast.com with those terms, searching one row at a time; one search would be trump clinton. The result is a list of URLs. I requests.get(url) each of them, discard timeouts and status_code 400s, and build a clean list, clean_urls = []. After that I run the following code...
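The per-row search described above could be sketched like this; the in-memory sample stands in for the real CSV file, whose name isn't given in the question:

```python
import csv
from io import StringIO

# The sample rows from the question; in practice this would be the real file.
sample = "trump,clinton\ndata science,operating system\nwindows,linux\ndiabetes,cancer\n"

queries = []
for row in csv.reader(StringIO(sample)):
    # One search per row, e.g. "trump clinton"; strip() handles the stray
    # space after the comma in "data science, operating system".
    queries.append(" ".join(term.strip() for term in row))
```

Each entry of `queries` would then be sent to the search engine to produce the list of candidate URLs.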

count = 1
for link in clean_urls[:2]:
    try:
        page = requests.get(link, timeout=5)
        clean_page = BeautifulSoup(page.content, 'html.parser').text
        webpage_out = open(my_params['q'] + '_' + str(count) + '.txt', 'w')
        webpage_out.write(clean_page)
        webpage_out.close()
        count += 1
    except:
        pass

On this line of code, page = BeautifulSoup(page.content, 'html.parser').text, I get the text of the entire web page, including styles and scripts if they were embedded. I can't target them with BeautifulSoup at that point because the tags are no longer there. I did try page = BeautifulSoup(page.content, 'html.parser') with find_all('<script>') to get rid of the scripts, but I ended up erasing the entire file. The desired outcome would be all the text of the HTML without any...

body {
    font: something;
}

or any JavaScript...

$(document).ready(function(){
    $some code
});

The final file should have no code whatsoever, just the text content of the document.
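A minimal sketch of that goal, assuming BeautifulSoup: remove the script and style elements while the tag tree still exists, and only then extract the text. Note that the tag name is passed as 'script', without angle brackets:

```python
from bs4 import BeautifulSoup

# A toy page with embedded CSS and JS, standing in for a fetched page.
html = """<html><head><style>body { font: something; }</style>
<script>$(document).ready(function(){});</script></head>
<body><p>Visible content.</p></body></html>"""

soup = BeautifulSoup(html, 'html.parser')
# find_all takes the bare tag name ('script'), not '<script>'.
# soup([...]) is shorthand for soup.find_all([...]).
for tag in soup(['script', 'style']):
    tag.decompose()  # remove the element and everything inside it
clean_page = soup.get_text(separator='\n', strip=True)
```

After the decompose() loop, get_text() returns only the visible text, with the CSS rule and the jQuery snippet gone.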

redeemefy

1 Answer


I used this code to get rid of JavaScript and CSS while scraping an HTML page:

import requests
from bs4 import BeautifulSoup

url = 'https://corporate.walmart.com/our-story/our-business'
r = requests.get(url)
html_doc = r.text

soup = BeautifulSoup(html_doc, 'html.parser')
title = soup.title.string

# Remove <script> and <style> elements from the tree before extracting text
for script in soup(["script", "style"]):
    script.decompose()

with open('output_file.txt', "a") as text_file:
    text_file.write("\nURL : " + url)
    text_file.write("\nTitle : " + title)

    for p_tag_data in soup.find_all('p'):
        text_file.write("\n" + p_tag_data.text)

    for li_tag_data in soup.find_all('li'):
        text_file.write("\n" + li_tag_data.text)

    for div_tag_data in soup.find_all('div'):
        text_file.write("\n" + div_tag_data.text)
B. Kanani