Questions tagged [beautifulsoup]

Beautiful Soup is a Python package for parsing HTML/XML. The latest version of this package is version 4, imported as bs4.

Beautiful Soup is a Python library for parsing HTML and XML files, which is useful in web scraping. It can use Python's standard HTML parser as well as other parsers such as lxml or html5lib. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
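
As a minimal sketch of the navigating, searching, and modifying described above (the sample markup and variable names are made up for illustration, and html.parser is chosen only because it ships with Python):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<html><body>
  <h1 id="title">Example page</h1>
  <p class="intro">First paragraph with a <a href="/docs">link</a>.</p>
  <p>Second paragraph.</p>
</body></html>
"""

# "html.parser" is the parser from Python's standard library;
# "lxml" or "html5lib" can be named instead if they are installed.
soup = BeautifulSoup(html, "html.parser")

# Navigating: attribute access returns the first matching tag.
heading = soup.h1.get_text()

# Searching: find_all() returns every matching tag.
paragraphs = soup.find_all("p")

# Modifying: change a tag's text and attributes in place.
soup.h1.string = "Renamed page"
soup.a["href"] = "/documentation"

print(heading)
print(len(paragraphs))
print(soup.a["href"])
```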

Beautiful Soup 4 (commonly known as bs4, after the name of its Python module) is the latest version of Beautiful Soup, and is mostly backwards-compatible with Beautiful Soup 3. Beautiful Soup is published under the MIT License.

From version 4.7.0, Beautiful Soup supports a wide range of CSS4 selectors, adding to its already rich collection of tools for selecting HTML/XML elements. The supported CSS selectors and pseudo-classes are described in the documentation of the soupsieve library, which bs4 uses for selector matching.
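
A short sketch of CSS selection through select() and select_one(), assuming bs4 4.7.0 or later with soupsieve installed. The markup and the example.com URL are hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<ul id="menu">
  <li class="item"><a href="https://example.com/a">A</a></li>
  <li class="item skip"><a href="/b">B</a></li>
  <li class="item">No link here</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list of all matches;
# select_one() returns the first match or None.

# Attribute selector: links whose href starts with "https://".
absolute = [a["href"] for a in soup.select('#menu a[href^="https://"]')]

# :not() pseudo-class: items that do not carry the "skip" class.
not_skipped = [li.get_text(strip=True) for li in soup.select("li.item:not(.skip)")]

# :has() pseudo-class: items that have an <a> as a direct child.
with_link = soup.select("li:has(> a)")

# :nth-of-type(): the second <li> among its siblings.
second = soup.select_one("li:nth-of-type(2)")
```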

To install the latest version with pip, run pip install beautifulsoup4. The library is then imported in a project like this: from bs4 import BeautifulSoup
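
One quick way to check that the installation worked is to import the package and parse a trivial document. This is only a sketch; the sample string is made up:

```python
# Run after: pip install beautifulsoup4
import bs4
from bs4 import BeautifulSoup

# The installed version is exposed as a string.
print(bs4.__version__)

# Parse a tiny document with the standard-library parser.
soup = BeautifulSoup("<p>Hello, soup</p>", "html.parser")
text = soup.p.get_text()
print(text)
```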

Notice: Beautiful Soup 3 works only on Python 2.x. Beautiful Soup 4 supported both Python 2 (2.7+) and Python 3 up to release 4.9.3; later releases support Python 3 only.

32305 questions
1488 votes, 34 answers

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)

I'm having problems dealing with unicode characters from text fetched from different web pages (on different sites). I am using BeautifulSoup. The problem is that the error is not always reproducible; it sometimes works with some pages, and…
Homunculus Reticulli
646 votes, 19 answers

How to find elements by class

I'm having trouble parsing HTML elements with "class" attribute using Beautifulsoup. The code looks like this soup = BeautifulSoup(sdata) mydivs = soup.findAll('div') for div in mydivs: if (div["class"] == "stylelistrow"): print div I…
Neo
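
A hedged sketch of one common approach to this question: pass the class to find_all() through the class_ keyword. In bs4 the class attribute is multi-valued, so div["class"] is a list rather than a string, and indexing a div that has no class attribute raises KeyError. The variable names sdata and mydivs come from the excerpt; the markup is made up:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the asker's sdata.
sdata = """
<div class="stylelistrow">row 1</div>
<div>no class at all</div>
<div class="stylelistrow highlighted">row 2</div>
"""
soup = BeautifulSoup(sdata, "html.parser")

# "class" is a reserved word in Python, so the keyword argument is class_.
# A tag with several classes matches if any one of them equals the value.
mydivs = soup.find_all("div", class_="stylelistrow")

rows = [div.get_text() for div in mydivs]
print(rows)
```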
511 votes, 12 answers

UnicodeEncodeError: 'charmap' codec can't encode characters

I'm trying to scrape a website, but it gives me an error. I'm using the following code: import urllib.request from bs4 import BeautifulSoup get = urllib.request.urlopen("https://www.website.com/") html = get.read() soup = BeautifulSoup(html) And…
SstrykerR
424 votes, 21 answers

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

... soup = BeautifulSoup(html, "lxml") File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 152, in __init__ % ",".join(features)) bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to…
user3773048
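
The error message itself points at the direct fix, which is to install the parser (for example with pip install lxml). As a sketch of an alternative, assuming a fallback to the standard-library parser is acceptable for the page in question, the exception can be caught. The sample markup is made up:

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<p>Hello</p>"  # hypothetical sample markup

try:
    # Preferred parser; raises FeatureNotFound if lxml is not installed.
    soup = BeautifulSoup(html, "lxml")
except FeatureNotFound:
    # Fall back to the parser bundled with Python.
    soup = BeautifulSoup(html, "html.parser")

text = soup.p.get_text()
print(text)
```

Different parsers can build slightly different trees from malformed markup, so a fallback like this may change results on broken pages.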
356 votes, 16 answers

How to remove \xa0 from string in Python?

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into…
zhuyxn
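
A minimal sketch, written for Python 3 rather than the Python 2.7 mentioned in the question: \xa0 is the non-breaking space that get_text() produces for an &nbsp; entity, and str.replace() can turn it into an ordinary space. The sample markup is made up:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup containing non-breaking spaces.
html = "<p>Price:&nbsp;10&nbsp;USD</p>"
soup = BeautifulSoup(html, "html.parser")

# get_text() keeps each &nbsp; as the character U+00A0 ("\xa0").
text = soup.get_text()

# Replace every non-breaking space with a regular space.
cleaned = text.replace("\xa0", " ")
print(cleaned)
```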
335 votes, 1 answer

BeautifulSoup getting href

I have the following soup: <a href="some_url">next</a> ... From this I want to extract the href, "some_url" I can do it if I only have one tag, but here there are two tags. I can also get the text 'next' but that's not…
dkgirl
274 votes, 26 answers

Scraping: SSL: CERTIFICATE_VERIFY_FAILED error for http://en.wikipedia.org

I'm practicing the code from 'Web Scraping with Python', and I keep having this certificate problem: from urllib.request import urlopen from bs4 import BeautifulSoup import re pages = set() def getLinks(pageUrl): global pages html =…
Catherine4j
226 votes, 5 answers

TypeError: a bytes-like object is required, not 'str' in python and CSV

TypeError: a bytes-like object is required, not 'str' I'm getting the above error while executing the below python code to save the HTML table data in a CSV file. How do I get rid of that error? import csv import requests from bs4 import…
ShivaGuntuku
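
The excerpt is cut off before the file handling, so the cause is an assumption here: in Python 3 this error typically appears when the CSV file is opened in binary mode ("wb") and csv.writer then tries to write str. A sketch of the text-mode alternative, with made-up rows and a temporary path:

```python
import csv
import os
import tempfile

# Hypothetical table data; in the question it would come from the HTML table.
rows = [["name", "price"], ["widget", "9.99"]]

path = os.path.join(tempfile.mkdtemp(), "table.csv")

# Python 3: open in text mode ("w", not "wb"). newline="" lets the csv
# module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows(rows)

# Read the file back to confirm what was written.
with open(path, newline="", encoding="utf-8") as f:
    read_back = list(csv.reader(f))
print(read_back)
```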
213 votes, 13 answers

Beautiful Soup and extracting a div and its contents by ID

soup.find("tagName", { "id" : "articlebody" }) Why does this NOT return the <div id="articlebody"> ... </div> tags and stuff in between? It returns nothing. And I know for a fact it exists because I'm staring right at it from…
Tony Stark
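
For reference, a sketch of looking up a div by its id when the element is present in the parsed tree. The markup is made up, and it does not diagnose why the asker's call returned nothing, since find() also returns None whenever the parser did not produce a matching element:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<html><body><div id="articlebody"><p>Body text</p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Keyword form: tag name plus id.
article = soup.find("div", id="articlebody")

# Dictionary form, as in the question, with the real tag name filled in.
same = soup.find("div", {"id": "articlebody"})

# Guard against None before using the result.
inner = article.p.get_text() if article is not None else None
print(inner)
```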
211 votes, 10 answers

Extracting an attribute value with beautifulsoup

I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code: import urllib f = urllib.urlopen("http://58.68.130.147") s = f.read() f.close() from BeautifulSoup import…
Barnabe
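
A sketch of reading an attribute from a specific input tag with the bs4 package (the excerpt imports the older BeautifulSoup 3 module). The form markup and the field name "token" are hypothetical. Because the first parameter of find() is itself called name, the name attribute is matched through attrs:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<form><input type="hidden" name="token" value="abc123"><input name="q"></form>'
soup = BeautifulSoup(html, "html.parser")

# Match the input whose name attribute is "token".
tag = soup.find("input", attrs={"name": "token"})

# Indexing raises KeyError if the attribute is absent.
value = tag["value"]

# .get() returns None instead when the attribute is absent.
missing = soup.find("input", attrs={"name": "q"}).get("value")
print(value, missing)
```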
186 votes, 26 answers

ImportError: No Module Named bs4 (BeautifulSoup)

I'm working in Python and using Flask. When I run my main Python file on my computer, it works perfectly, but when I activate venv and run the Flask Python file in the terminal, it says that my main Python file has "No Module Named bs4." Any…
harryt
183 votes, 16 answers

retrieve links from web page using python and BeautifulSoup

How can I retrieve the links of a webpage and copy the url address of the links using Python?
NepUS
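
One possible sketch: collect every a tag that has an href and resolve it against the page address with urljoin. The markup and the example.com base URL are made up; on a real page the HTML would first be downloaded:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/articles/"  # hypothetical page address

# Hypothetical sample markup standing in for the downloaded page.
html = """
<a href="/about">About</a>
<a href="post-1.html">Post</a>
<a name="anchor-only">No href</a>
<a href="https://other.example.org/">External</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute;
# urljoin turns relative links into absolute URLs.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(links)
```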
182 votes, 7 answers

How to find children of nodes using BeautifulSoup

I know how to find element with particular class like…
tej.tan
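
The excerpt is cut off, so this sketch assumes the goal is to get only the direct children of an element that was found by class. Passing recursive=False to find_all() limits the search to direct children, and .children iterates over them (including whitespace strings). The markup is made up:

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup with one nested list.
html = """
<ul class="menu">
  <li>One<ul><li>Nested</li></ul></li>
  <li>Two</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
menu = soup.find("ul", class_="menu")

# Direct children only.
direct = menu.find_all("li", recursive=False)

# Default search: all descendants, including the nested <li>.
all_items = menu.find_all("li")

# .children also yields whitespace strings, so filter for Tag objects.
child_tags = [child for child in menu.children if isinstance(child, Tag)]
print(len(direct), len(all_items), len(child_tags))
```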
161 votes, 9 answers

Difference between BeautifulSoup and Scrapy crawler?

I want to make a website that shows the comparison between amazon and e-bay product price. Which of these will work better and why? I am somewhat familiar with BeautifulSoup but not so much with Scrapy crawler.
Nishant Bhakta
158 votes, 10 answers

can we use XPath with BeautifulSoup?

I am using BeautifulSoup to scrape an URL and I had the following code, to find the td tag whose class is 'empformbody': import urllib import urllib2 from BeautifulSoup import BeautifulSoup url = …
Shiva Krishna Bavandla