0

Currently, my code is parsing through the link and printing all of the information from the website. I only want to print a single specific line from the website. How can I go about doing that?

Here's my code:

from bs4 import BeautifulSoup
import urllib.request

r = urllib.request.urlopen("Link goes here").read()
soup = BeautifulSoup(r, "html.parser")

# This is what I want to change. I currently have it printing everything.
# I just want a specific line from the website

print (soup.prettify())
Harrison
  • 5,095
  • 7
  • 40
  • 60

2 Answers2

4
li = soup.prettify().split('\n')
print str(li[line_number-1])
kulan
  • 191
  • 8
3

Don't use pretty print to try and parse tds, select the tag specifically, if the attribute is unique then use that, if the class name is unique then just use that:

td = soup.select_one("td.content")
td = soup.select_one("td[colspan=3]")

If it was the fourth td:

td = soup.select_one("td:nth-of-type(4)")

If it is in a specific table, then select the table and then find the td in the table, trying to split the html into lines and indexing is actually worse than using a regex to parse html.

You can get the specific td using the text from the bold tag preceding the td i.e Department of Finance Building Classification::

In [19]: from bs4 import BeautifulSoup

In [20]: import urllib.request

In [21]: url = "http://a810-bisweb.nyc.gov/bisweb/PropertyProfileOverviewServlet?boro=1&houseno=1&street=park+ave&go2=+GO+&requestid=0"

In [22]: r = urllib.request.urlopen(url).read()

In [23]: soup = BeautifulSoup(r, "html.parser")

In [24]: print(soup.find("b",text="Department of Finance Building Classification:").find_next("td").text)
O6-OFFICE BUILDINGS

Pick the nth table and row:

In [25]: print(soup.select_one("table:nth-of-type(8) tr:nth-of-type(5) td[colspan=3]").text)
O6-OFFICE BUILDINGS
Community
  • 1
  • 1
Padraic Cunningham
  • 176,452
  • 29
  • 245
  • 321