I am new to Python and fairly new to programming in general. I'm trying to work out a script that uses BeautifulSoup to parse https://www.state.nj.us/mvc/ for any text that's red. The table I'm looking at is relatively simple HTML:

<html>
 <body>
  <div class="alert alert-warning alert-dismissable" role="alert">
   <div class="table-responsive">
    <table class="table table-sm" align="center" cellpadding="0" cellspacing="0">
     <tbody>
      <tr>
       <td width="24%">
        <strong>
         <font color="red">Bakers Basin</font>
        </strong>
       </td>
       <td width="24%">
        <strong>Oakland</strong>
       </td>
 ...
 ...
 ...
      </tr>
     </tbody>
    </table>
   </div>
  </div>
 </body>
</html>

From the above I want to find Bakers Basin, but not Oakland, for example.

Here's the Python I've written (adapted from Cory Althoff, The Self-Taught Programmer, 2017, Triangle Connection LLC):

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        soup = BeautifulSoup(html, parser)
        tabledmv = soup.find_all("font color=\"red\"")
        for tag in tabledmv:
            print("\n" + tabledmv.get_text())


website = "https://www.state.nj.us/mvc/"
Scraper(website).scrape()

I'm missing something here, though, because I can't get this to scrape the table and return anything useful. The end goal is to add the time module, run this every X minutes, and have it log a message somewhere whenever a site goes red. (This is all so my wife can figure out the least crowded DMV to go to in New Jersey!)

Any help or guidance is much appreciated on getting the BeautifulSoup bit working.

Rotes328
  • Try this: `soup.find_all('font[color="red"]')` instead. See: [_**MDN - Attribute selectors**_](https://developer.mozilla.org/en-US/docs/Web/CSS/Attribute_selectors) – Mr. Polywhirl Oct 05 '20 at 17:04
  • @Mr.Polywhirl The syntax is `soup.select()` for CSS Selectors – MendelG Oct 05 '20 at 17:15
  • I found my problem - seems to be that this page has an element that hides the rest of the HTML. I'll have to work on figuring out how to get rid of this. – Rotes328 Oct 05 '20 at 17:54
  • [Try Selenium](https://stackoverflow.com/a/58773630/1762224). Looks like you are waiting for content to load dynamically. – Mr. Polywhirl Oct 05 '20 at 18:17
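
The difference between `find_all()` and `select()` raised in the comments above can be checked offline against the markup from the question. This is a minimal sketch, assuming `bs4` is installed with CSS selector support (the `soupsieve` package); `SAMPLE_HTML` is a trimmed copy of the question's table, not the live page, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table markup shown in the question (not the live page).
SAMPLE_HTML = """
<table>
 <tr>
  <td><strong><font color="red">Bakers Basin</font></strong></td>
  <td><strong>Oakland</strong></td>
 </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() treats a string argument as a tag name, not a selector,
# so no tag is literally named 'font color="red"' and nothing matches.
as_tag_name = soup.find_all('font color="red"')

# find_all() can filter on an attribute when it is passed as a keyword argument.
by_attribute = [tag.get_text() for tag in soup.find_all("font", color="red")]

# select() is the method that accepts CSS selectors such as font[color="red"].
by_selector = [tag.get_text() for tag in soup.select('font[color="red"]')]

print(len(as_tag_name), by_attribute, by_selector)
```

Both working forms pick up Bakers Basin and skip Oakland, because only the first cell wraps its text in a red `font` tag.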

2 Answers

The table is actually loaded from a different page: https://www.state.nj.us/mvc/locations/agency.htm.

To get only the text that's red, you can use the CSS attribute selector font[color="red"] with soup.select(), following the suggestion from @Mr. Polywhirl in the comments:

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

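    def scrape(self):
        # Download the page and parse it with Python's built-in HTML parser.
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        soup = BeautifulSoup(html, parser)
        # Match every <font color="red"> tag; [1:] skips the first match,
        # which is the text "RED" rather than a location name.
        tabledmv = soup.select('font[color="red"]')[1:]
        for tag in tabledmv:
            print(tag.get_text())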


website = "https://www.state.nj.us/mvc/locations/agency.htm"
Scraper(website).scrape()
MendelG
  • Thank you, this was exactly my issue. Just one question: what does [1:] do in the select call `tabledmv = soup.select('font[color="red"]')[1:]`? – Rotes328 Oct 05 '20 at 18:35
  • If you remove `[1:]` from the code, you will see that the text `RED` is printed as well. Since we don't want that, we use [list slicing](https://stackoverflow.com/questions/509211/understanding-slice-notation) to skip the first match. – MendelG Oct 05 '20 at 18:41
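
The slicing described in the comment above can be tried on a plain list. This is a small sketch: `matches` is a hypothetical stand-in for the text of each tag returned by `soup.select(...)`, with "RED" first as the comment reports, followed by two town names taken from the thread.

```python
# Hypothetical stand-in for the text of each matched <font color="red"> tag.
matches = ["RED", "Bakers Basin", "Cherry Hill"]

# [1:] keeps everything from index 1 onward, dropping only the first item.
without_first = matches[1:]

# Slicing returns a new list; the original is left unchanged.
print(without_first)
print(matches)
```

Slicing an empty list with `[1:]` also returns an empty list rather than raising an error, so this still works if the page has no matches at all.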

The data is loaded from another location, in this case 'https://www.state.nj.us/mvc/locations/agency.htm'. To get each town along with its column header (Licensing Centers or Vehicle Centers), you can use this example:

import requests 
from bs4 import BeautifulSoup


url = 'https://www.state.nj.us/mvc/locations/agency.htm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

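# Select every table cell that contains a <font> tag.
for t in soup.select('td:has(font)'):
    # Position of this cell among the cells of its row.
    i = t.find_previous('tr').select('td').index(t)
    # Cells in the first two columns are labeled Licensing Centers, the rest Vehicle Centers.
    if i < 2:
        print('{:<20} {}'.format(' '.join(t.text.split()), 'Licensing Centers'))
    else:
        print('{:<20} {}'.format(' '.join(t.text.split()), 'Vehicle Centers'))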

Prints:

Bakers Basin         Licensing Centers
Cherry Hill          Vehicle Centers
Springfield          Vehicle Centers
Bayonne              Licensing Centers
Paterson             Licensing Centers
East Orange          Vehicle Centers
Trenton              Vehicle Centers
Rahway               Licensing Centers
Hazlet               Vehicle Centers
Turnersville         Vehicle Centers
Jersey City          Vehicle Centers
Wallington           Vehicle Centers
Delanco              Licensing Centers
Lakewood             Vehicle Centers
Washington           Vehicle Centers
Eatontown            Licensing Centers
Edison               Licensing Centers
Toms River           Licensing Centers
Newton               Vehicle Centers
Freehold             Licensing Centers
Runnemede            Vehicle Centers
Newark               Licensing Centers
S. Brunswick         Vehicle Centers
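
For the end goal stated in the question (run the check every X minutes and log when a site goes red), one possible shape is sketched below. This is only an assumption about how it could be wired up, not part of either answer: `watch` and `fetch_red_names` are hypothetical names, and `fetch_red_names` stands for any callable that returns the currently red town names, for example a wrapper around one of the scrapers in this thread.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


def watch(fetch_red_names, interval_seconds, rounds, sleep=time.sleep):
    """Poll fetch_red_names() `rounds` times and log names that newly turn red.

    fetch_red_names is a hypothetical callable returning an iterable of names.
    Returns the names in the order they were logged.
    """
    seen = set()
    logged = []
    for round_number in range(rounds):
        current = set(fetch_red_names())
        # Only report names that were not red on the previous check.
        for name in sorted(current - seen):
            logging.info("%s went red", name)
            logged.append(name)
        seen = current
        # Wait between checks, but not after the last one.
        if round_number < rounds - 1:
            sleep(interval_seconds)
    return logged
```

Calling `watch(my_fetcher, 600, 6)` would check six times at ten-minute intervals. The `sleep` parameter is there so the loop can be exercised without waiting.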
Andrej Kesely