Multiple-line output from a single-line match in Python

Question

I'm still incredibly new at Python, but am trying to write code that will parse the weather from NOAA and display it in the order of our radio broadcast.

I've managed to put together a current conditions list using a regular expression: the HTML file gets chopped up into a list of lines and is then re-output in the proper order, but each of those was a single line of data. That code looked like this:

#other function downloads  
#http://www.arh.noaa.gov/wmofcst_pf.php?wmo=ASAK48PAFC&type=public
#and renames it currents.html
from bs4 import BeautifulSoup as bs
import re
soup = bs(open('currents.html'))
weatherRaw = soup.pre.string
towns = ['PAOM', 'PAUN', 'PAGM', 'PASA']
townOut = []
weatherLines = weatherRaw.splitlines()
for i in range(len(towns)):
    p = re.compile(towns[i] + '.*')
    for line in weatherLines:
        matched = p.match(line)
        if matched:
            townOut.append(matched.group())

Now that I'm working on the forecast portion, I'm running into a problem, since each forecast necessarily runs over multiple lines, and I've chopped the file into a list of lines.

So: what I'm looking for is an expression that will allow me to use a similar loop, this time starting the append at the line found and ending it at a line containing just &&. Something like this:

#sample data from http://www.arh.noaa.gov/wmofcst.php?wmo=FPAK52PAFG&type=public
#BeautifulSouped into list fcst (forecast.pre.get_text().splitlines())
zones = ['AKZ214', 'AKZ215', 'AKZ213'] #note the out-of-numerical-order zones
weatherFull = []
for i in range(len(zones)):
    start = re.compile(zones[i] + '.*')
    end = re.compile('&&')
    for line in fcst:
        matched = start.match(line)
        if matched:
            weatherFull.append(matched.group())
            #and the other lines of various contents and length
            #until reaching the end match object

What should I do to improve this code? I know it's very verbose, but while I'm starting out, I like to be able to track what I'm doing. Thanks in advance!

score 0 · Accepted Answer · answered Oct 29 '12 at 01:41

Apologies if this isn't quite what you were after (in that case, happy to adjust). Awesome that you are using BeautifulSoup, but you can actually take it one step further. Looking at the HTML, it appears that each block starts with an <a name=zone> tag and ends at the next <a name=zone>. That being the case, you can do something like this to pull the corresponding HTML for each zone:

from bs4 import BeautifulSoup

# I put the HTML in a file, but this will work with a URL as well
with open('weather.html', 'r') as f:
  fcst = f.read()

# Turn the html into a navigable soup object
soup = BeautifulSoup(fcst, 'html.parser')

# Define your zones
zones = ['AKZ214', 'AKZ215', 'AKZ213']

weatherFull = []

# This is a more Pythonic loop structure - instead of looping over
# a range of len(zones), simply iterate over each element itself
for zone in zones:
  # Here we use BS's built-in 'find' function to find the 'a' element
  # with a name = the zone in question (as this is the pattern).
  zone_node = soup.find('a', {'name': zone})

  # This loop will continue to cycle through the elements after the 'a'
  # tag until it hits another 'a' or runs out of siblings
  # (this is highly structure dependent :) )
  while zone_node is not None:
    weatherFull.append(zone_node)
    # Set the tag node = to the next node
    zone_node = zone_node.next_sibling
    # If the next node's tag name = 'a', break out and go to the next zone
    if getattr(zone_node, 'name', None) == 'a':
      break

# Process weatherFull however you like
print(weatherFull)
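
If you'd rather stick with the line-based approach from the question (start appending at the zone line, stop at a line containing just &&), a minimal sketch might look like the following. The helper name collect_blocks and the sample lines are made up for illustration; the real list would come from forecast.pre.get_text().splitlines(), and I'm assuming each zone line starts with the zone code.

```python
def collect_blocks(lines, zones, end_marker='&&'):
    """Map each zone to its lines, from the zone line up to (not including) the end marker."""
    blocks = {}
    for zone in zones:
        block = None  # None means the zone line has not been seen yet
        for line in lines:
            if block is None:
                # Still looking for the line that starts this zone's forecast
                if line.startswith(zone):
                    block = [line]
            elif line.strip() == end_marker:
                # A line containing just the end marker closes the block
                break
            else:
                block.append(line)
        if block is not None:
            blocks[zone] = block
    return blocks


# Hypothetical sample data standing in for the real forecast text
fcst = [
    'AKZ213-291500-',
    'FIRST EXAMPLE ZONE',
    '.TONIGHT...CLOUDY.',
    '&&',
    'AKZ214-291500-',
    'SECOND EXAMPLE ZONE',
    '.TONIGHT...SNOW.',
    '&&',
]

zones = ['AKZ214', 'AKZ213']  # broadcast order, not numerical order
weatherFull = collect_blocks(fcst, zones)

for zone in zones:
    print('\n'.join(weatherFull[zone]))
```

Looping over zones (rather than over the file) keeps the output in broadcast order, at the cost of rescanning the lines once per zone, which should be fine for a file this size.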

Hope this helps (or is at least somewhere in the ballpark of what you wanted!).

That's exactly what I was looking for - I was upset when I couldn't use BeautifulSoup this way on the first set (since there were no tags in that html set). I can't believe I forgot to check if it was tagged this time through! Thank you for your help. :) — Raveler1, Oct 29 '12 at 01:51
@Raveler1 No worries at all! There is a pretty funny post about parsing HTML with regex which you may have seen (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags, just in case) - that definitely pushed me away from regex and into things like BS :) Also, for being incredibly new at Python your code looks great! — RocketDonkey, Oct 29 '12 at 01:54
Thanks for the compliment; I've been hacking away at this for a few days now, and I'm a fan of clean, readable code. So many of my searches revealed a lack of readable code... I'm rather glad I'm a radio broadcaster and not a coder. It's nice to have toolsets for problems, though! I haven't worked with a coding language since BASIC, C++ and a tiny bit of Java in college. It's cool how the languages have their quirks, but generally run off similar processes. That post on parsing HTML is amazing. I can't tell whether it's good or bad that I find it funny, though... ;-) — Raveler1, Oct 29 '12 at 02:18
@Raveler1 You'll love Python then :) The fact that it requires specific indentation makes for a much more readable language in my opinion. And that post makes me laugh every time, especially the more you start to deal with HTML and get queasy thinking about using something other than a true parser :) — RocketDonkey, Oct 29 '12 at 02:22
