
I've been trying (unsuccessfully) to solve this problem for a few hours and need some help. I used Firebug to extract a couple hundred lines of HTML that look like this:

<option value="1b4f4aed-cf1f-4b39-ae27">Foo</option>
<option value="1a05f93f-dd51-449d-b039">Bar</option>
<option value="f62d2d29-29fc-4f7c-9331">Bacon</option>

I saved the lines to a text file. What I want is a script (Python preferred, with Ruby as an alternative) that opens, processes, and closes the file. The processing should save a new text file that looks like this:

Foo
Bar
Bacon

That's it. Thanks in advance for your help.

chrisco
  • Can you post what you've tried so far? – Joel Cornett Jan 21 '13 at 23:22
  • I'm such a beginner... as far as I got was Googling and searching StackOverflow for things like "parse HTML", "parse HTML with Python", "extract options from dropdown list", etc. I found a bunch of interesting stuff (BeautifulSoup, Scrapy, YouTube videos, etc.) and wrote up some pseudocode, but I'm kind of in that "lost" stage. Tired and going to bed now. I'm sorry if I am posting too soon in my struggle. To give you an idea of my level, I'm halfway through a beginner's book on Python. Thanks. – chrisco Jan 21 '13 at 23:32

1 Answer


Per your comment above, I would suggest BeautifulSoup with anything HTML related. Since you are early in your learning stage, probably best to associate 'HTML' with 'BeautifulSoup' (and not regex :) ). Here is a very basic example:

In [1]: from bs4 import BeautifulSoup

In [2]: html = """
<option value="1b4f4aed-cf1f-4b39-ae27">Foo</option>
<option value="1a05f93f-dd51-449d-b039">Bar</option>
<option value="f62d2d29-29fc-4f7c-9331">Bacon</option>
"""

In [3]: soup = BeautifulSoup(html, 'html.parser')

In [4]: for option in soup.find_all('option'):
   ...:     print(option.text)
   ...:     
Foo
Bar
Bacon

Here we pass our HTML to BeautifulSoup and assign the result to the soup variable. We now have an object that holds the parsed HTML and offers a large number of methods for interacting with it in a user-friendly way. We use the find_all method (documentation here) to find every option tag in the HTML. When we iterate over the result, we are iterating through Tag objects, which have their own properties and methods. We pick one of them (.text) to display the text of each Tag, which in this case is the text enclosed in the tag.
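To get from this interactive example to the file-in, file-out script the question asks for, something along these lines might work. It is only a sketch: the function names extract_option_text and convert_file are made up for illustration, the explicit 'html.parser' argument and UTF-8 encoding are assumptions, and it is written for Python 3.

```python
from bs4 import BeautifulSoup


def extract_option_text(html):
    """Return the text of every <option> tag found in an HTML string."""
    # 'html.parser' is Python's built-in parser, so no extra install is needed.
    soup = BeautifulSoup(html, "html.parser")
    return [option.get_text(strip=True) for option in soup.find_all("option")]


def convert_file(in_path, out_path):
    """Read HTML from in_path and write one option label per line to out_path."""
    # The 'with' blocks close both files automatically, even if an error occurs.
    with open(in_path, encoding="utf-8") as infile:
        names = extract_option_text(infile.read())
    with open(out_path, "w", encoding="utf-8") as outfile:
        outfile.write("".join(name + "\n" for name in names))
    return names
```

Calling convert_file("options.txt", "names.txt") would then read the saved HTML and write the labels, where both file names are placeholders for your own paths.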

RocketDonkey
  • +1 for the part about not using regular expressions to parse HTML :) – Joel Cornett Jan 22 '13 at 02:57
  • @JoelCornett Ha, I wonder how many people that post has 'saved' (I'm one, for sure :) ). – RocketDonkey Jan 22 '13 at 03:02
  • Thank you very much @RocketDonkey! That solved my problem and gave me a learning boost. As an exercise (or exercises) I will extend what you posted to work with files, data structures, and iterators. From there, I will do some BeautifulSoup tutorials and documentation reading. Seems really powerful and usable. Anyway, thanks again :) – chrisco Jan 22 '13 at 14:14
  • @chrisco No problem at all, happy to help :) One thing that will be interesting for you is to get a basic knowledge of BeautifulSoup down, continue your studies into data structures, etc., and then come back to BeautifulSoup after you have a solid foundation. You'll then be able to dig into what is actually going on ('What is a `Tag`, actually? How could I create my own `Tag`?') and open up a whole new level of understanding. You're in for a good time :) – RocketDonkey Jan 22 '13 at 14:57