Extract Options From Dropdown List Extracted From Website

Question

I've been trying (unsuccessfully) to solve this problem for a few hours and need some help. I used Firebug to extract a couple hundred lines of HTML that look like this:

<option value="1b4f4aed-cf1f-4b39-ae27">Foo</option>
<option value="1a05f93f-dd51-449d-b039">Bar</option>
<option value="f62d2d29-29fc-4f7c-9331">Bacon</option>

I saved the lines to a text file. What I want is a script (Python preferred, with Ruby as an alternative) to open, process, and close the file. The processing should result in a new text file being saved that looks like this:

Foo
Bar
Bacon

That's it. Thanks in advance for your help.

I'm such a beginner... the furthest I got was Googling and searching StackOverflow for things like "parse HTML", "parse HTML with Python", "extract options from dropdown list", etc. I found a bunch of interesting stuff (BeautifulSoup, Scrapy, YouTube videos, etc.) and wrote up some pseudocode, but I'm kind of in that "lost" stage. Tired and going to bed now. I'm sorry if I am posting too soon in my struggle. To give you an idea of my level, I'm halfway through a new beginner's book on Python. Thanks. – chrisco, Jan 21 '13 at 23:32

score 2 · Accepted Answer · edited May 23 '17 at 12:21

Per your comment above, I would suggest BeautifulSoup with anything HTML related. Since you are early in your learning stage, probably best to associate 'HTML' with 'BeautifulSoup' (and not regex :) ). Here is a very basic example:

In [1]: from bs4 import BeautifulSoup

In [2]: html = """
<option value="1b4f4aed-cf1f-4b39-ae27">Foo</option>
<option value="1a05f93f-dd51-449d-b039">Bar</option>
<option value="f62d2d29-29fc-4f7c-9331">Bacon</option>
"""

In [3]: soup = BeautifulSoup(html, 'html.parser')

In [4]: for option in soup.find_all('option'):
   ...:     print(option.text)
   ...:     
Foo
Bar
Bacon

Here we pass our HTML to BeautifulSoup and assign the result to the soup variable. Now we have an object that contains our HTML and a large number of methods for interacting with it in a user-friendly way. We use the find_all method to find all option tags in our HTML. When we iterate over the result, we are iterating through Tag objects, which have their own special properties and methods. We pick one of them (.text) to display the text of each Tag element (which in this case is the text enclosed in the tag).
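The question asks for a file in and a file out, so the same idea can be sketched as a small script. This is only one possible way to do it: the function names extract_option_text and convert_file and the file names in the usage note are made up for illustration, and the built-in 'html.parser' backend is an assumption (any parser BeautifulSoup supports should work).

```python
from bs4 import BeautifulSoup


def extract_option_text(html):
    """Return the text of every <option> tag found in an HTML string."""
    # 'html.parser' is Python's built-in parser; no extra install needed.
    soup = BeautifulSoup(html, "html.parser")
    return [option.get_text(strip=True) for option in soup.find_all("option")]


def convert_file(in_path, out_path):
    """Read HTML from in_path and write one option label per line to out_path."""
    # The with-blocks close both files automatically, even on errors.
    with open(in_path, encoding="utf-8") as src:
        labels = extract_option_text(src.read())
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(labels) + "\n")
    return labels
```

Calling convert_file("options.html", "options.txt"), with those placeholder names swapped for the real paths, should leave a text file holding one label per line.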


answered Jan 21 '13 at 23:36

RocketDonkey


+1 for the part about not using regular expressions to parse HTML :) – Joel Cornett Jan 22 '13 at 02:57
@JoelCornett Ha, I wonder how many people that post has 'saved' (I'm one, for sure :) ). – RocketDonkey Jan 22 '13 at 03:02
Thank you very much @RocketDonkey! That solved my problem and gave me a learning boost. As an exercise (or exercises) I will extend what you posted to work with files, data structures, and iterators. From there, I will do some BeautifulSoup tutorials and documentation reading. Seems really powerful and usable. Anyway, thanks again :) – chrisco Jan 22 '13 at 14:14
@chrisco No problem at all, happy to help :) One thing that will be interesting for you is to get a basic knowledge of BeautifulSoup down, continue your studies into data structures, etc., and then come back to BeautifulSoup after you have a solid foundation. You'll then be able to dig into what is actually going on ('What is a `Tag`, actually? How could I create my own `Tag`?') and open up a whole new level of understanding. You're in for a good time :) – RocketDonkey Jan 22 '13 at 14:57
