
Assume there is a link "http://www.someHTMLPageWithTwoForms.com" that points to an HTML page containing two forms (say Form 1 and Form 2). I have code like this:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

h = httplib2.Http('.cache')
response, content = h.request('http://www.someHTMLPageWithTwoForms.com')
for field in BeautifulSoup(content, parseOnlyThese=SoupStrainer('input')):
    if field.has_key('name'):
        print field['name']

This prints the names of all fields in both Form 1 and Form 2 of my HTML page. Is there a way to get only the field names that belong to a particular form (say, Form 2 only)?

mdeous
Bhavani Kannan

4 Answers

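If the page has only two forms, you can pick the second one by index and then look at its input fields:

from BeautifulSoup import BeautifulSoup

forms = BeautifulSoup(content).findAll('form')
# forms[1] is the second form; findAll('input') searches all of its descendants,
# so text nodes and wrapper elements inside the form do not get in the way
for field in forms[1].findAll('input'):
    if field.has_key('name'):
        print field['name']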

If you need a form other than the second one, you can select it more specifically, for example by its id or class attribute:

from BeautifulSoup import BeautifulSoup

# 'yourFormId' is a placeholder for the id of the form you want
forms = BeautifulSoup(content).findAll('form', attrs={'id': 'yourFormId'})
for field in forms[0].findAll('input'):
    if field.has_key('name'):
        print field['name']
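
The snippets above target Python 2 and BeautifulSoup 3. As a rough, dependency-free sketch of the same idea for Python 3, the standard library's html.parser can group field names by the form they appear in. This swaps in a different parser from the one used in the answer; the class name FormFieldCollector, the SAMPLE markup, and the form ids are made up for illustration.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the names of form controls, grouped by enclosing <form>."""

    CONTROLS = {"input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per <form>, in document order
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"id": attrs.get("id"), "fields": []}
            self.forms.append(self._current)
        elif tag in self.CONTROLS and self._current is not None:
            # Only record controls that actually carry a name attribute
            if attrs.get("name"):
                self._current["fields"].append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


# Hypothetical page with two forms (stands in for the downloaded content)
SAMPLE = """
<form id="form1"><input name="a" type="text"></form>
<form id="form2">
  <p><input name="b" type="text"><input type="submit"></p>
  <select name="c"><option>x</option></select>
</form>
"""

collector = FormFieldCollector()
collector.feed(SAMPLE)

# Select by position (second form) or by id, as in the answer above
second_form_fields = collector.forms[1]["fields"]
fields_by_id = {form["id"]: form["fields"] for form in collector.forms}
print(second_form_fields)
```

Selecting by index is fragile if the page layout changes, so looking the form up by id (fields_by_id["form2"] here) is usually the safer choice when the form has one.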
Anas

If you have the lxml and cssselect Python packages installed, you can pass the HTML of a single form to a helper like this:

from lxml import html
def parse_form(form):
    tree = html.fromstring(form)
    data = {}
    for e in tree.cssselect('form input'):
        if e.get('name'):
            data[e.get('name')] = e.get('value')
    return data
Rusty

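If you know the attribute name and value, you can search by them. Note that the name keyword argument matches the tag name, not an attribute called "name", so pass the attribute through attrs instead:

from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)

# name= is compared against tag names, and there is no <Alice> tag
xmlSoup.findAll(name="Alice")
# []

# attrs= is compared against attributes, so this finds the <parent> tag
xmlSoup.findAll(attrs={"name": "Alice"})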
CivFan
Kracekumar

This kind of parsing is also quite easy with lxml (which I personally prefer over BeautifulSoup because of its XPath support). For example, the following snippet prints the names of all fields (if they have one) that belong to forms named "form2":

# you can ignore this part, it's only here for the demo
from StringIO import StringIO
HTML = StringIO("""
<html>
<body>
    <form name="form1" action="/foo">
        <input name="uselessInput" type="text" />
    </form>
    <form name="form2" action="/bar">
        <input name="firstInput" type="text" />
        <input name="secondInput" type="text" />
    </form>
</body>
</html>
""")

# here goes the useful code
import lxml.html
tree = lxml.html.parse(HTML) # you can pass parse() a file-like object or a URL
root = tree.getroot()
for form in root.xpath('//form[@name="form2"]'):
    for field in form.getchildren():
        if 'name' in field.keys():
            print field.get('name')
mdeous
    This is not so good, it only looks at immediate children of the form element and does not check whether they are form inputs (other elements may also have name attributes). – janek37 Jun 21 '17 at 11:35
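
One way to address the concern in the comment is to walk every descendant of the form and keep only real form controls. The sketch below illustrates that with the standard library's xml.etree.ElementTree under Python 3, which only works because the demo markup happens to be well-formed XML; it is not a replacement for lxml on real-world HTML. The helper name field_names, the FORM_CONTROLS set, and the extra wrapper div in the sample are assumptions added for the demonstration.

```python
import xml.etree.ElementTree as ET

# Demo markup, assumed to be well-formed XHTML (ElementTree cannot parse tag soup)
XHTML = """
<html>
<body>
    <form name="form1" action="/foo">
        <input name="uselessInput" type="text" />
    </form>
    <form name="form2" action="/bar">
        <div name="wrapper">
            <input name="firstInput" type="text" />
        </div>
        <textarea name="comment"></textarea>
        <input type="submit" />
    </form>
</body>
</html>
"""

# Tags treated as form fields in this sketch
FORM_CONTROLS = {"input", "select", "textarea", "button"}


def field_names(root, form_name):
    """Return the names of all form controls nested anywhere in the named form."""
    names = []
    for form in root.findall(".//form[@name='%s']" % form_name):
        # iter() visits the form and all of its descendants, not just direct children
        for element in form.iter():
            if element.tag in FORM_CONTROLS and element.get("name"):
                names.append(element.get("name"))
    return names


root = ET.fromstring(XHTML)
print(field_names(root, "form2"))
```

The div that carries a name attribute is skipped because it is not a form control, and the nested input is still found because the search is not limited to immediate children.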