Scraping HTML forms with regex

Question

I have a form like this:

<form id="search" method="get" action="search.php">
      <input type="text" name="query" value="Search"/>
      <input type="submit" value="Submit">
</form>

And I want the values in this order: method, action, names

["get", "search.php", ["query"]]

I don't know how to do this with regex, especially since the form is a multiline string. I am also very new to regex.

You wouldn't do it with regex. Why would you want to do it with regex? [Just don't do it](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). — Daniel Roseman, Mar 01 '15 at 15:06
In my opinion, the best way is to use an `xml` parsing module — Vivek Sable, Mar 01 '15 at 15:11
I would have a read of http://stackoverflow.com/a/1732454/1319998 before trying to parse HTML with regex :-) — Michal Charemza, Mar 01 '15 at 15:38

Mazdak · Answer 1 · 2015-03-01T15:28:33.087

3

The proper way to parse an HTML or XML document is with an HTML (or XML) parser such as BeautifulSoup or lxml. If you still want to use regex, which is not recommended, you can use re.findall as follows:

>>> import re
>>> # s holds the form markup from the question
>>> [i for j in re.findall(r'method="([^ >"]*)"|action="([^ >"]*)"|name="([^ >"]*)"', s) for i in j if i]
['get', 'search.php', 'query']

[^ >"]* matches a run of characters that contains no space, no > and no double quote.
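The one-liner above returns a flat list, while the question asks for the input names in a nested list. A minimal sketch of one way to get that shape with regex alone follows; `attr` and `form_summary` are hypothetical helper names, and the code assumes a single form whose attributes are all double-quoted. A parser remains the safer choice for real pages.

```python
import re

# Form markup copied from the question.
s = '''<form id="search" method="get" action="search.php">
      <input type="text" name="query" value="Search"/>
      <input type="submit" value="Submit">
</form>'''

def attr(name, text):
    """Return the first double-quoted value of attribute `name` in `text`, or None."""
    m = re.search(r'\b%s="([^"]*)"' % re.escape(name), text)
    return m.group(1) if m else None

def form_summary(html):
    """Hypothetical helper: summarise the first <form> as [method, action, [names]]."""
    # re.DOTALL lets `.` cross newlines, which is what a multiline form needs.
    form = re.search(r'<form\b([^>]*)>(.*?)</form>', html, re.DOTALL | re.IGNORECASE)
    if form is None:
        return None
    open_tag, body = form.group(1), form.group(2)
    # [^>]* cannot cross a '>', so each match stays inside one <input> tag.
    names = re.findall(r'<input\b[^>]*\bname="([^"]*)"', body, re.IGNORECASE)
    return [attr('method', open_tag), attr('action', open_tag), names]

result = form_summary(s)
print(result)  # ['get', 'search.php', ['query']]
```

Splitting the work into "find the form tag" and "find the inputs inside it" keeps each pattern short, but it still breaks on single-quoted or unquoted attributes and on nested markup.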

edited Mar 01 '15 at 15:28

answered Mar 01 '15 at 15:11

Mazdak


You should remove the double quotes from your character classes to avoid a lot of backtracking and possible false results. – Casimir et Hippolyte Mar 01 '15 at 15:20
@CasimiretHippolyte Yes, but then it won't give the OP's expected result! – Mazdak Mar 01 '15 at 15:22
I removed the double quotes in the pattern. And stripped/removed the double quotes after the matching. – Mar 01 '15 at 15:24
@Emyen Yep, that would be a better idea! – Mazdak Mar 01 '15 at 15:25
No, you didn't understand: `method="([^ >"]*)"` is what I mean. – Casimir et Hippolyte Mar 01 '15 at 15:26
@CasimiretHippolyte Ah, yes, sorry! Thanks for the reminder and the explanation, I hadn't noticed that! – Mazdak Mar 01 '15 at 15:29

score 1 · Answer 2 · edited May 23 '17 at 12:22

I agree with Michal Charemza's comment: read the post linked there before trying to parse HTML with regex.

Here is an example using lxml, a very powerful tool for parsing and analyzing HTML.

from lxml.html import fromstring

html = fromstring("""<form id="search" method="get" action="search.php">
                     <input type="text" name="query" value="Search"/>
                     <input type="submit" value="Submit">
                     </form> """)
form = html.forms[0]  # select the first form in the HTML page

# Extract the data from the form, in the order the question asks for.
# Note: lxml reports form.method in upper case ("GET").
print(form.method, form.action, form.inputs.keys())

Enjoy,

Abdul

score 0 · Answer 3 · edited May 10 '21 at 09:38

You could use the BeautifulSoup library.

>>> from bs4 import BeautifulSoup
>>> s = '''<form id="search" method="get" action="search.php">
...       <input type="text" name="query" value="Search"/>
...       <input type="submit" value="Submit">
... </form> '''
>>> soup = BeautifulSoup(s, 'html.parser')  # name the parser explicitly to avoid a warning
>>> k = []
>>> for i in soup.find_all('form'):
...     k.append(i['method'])
...     k.append(i['action'])
...     # keep only the inputs that actually carry a name attribute
...     k.append([j['name'] for j in i.find_all('input', attrs={'name': True})])
...
>>> k
['get', 'search.php', ['query']]
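If installing a third-party package is not an option, the same idea can be sketched with the standard library's `html.parser` module instead of BeautifulSoup. This is a swapped-in technique, not part of the answer above; the class name `FormCollector` is hypothetical, and inputs outside any form are simply ignored.

```python
from html.parser import HTMLParser

class FormCollector(HTMLParser):
    """Collect [method, action, [input names]] for every <form> encountered."""

    def __init__(self):
        super().__init__()
        self.forms = []        # one entry per form, in document order
        self._in_form = False  # True while between <form> and </form>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)    # attrs arrives as a list of (name, value) pairs
        if tag == 'form':
            self._in_form = True
            self.forms.append([attrs.get('method'), attrs.get('action'), []])
        elif tag == 'input' and self._in_form and attrs.get('name'):
            self.forms[-1][2].append(attrs['name'])

    def handle_endtag(self, tag):
        if tag == 'form':
            self._in_form = False

# Form markup copied from the question.
s = '''<form id="search" method="get" action="search.php">
      <input type="text" name="query" value="Search"/>
      <input type="submit" value="Submit">
</form>'''

collector = FormCollector()
collector.feed(s)
collector.close()
forms = collector.forms
print(forms)  # [['get', 'search.php', ['query']]]
```

Self-closing tags such as `<input ... />` are routed through `handle_starttag` by the default `handle_startendtag`, so both inputs in the example are seen and only the named one is kept.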

edited May 10 '21 at 09:38

Winand


answered Mar 01 '15 at 15:27

Avinash Raj


Why even use `re` here? Just add the name argument to the list as you already are, no need to regex out the name from the element converted to a string... – Jon Clements Mar 01 '15 at 15:45
