Python re.findall

Question

I'm trying to retrieve all the tags containing a 'name' attribute, and then work with the whole tag plus the name. This is the test code I have:

import re

sourceCode = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'
namesGroup = re.findall('<.*name="(.*?)".*>', sourceCode, re.IGNORECASE | re.DOTALL)

for name in namesGroup:
    print(name)

Which outputs:

two

And the output I am looking for would be:

['<dirtfields name="one" value="stuff">', 'one']
['<gibberish name="two"\nwewt>', 'two']

EDIT: Found a way to do it, thanks to doublesharp for the cleaner way to get the 'name' value.

namesGroup = re.findall(r'(<.*?name="([^"]*)".*?>)', sourceCode, re.IGNORECASE | re.DOTALL)

Which will output:

('<dirtfields name="one" value="stuff">', 'one')
('<gibberish name="two"\nwewt>', 'two')
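One caveat with the pattern in the edit: .*? can still run across a closing > when an earlier tag has no name attribute, so the first group may contain more than one tag. Below is a minimal sketch of a tighter variant, assuming attribute values never contain >. The [^>]* pieces, the names tag_with_name and tricky, and the second sample string are illustrative additions, not part of the original question.

```python
import re

source_code = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'

# [^>] cannot cross a tag boundary, and it matches newlines,
# so re.DOTALL is not needed here.
tag_with_name = re.compile(r'<[^>]*?\bname="([^"]*)"[^>]*>', re.IGNORECASE)

# group(0) is the whole tag, group(1) is the value of name.
pairs = [(m.group(0), m.group(1)) for m in tag_with_name.finditer(source_code)]
for pair in pairs:
    print(pair)

# Hypothetical input where the first tag has no name attribute.
tricky = '<a foo="1"><b name="x">'
lazy_dot = re.findall(r'(<.*?name="([^"]*)".*?>)', tricky, re.IGNORECASE | re.DOTALL)
print(lazy_dot)                       # the "tag" spans both <a ...> and <b ...>
print(tag_with_name.findall(tricky))  # only the name found in <b ...>
```

Using finditer instead of findall also gives access to match positions via m.span() if they are needed later.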

doublesharp · Accepted Answer · 2013-11-16T15:54:12.413

4

Your regex is a bit off - you are matching too much (all the way to the last >). Since you just need the values between the double quotes after name=, use the following pattern:

name="([^"]*)"

name=" matches the first part of the attribute you are looking for
([^"]*) creates a grouped match based on any characters that are not a double quote
" matches the double quote after the name attribute value.

And your code would look like this (it's good form to put an r before your pattern so that it is a raw string):

namesGroup = re.findall(r'name="([^"]*)"', sourceCode, re.IGNORECASE)
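To see the difference side by side, here is a small sketch that runs the original greedy pattern and the negated-character-class pattern on the sample string from the question. The variable names greedy and names are only for illustration.

```python
import re

source_code = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'

# Greedy: with re.DOTALL the leading .* runs to the end of the input and
# backtracks, so the whole string is one match and only the last name is kept.
greedy = re.findall(r'<.*name="(.*?)".*>', source_code, re.IGNORECASE | re.DOTALL)

# Negated character class: each match stops at the next double quote.
names = re.findall(r'name="([^"]*)"', source_code, re.IGNORECASE)

print(greedy)  # ['two']
print(names)   # ['one', 'two']
```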

edited Nov 16 '13 at 15:54

answered Nov 16 '13 at 15:38

doublesharp

Thanks a lot doublesharp. That's a cleaner way to get it =) – Neomind Nov 16 '13 at 15:47
`re.DOTALL` is useless here. – Casimir et Hippolyte Nov 16 '13 at 15:50
@CasimiretHippolyte Very true... ahh cut and paste. Removed it for clarity. – doublesharp Nov 16 '13 at 15:54
@Neomind If this did the trick for you it would be appreciated if you marked it as the answer, thanks! – doublesharp Nov 16 '13 at 15:56
It wasn't exactly the answer I was looking for. The one is at the "edit" I did but anyway made me learn more about it... so thanks! – Neomind Nov 16 '13 at 16:49

B.Mr.W. · Answer 2 · 2013-11-16T16:08:18.877

Clearly you are dealing with an HTML or XML file and looking for the values of a specific attribute.

You are heading in the wrong direction if you keep working with regular expressions instead of a proper parser.

BeautifulSoup4 is the one I like the most; here is a very brief example of how to use it:

from bs4 import BeautifulSoup

sourceCode = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'

soup = BeautifulSoup(sourceCode)
print(soup.prettify())
print('------------------------')
for tag in soup.find_all():
    # has_attr() replaces the deprecated has_key()
    if tag.has_attr('name'):
        print(tag, tag['name'])

The output looks a bit ugly (and is not even quite right: the unclosed tags end up nested, so the first tag prints with the second one inside it), but it shows how BeautifulSoup automatically repairs your broken HTML and makes it easy to locate the attribute you want.

<html>
 <body>
  <dirtfields name="one" value="stuff">
   <gibberish name="two" wewt="">
   </gibberish>
  </dirtfields>
 </body>
</html>
------------------------
<dirtfields name="one" value="stuff">
<gibberish name="two" wewt=""></gibberish></dirtfields> one
<gibberish name="two" wewt=""></gibberish> two
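If installing a third-party package is not an option, the same "use a parser" idea can be sketched with the standard library. Note that this is html.parser, not BeautifulSoup, and the class name NameCollector is made up for this example. It does not repair the markup. It only reports each start tag as written, together with its name attribute.

```python
from html.parser import HTMLParser


class NameCollector(HTMLParser):
    """Collect (raw start tag, name value) for every tag with a name attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # attrs is a list of (name, value) pairs
        if 'name' in attributes:
            # get_starttag_text() returns the start tag exactly as it appeared
            self.found.append((self.get_starttag_text(), attributes['name']))


source_code = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'

collector = NameCollector()
collector.feed(source_code)
collector.close()

for raw_tag, name in collector.found:
    print([raw_tag, name])
```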

Add BeautifulSoup to your favorite Stack Overflow tags and you will be surprised how good it is and how many people are doing the same thing as you with a more powerful tool!

Moreover, you [cannot parse HTML with regular expressions](http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) because [HTML isn't a regular language](http://howiprovedit.com/archives/44)! There's even an entire [SO Question on this](http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not). — Livius, Nov 16 '13 at 17:54

score 2 · Answer 3 · answered Nov 16 '13 at 15:52

(?<=name=")[^"]*

If you wanted to match only the name without having a capture group, you could use:

re.findall(r'(?<=name=")[^"]*', sourceCode, re.IGNORECASE)

Output: ['one', 'two']

Of course capture groups are an equally acceptable solution.
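One practical difference between the two approaches: Python's re module only accepts fixed-width look-behinds, so the look-behind version cannot tolerate optional whitespace around the = sign, while a capture group can. A short sketch follows; the whitespace-tolerant patterns and the sample string 'a name = "x"' are assumptions added for illustration.

```python
import re

source_code = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'

# Fixed-width look-behind: accepted.
fixed = re.findall(r'(?<=name=")[^"]*', source_code, re.IGNORECASE)
print(fixed)  # ['one', 'two']

# Variable-width look-behind (optional whitespace): rejected when compiling.
try:
    re.compile(r'(?<=name\s*=\s*")[^"]*')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)  # False

# A capture group has no such restriction.
spaced = re.findall(r'name\s*=\s*"([^"]*)"', 'a name = "x"', re.IGNORECASE)
print(spaced)  # ['x']
```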

OGHaza

Casimir et Hippolyte · Answer 4 · 2013-11-16T16:11:01.567

Here is a pattern that allows escaped quotes inside the value and avoids lazy quantifiers (for performance reasons). That is why it is a bit long, but more robust:

myreg = re.compile(r"""
    < (?: [^n>]+ | \Bn | n(?!ame\s*=) )+   # beginning of the tag
                                           # until the name attribute
    name \s* = \s* ["']?                   # attribute until the value
    ( (?: [^\s\\"']+ | \\{2} | \\. )* )    # value
    [^>]*>                                 # end of the tag
""", re.X | re.I | re.S)

namesGroup = myreg.findall(sourceCode)

However, using BS4 is a nice solution for your case.
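As a usage sketch for the verbose pattern in this answer, here it is run against the sample string from the question and against two made-up inputs, one with an escaped quote and one with a space inside the value. The names escaped and spaced and their input strings are assumptions for illustration only.

```python
import re

myreg = re.compile(r"""
    < (?: [^n>]+ | \Bn | n(?!ame\s*=) )+   # beginning of the tag
                                           # until the name attribute
    name \s* = \s* ["']?                   # attribute until the value
    ( (?: [^\s\\"']+ | \\{2} | \\. )* )    # value
    [^>]*>                                 # end of the tag
""", re.X | re.I | re.S)

source_code = '<dirtfields name="one" value="stuff">\n<gibberish name="two"\nwewt>'
names = myreg.findall(source_code)
print(names)  # ['one', 'two']

# Hypothetical input: a backslash-escaped quote stays inside the value.
escaped = myreg.findall(r'<x name="a\"b" v="1">')
print(escaped)

# Hypothetical input: the value group excludes whitespace, so it stops at the space.
spaced = myreg.findall('<x name="a b">')
print(spaced)  # ['a']
```

Note that the pattern has a single capture group, so findall returns only the names. To get the whole tag as well, finditer with m.group(0) would be one option.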
