Regular expression same pattern only applies to 1 result

Question

I want to find several tags in a webpage using a regular expression. They all have the same pattern, data-tag-slug="NAME", like this (only a small section):

...category="rating" data-tag-id="40482" data-tag-name="safe" data-tag-slug="safe"><a cla...
...category="" data-tag-id="42350" data-tag-name="solo" data-tag-slug="solo"><a cla...

I wrote tagName = r'.*data-tag-slug="(\w+)".*' and called re.findall(tagName, html), yet I get only one result: the very last item that fits the pattern. How can I get all of them?

P.S. By "the very last item", I mean that several tags fit the pattern, but the code returns only the one that appears last in the HTML.

@G_M I know the existence of beautifulsoup, but re is generally faster when dealing with simple match, and I am going to deal with a million webpages, so I don't really want to use beautifulsoup — Amarth Gûl, Oct 03 '18 at 01:17
Your regex matches the entire string if it matches at all. Get rid of the `.*`s at both ends. — jasonharper, Oct 03 '18 at 01:19
@G_M No, I was told by several people or books that re is complex but fast while beautifulsoup is easy but slower. And I believe it is wiser to choose based on what I am going to do instead of simply seeking some " More Advanced" way — Amarth Gûl, Oct 03 '18 at 01:28
@AmarthGûl Faster than `lxml` which is written in C? https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser — G_M, Oct 03 '18 at 01:29
@G_M I didn't see any sentences in your link that points out directly bs is faster than re. And since our conversation is becoming salty, I'll be honest and ask: Don't you feel rude if somebody says "Go eat meat" when others are asking "How to cook bread?" — Amarth Gûl, Oct 03 '18 at 01:35
@G_M I see you are the kind of "Highly educated" people who only speak with "Evidence", so which one I am supposed to listen? `You can't parse [X]HTML with regex.` or `You totally can parse context-free grammars with regex`? Do you think you are cool judging me without knowing what I am doing? Must I use your Majesty's Majestic Suggestion when I know what I'm facing is totally safe for regex? — Amarth Gûl, Oct 03 '18 at 01:44
@G_M open up your eye your Majesty. Please see the entire post instead of only the one answer that fits you. — Amarth Gûl, Oct 03 '18 at 01:48
@AmarthGûl https://blog.codinghorror.com/parsing-html-the-cthulhu-way/ — G_M, Oct 03 '18 at 01:50

score 1 · Accepted Answer · answered Oct 03 '18 at 01:20

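Just drop the greedy `.*` from both ends of your regex. With them, a single match swallows the whole line, and because the leading `.*` is greedy, the capture group lands on the last occurrence in that line. Without them, `re.findall` returns every occurrence: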

import re
txt = """category="rating" data-tag-id="40482" data-tag-name="safe" data-tag-slug="safe">category="" data-tag-id="42350" data-tag-name="solo" data-tag-slug="solo">"""
out = re.findall(r'data-tag-slug="(\w+)"', txt)
print(out)
#> ['safe', 'solo']

Created on 2018-10-02 by the reprexpy package

import reprexpy
print(reprexpy.SessionInfo())
#> Session info --------------------------------------------------------------------
#> Platform: Darwin-17.7.0-x86_64-i386-64bit (64-bit)
#> Python: 3.6
#> Date: 2018-10-02
#> Packages ------------------------------------------------------------------------
#> reprexpy==0.1.1
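
To make the difference concrete, here is a minimal sketch that runs the pattern from the question and the fixed pattern on the same input. The variable names (`html`, `greedy`, `fixed`) are only illustrative, and the sample is assumed to sit on a single line, as in the snippet from the question.

```python
import re

# Sample from the question, joined onto a single line (an assumption about
# how the real page is laid out).
html = (
    'category="rating" data-tag-id="40482" data-tag-name="safe" data-tag-slug="safe">'
    'category="" data-tag-id="42350" data-tag-name="solo" data-tag-slug="solo">'
)

# Pattern from the question: the leading greedy .* runs to the end of the line
# and backtracks only as far as the LAST data-tag-slug, so one match covers
# the whole line and nothing is left for findall to scan.
greedy = re.findall(r'.*data-tag-slug="(\w+)".*', html)

# Fixed pattern: each match is short, and findall resumes right after it.
fixed = re.findall(r'data-tag-slug="(\w+)"', html)

print(greedy)
#> ['solo']
print(fixed)
#> ['safe', 'solo']
```

Since `.` does not match a newline by default, on multi-line HTML the original pattern would give the last slug of each line rather than of the whole page. Also note that `\w+` stops at a hyphen, so if slugs can contain hyphens, a capture group such as `([^"]+)` may be a safer choice.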
