
I want to find several tags in a webpage using a regular expression. They all follow the same pattern, data-tag-slug="NAME", like this (only a small section is shown):

...category="rating" data-tag-id="40482" data-tag-name="safe" data-tag-slug="safe"><a cla...
...category="" data-tag-id="42350" data-tag-name="solo" data-tag-slug="solo"><a cla...

I wrote tagName = r'.*data-tag-slug="(\w+)".*' and called re.findall(tagName, html), yet I get only one result: the very last item that fits the pattern. How can I get all of them?

P.S. By "the very last item", I mean that several tags fit the pattern, but the code returns only the one that appears last in the HTML.

Amarth Gûl
  • https://www.crummy.com/software/BeautifulSoup/bs4/doc/ – G_M Oct 03 '18 at 01:13
  • @G_M I know the existence of beautifulsoup, but re is generally faster when dealing with simple match, and I am going to deal with a million webpages, so I don't really want to use beautifulsoup – Amarth Gûl Oct 03 '18 at 01:17
  • Sounds like you've got it all figured out. – G_M Oct 03 '18 at 01:19
  • Your regex matches the entire string if it matches at all. Get rid of the `.*`s at both ends. – jasonharper Oct 03 '18 at 01:19
  • @jasonharper Didn't think it'll be so simple... thank you – Amarth Gûl Oct 03 '18 at 01:21
  • @G_M No, I was told by several people or books that re is complex but fast while beautifulsoup is easy but slower. And I believe it is wiser to choose based on what I am going to do instead of simply seeking some " More Advanced" way – Amarth Gûl Oct 03 '18 at 01:28
  • @AmarthGûl Faster than `lxml` which is written in C? https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser – G_M Oct 03 '18 at 01:29
  • @G_M I didn't see any sentences in your link that points out directly bs is faster than re. And since our conversation is becoming salty, I'll be honest and ask: Don't you feel rude if somebody says "Go eat meat" when others are asking "How to cook bread?" – Amarth Gûl Oct 03 '18 at 01:35
  • @AmarthGûl https://stackoverflow.com/a/1732454/8079103 – G_M Oct 03 '18 at 01:39
  • @G_M I see you are the kind of "Highly educated" people who only speak with "Evidence", so which one I am supposed to listen? `You can't parse [X]HTML with regex.` or `You totally can parse context-free grammars with regex`? Do you think you are cool judging me without knowing what I am doing? Must I use your Majesty's Majestic Suggestion when I know what I'm facing is totally safe for regex? – Amarth Gûl Oct 03 '18 at 01:44
  • @AmarthGûl https://stackoverflow.com/a/590789/8079103 – G_M Oct 03 '18 at 01:45
  • @G_M open up your eye your Majesty. Please see the entire post instead of only the one answer that fits you. – Amarth Gûl Oct 03 '18 at 01:48
  • @AmarthGûl https://blog.codinghorror.com/parsing-html-the-cthulhu-way/ – G_M Oct 03 '18 at 01:50

1 Answer


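Just drop the greedy .* from both ends of your regex. With them in place, a single match swallows the whole string, so findall has nothing left to search: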

import re
txt = """category="rating" data-tag-id="40482" data-tag-name="safe" data-tag-slug="safe">category="" data-tag-id="42350" data-tag-name="solo" data-tag-slug="solo">"""
out = re.findall(r'data-tag-slug="(\w+)"', txt)
print(out)
#> ['safe', 'solo']
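To see why the original pattern returns only the last tag, here is a minimal side-by-side sketch. It assumes the markup sits on a single line, as in the excerpt from the question; the variable names `html`, `greedy`, and `fixed` are only illustrative.

```python
import re

# Sample markup adapted from the question, kept on one line (an assumption).
html = (
    'category="rating" data-tag-id="40482" data-tag-name="safe" data-tag-slug="safe">'
    'category="" data-tag-id="42350" data-tag-name="solo" data-tag-slug="solo">'
)

# Original pattern: the leading .* first grabs the whole line, then backtracks
# only as far as needed, which lands on the LAST data-tag-slug. The trailing .*
# consumes the rest, so there is exactly one match for the entire line.
greedy = re.findall(r'.*data-tag-slug="(\w+)".*', html)

# Fixed pattern: each match covers only one attribute, so findall can continue
# scanning after it and pick up every occurrence.
fixed = re.findall(r'data-tag-slug="(\w+)"', html)

print(greedy)  # ['solo']
print(fixed)   # ['safe', 'solo']
```

Note that \w does not match a hyphen, so a slug such as "foo-bar" would not be captured by this pattern. If such slugs occur in your pages, a character class like [^"]+ may be a better fit for the capture group.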

Created on 2018-10-02 by the reprexpy package

import reprexpy
print(reprexpy.SessionInfo())
#> Session info --------------------------------------------------------------------
#> Platform: Darwin-17.7.0-x86_64-i386-64bit (64-bit)
#> Python: 3.6
#> Date: 2018-10-02
#> Packages ------------------------------------------------------------------------
#> reprexpy==0.1.1
Chris