Get all groups from a long line

Question

I have the following string:

aaa<a class="c-item_foot" href="/news/a/">11r11</a></div>bbb<a class="c-item_foot" href="/news/b/">222</a></div>ccgc<a class="c-item_foot" href="/news/c/">3333a333</a></div>ddd<a class="c-item_foot" href="/news/d/">44a444444</a></div>eee

I am trying to extract the following values from this line:

11r11
222
3333a333
44a444444

In other words, I want the text between <a class="c-item_foot" href="/news/*/"> and </a></div>. I'm trying to get it in the following way:

import re

text = open("./string.txt", "r").read()
print(u'\n'.join(re.findall(r"<a class=\"c-item_foot.*>(.*)</a></div>", text)))

But I only get the last group, 44a444444. Can anyone show me a correct example?

Remember not to [parse html with regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — jfaccioni, Oct 29 '19 at 20:28
Why can't you use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)? `soup = BeautifulSoup("your_str", 'html.parser') for link in soup.find_all('a'): print(link.text)` – Danila Ganchar, Oct 29 '19 at 20:28

score 1 · Answer 1 · answered Oct 29 '19 at 20:30

I suggest you use an HTML parsing library such as BeautifulSoup.

from bs4 import BeautifulSoup

html_doc = 'aaa<a class="c-item_foot" href="/news/a/">11r11</a></div>bbb<a class="c-item_foot" href="/news/b/">222</a></div>ccgc<a class="c-item_foot" href="/news/c/">3333a333</a></div>ddd<a class="c-item_foot" href="/news/d/">44a444444</a></div>eee'
soup = BeautifulSoup(html_doc, 'html.parser')
# Collect the text of every <a> tag in the document
values = [tag.text for tag in soup.find_all('a')]
print(values)

kartheek


Depending on rest of html (if there is more...) using class is certainly faster and likely more selective. – QHarr Oct 29 '19 at 21:51

score 0 · Answer 2 · answered Oct 29 '19 at 20:32

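You have the right approach, but you need lazy (non-greedy) quantifiers. A plain .* is greedy: it matches as much as it can, so the first .* swallows everything up to the last link and only the last group is captured. Try this instead: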

<a class=\"c-item_foot.*?>(.*?)<\/a><\/div>

You can play with regex here: https://regex101.com/r/pggVVJ/1
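To make the difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The variable names `greedy` and `lazy` are only illustrative, and the escaping backslashes are dropped because Python does not need them inside a single-quoted raw string.

```python
import re

# Sample input copied from the question.
text = (
    'aaa<a class="c-item_foot" href="/news/a/">11r11</a></div>'
    'bbb<a class="c-item_foot" href="/news/b/">222</a></div>'
    'ccgc<a class="c-item_foot" href="/news/c/">3333a333</a></div>'
    'ddd<a class="c-item_foot" href="/news/d/">44a444444</a></div>eee'
)

# Greedy: the first .* grabs as much as possible, so the whole line
# becomes a single match and only the last link text is captured.
greedy = re.findall(r'<a class="c-item_foot.*>(.*)</a></div>', text)

# Lazy: .*? stops as early as possible, so every link matches on its own.
lazy = re.findall(r'<a class="c-item_foot.*?>(.*?)</a></div>', text)

print(greedy)  # ['44a444444']
print(lazy)    # ['11r11', '222', '3333a333', '44a444444']
```

This should work for a flat string like the one above; for real pages with nested or irregular markup, a parser is likely the safer choice.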

usernamenotfound


score 0 · Answer 3 · answered Oct 29 '19 at 20:49

Python's standard library includes an HTML parser, html.parser, that delivers what you expect in this case.

See the html.parser documentation for details.

from html.parser import HTMLParser

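class MyHTMLParser(HTMLParser):

    def __init__(self):
        super().__init__()
        self.data = []       # collected link texts
        self.a_tag = False   # True right after an <a> start tag

    def handle_starttag(self, tag, attrs):
        # Remember that the next piece of text belongs to a link
        if tag == "a":
            self.a_tag = True

    def handle_data(self, data):
        # Keep only the first text chunk that follows an <a> start tag
        if self.a_tag:
            self.data.append(data)
            self.a_tag = False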

string = """aaa<a class="c-item_foot" href="/news/a/">11r11</a></div>bbb<a class="c-item_foot" href="/news/b/">222</a></div>ccgc<a class="c-item_foot" href="/news/c/">3333a333</a></div>ddd<a class="c-item_foot" href="/news/d/">44a444444</a></div>eee"""
parser = MyHTMLParser()
parser.feed(string)
print(parser.data)

OUTPUT:

['11r11', '222', '3333a333', '44a444444']
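If the page also contains other links, the parser above will collect their text too. As a rough sketch of one way to narrow it down with the standard library only, the variant below keeps just the `<a>` tags whose class is c-item_foot and joins all text found before the closing `</a>`. The class name `FootLinkParser` and its `class_name` parameter are made up for this example, and the extra "skip me" link is added only to show the filtering.

```python
from html.parser import HTMLParser


class FootLinkParser(HTMLParser):
    """Collect the text of <a> tags whose class list contains class_name."""

    def __init__(self, class_name="c-item_foot"):
        super().__init__()
        self.class_name = class_name
        self.values = []      # finished link texts
        self._chunks = None   # text pieces of the link being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; value may be None
            classes = (dict(attrs).get("class") or "").split()
            if self.class_name in classes:
                self._chunks = []

    def handle_data(self, data):
        # Only record text while inside a matching link
        if self._chunks is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._chunks is not None:
            self.values.append("".join(self._chunks))
            self._chunks = None


html_doc = (
    'aaa<a class="c-item_foot" href="/news/a/">11r11</a></div>'
    'bbb<a class="c-item_foot" href="/news/b/">222</a></div>'
    'ccgc<a class="c-item_foot" href="/news/c/">3333a333</a></div>'
    'ddd<a class="c-item_foot" href="/news/d/">44a444444</a></div>eee'
    '<a href="/other/">skip me</a>'  # extra link without the class
)

foot_parser = FootLinkParser()
foot_parser.feed(html_doc)
foot_parser.close()
print(foot_parser.values)  # ['11r11', '222', '3333a333', '44a444444']
```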
