3

I have the following string:

aaa<a class="c-item_foot" href="/news/a/">11r11</a></div>bbb<a class="c-item_foot" href="/news/b/">222</a></div>ccgc<a class="c-item_foot" href="/news/c/">3333a333</a></div>ddd<a class="c-item_foot" href="/news/d/">44a444444</a></div>eee

I am trying to extract the following values from this string:

  • 11r11
  • 222
  • 3333a333
  • 44a444444

In other words, I want the text between each <a class="c-item_foot" href="/news/*/"> and the following </a></div>. This is what I tried:

import re

text = open("./string.txt", "r").read()
print(u'\n'.join(re.findall(r"<a class=\"c-item_foot.*>(.*)</a></div>", text)))

But I only get the last value, 44a444444. Can anyone show me a correct example?

KarlsD

3 Answers

1

I suggest you use an HTML parsing library such as BeautifulSoup.

from bs4 import BeautifulSoup

html_doc = 'aaa<a class="c-item_foot" href="/news/a/">11r11</a></div>bbb<a class="c-item_foot" href="/news/b/">222</a></div>ccgc<a class="c-item_foot" href="/news/c/">3333a333</a></div>ddd<a class="c-item_foot" href="/news/d/">44a444444</a></div>eee'
soup = BeautifulSoup(html_doc, 'html.parser')
# Collect the text of every <a> tag in the document
values = [tag.text for tag in soup.find_all('a')]
kartheek
  • Depending on the rest of the HTML (if there is more...), selecting by class is certainly faster and likely more selective. – QHarr Oct 29 '19 at 21:51
0

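You have the right approach, but you need lazy (non-greedy) quantifiers. A plain .* is greedy: it matches as much as it can, so the first .* in your pattern runs all the way to the last opening tag and you get a single match containing only the last value. Adding ? after each quantifier makes it stop at the first possible match. Try this instead: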

<a class=\"c-item_foot.*?>(.*?)<\/a><\/div>

You can play with regex here: https://regex101.com/r/pggVVJ/1
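For reference, here is a minimal Python sketch comparing the greedy pattern from the question with the lazy one. The string is pasted inline instead of being read from string.txt, and the variable names are only illustrative. The backslashes before / and " are dropped because Python's re module does not need them there.

```python
import re

# The sample string from the question, inlined so the snippet is self-contained.
text = (
    'aaa<a class="c-item_foot" href="/news/a/">11r11</a></div>'
    'bbb<a class="c-item_foot" href="/news/b/">222</a></div>'
    'ccgc<a class="c-item_foot" href="/news/c/">3333a333</a></div>'
    'ddd<a class="c-item_foot" href="/news/d/">44a444444</a></div>eee'
)

# Greedy: the first .* consumes everything up to the last opening tag.
greedy_matches = re.findall(r'<a class="c-item_foot.*>(.*)</a></div>', text)

# Lazy: .*? stops at the earliest position where the rest of the pattern matches.
lazy_matches = re.findall(r'<a class="c-item_foot.*?>(.*?)</a></div>', text)

print(greedy_matches)  # ['44a444444']
print(lazy_matches)    # ['11r11', '222', '3333a333', '44a444444']
```

This works for a fixed, predictable snippet like the one in the question. If the markup can vary (attribute order, nested tags, line breaks), a real HTML parser is likely to be more robust than a regex.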

usernamenotfound
0

Python has an HTML parser that delivers what you expect in this case.

See the html.parser module in the Python standard library documentation.



from html.parser import HTMLParser

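class MyHTMLParser(HTMLParser):

    def __init__(self):
        super().__init__()
        self.data = []       # text collected from <a> tags
        self.a_tag = False   # True while waiting for the text of an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_tag = True

    def handle_data(self, data):
        # Keep only the first text chunk that follows an opening <a> tag
        if self.a_tag:
            self.data.append(data)
            self.a_tag = False

    def handle_endtag(self, tag):
        # Reset the flag so an empty <a></a> does not capture unrelated text
        if tag == "a":
            self.a_tag = False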

string = """aaa<a class="c-item_foot" href="/news/a/">11r11</a></div>bbb<a class="c-item_foot" href="/news/b/">222</a></div>ccgc<a class="c-item_foot" href="/news/c/">3333a333</a></div>ddd<a class="c-item_foot" href="/news/d/">44a444444</a></div>eee"""
parser = MyHTMLParser()
parser.feed(string)
print(parser.data)

OUTPUT:

['11r11', '222', '3333a333', '44a444444']
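If the real page contains other links as well, the parser can be narrowed to the class used in the question. The following is only a sketch along those lines: the class name FootLinkParser, its target_class parameter, and the extra "other" link appended to the sample string are all made up for illustration.

```python
from html.parser import HTMLParser


class FootLinkParser(HTMLParser):
    """Collect the text of <a> tags whose class attribute includes target_class."""

    def __init__(self, target_class="c-item_foot"):
        super().__init__()
        self.target_class = target_class
        self.values = []      # one entry per matching <a> tag
        self._inside = False  # True while inside a matching <a> tag
        self._chunks = []     # text pieces seen inside the current tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; value can be None
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self._inside = True
                self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.values.append("".join(self._chunks))
            self._inside = False


# Sample string from the question, plus one hypothetical link with a different
# class (not in the original) to show that it is skipped.
html = (
    'aaa<a class="c-item_foot" href="/news/a/">11r11</a></div>'
    'bbb<a class="c-item_foot" href="/news/b/">222</a></div>'
    'ccgc<a class="c-item_foot" href="/news/c/">3333a333</a></div>'
    'ddd<a class="c-item_foot" href="/news/d/">44a444444</a></div>eee'
    '<a class="other" href="/x/">skip me</a>'
)

parser = FootLinkParser()
parser.feed(html)
parser.close()
print(parser.values)  # ['11r11', '222', '3333a333', '44a444444']
```

Joining the chunks on the closing tag, instead of keeping only the first chunk, means link text that the parser delivers in several pieces is still returned as one value.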
dmmfll