1

https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue

I am trying to find the names of the companies in order of revenue. It's a bit challenging because the titles all have differently formatted tags. If anyone could come up with a solution I'd be very grateful.

An example of my problem:

I'd like to match "Wal-Mart Stores, Inc.", then "Sinopec Group", and so forth, in order.

<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td>

...further in the document...

<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>

Thanks in advance.

2 Answers

0

Capture the content of the title attribute of the a tags in a group. The pattern also checks that the link sits in the first table cell after the ranking.

regex = /th>\n<td.*?><a .* ?title="(.*?)".*>/

It works against the page as it currently stands, but it's a fairly brittle method: any change to the page's markup can break it. Check the Online Regex Tester for details of the regex.
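As a rough sketch of how that pattern could be applied from Python, the snippet below runs it with the re module. The html string is a made-up, simplified stand-in for the page's rows (a rank in a th, then the company cell on the next line), not markup copied from Wikipedia, and the names html, pattern and ranked_names are only illustrative.

```python
import re

# Hypothetical, simplified rows: the real page's markup may differ.
html = (
    '<tr>\n'
    '<th>1</th>\n'
    '<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td>\n'
    '<td>other cell</td>\n'
    '</tr>\n'
    '<tr>\n'
    '<th>2</th>\n'
    '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>\n'
    '<td>other cell</td>\n'
    '</tr>\n'
)

# Same pattern as above, written as a Python raw string.
# "th>\n" anchors the match to the cell that directly follows the rank cell.
pattern = re.compile(r'th>\n<td.*?><a .* ?title="(.*?)".*>')

# With one capturing group, findall returns the captured titles in document order.
ranked_names = pattern.findall(html)
print(ranked_names)
```

Because the pattern depends on the newline between the th and the td, it stops matching if the page ever puts both cells on one line.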

0

This can be done easily with BeautifulSoup:

from bs4 import BeautifulSoup as soup

# Each list item holds one table cell
x = ['<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td>', '<td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>']
# Name the parser explicitly so the result does not depend on which parsers are installed
tmp = [soup(y, 'html.parser').find('td').find('a') for y in x]
# Keep the title of every link that has one
lst = [a['title'].strip() for a in tmp if a is not None and a.has_attr('title')]
print(lst)

If it's a single string, then you can use:

x = '''<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td> <td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'''
# Find every cell, then the first link inside it (None if the cell has no link)
tmp = [y.find('a') for y in soup(x, 'html.parser').find_all('td')]
lst = [a['title'].strip() for a in tmp if a is not None and a.has_attr('title')]
print(lst)

If you still want to use a regex, then:

<td.*?<a.*? title\s*=\s*"([^"]+).*?</td> 

NOTE: the match is in the first capturing group.

Regex Demo
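For illustration, here is a minimal sketch of using that regex from Python with the re module, run on the same single-string sample as above. The names cell_pattern and titles are just placeholders for this example.

```python
import re

x = '''<td><a href="/wiki/Wal-Mart_Stores,_Inc." title="Wal-Mart Stores, Inc."class="mw-redirect">Wal-Mart Stores, Inc.</a></td> <td style="background: #ffffcc;"><a href="/wiki/Sinopec_Group" title="Sinopec Group" class="mw-redirect">Sinopec Group</a></td>'''

# The regex from above as a raw string; group 1 is the title attribute's value.
cell_pattern = re.compile(r'<td.*?<a.*? title\s*=\s*"([^"]+).*?</td>')

# findall returns the contents of group 1 for each match, in document order.
titles = cell_pattern.findall(x)
print(titles)
```

Note that "." does not match a newline by default, so a cell whose markup spans several lines would need re.DOTALL (and would then deserve extra care, since the lazy quantifiers could run across cells).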

rock321987