I need to extract some usernames from a list. I work in Jupyter Notebooks and use Python. I believe that regex would be the way to go, but feel free to recommend a different approach.

The names are surrounded by the same characters. The following characters precede the names:

bold;"> 

and these characters directly follow the names:

</a>

Here is a small fraction of the data that I would like to extract names from:

<a class="model" href="#" style="color:#FF6EC7;font-family:'Verdana';font-weight:bold;">HoneyxLover</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/383/38344261/avatar.300x300.jpg');"></a>,
<a class="model" href="#" style="color:#FF1CAE;font-family:'Comic Sans MS', 'ChalkboardSE-Regular';font-weight:bold;">RubinRosey</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/228/22826417/avatar.300x300.jpg');"></a>,
Wiktor Stribiżew

2 Answers

You can use BeautifulSoup instead of regex:

# Python env: pip install beautifulsoup4
# Anaconda env: conda install beautifulsoup4

from bs4 import BeautifulSoup

html = """<a class="model" href="#" style="color:#FF6EC7;font-family:'Verdana';font-weight:bold;">HoneyxLover</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/383/38344261/avatar.300x300.jpg');"></a>,
<a class="model" href="#" style="color:#FF1CAE;font-family:'Comic Sans MS', 'ChalkboardSE-Regular';font-weight:bold;">RubinRosey</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/228/22826417/avatar.300x300.jpg');"></a>,"""

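# Name the parser explicitly; omitting it makes bs4 guess one and emit a warning.
soup = BeautifulSoup(html, "html.parser")

# [class="model"] matches the attribute value exactly, so the
# "model model_img" avatar links are skipped.
for link in soup.select('a[class="model"]'):
    print(link.text)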

Output:

HoneyxLover
RubinRosey
Corralien

Here is a version that uses the re module (regex):

import re
string = """<a class="model" href="#" style="color:#FF6EC7;font-family:'Verdana';font-weight:bold;">HoneyxLover</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/383/38344261/avatar.300x300.jpg');"></a>,
<a class="model" href="#" style="color:#FF1CAE;font-family:'Comic Sans MS', 'ChalkboardSE-Regular';font-weight:bold;">RubinRosey</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/228/22826417/avatar.300x300.jpg');"></a>,"""
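# Non-greedy (.*?) stops at the first </a>, so this also works when several links share a line.
print(re.findall(r'bold;">(.*?)</a>', string))
# ['HoneyxLover', 'RubinRosey']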
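
Whether the quantifier is greedy matters as soon as several links share a line. By default `.` does not match a newline, so a greedy `.*` stays within one line of the sample, but on single-line input it runs on to the last `</a>`. The sketch below illustrates this with a hypothetical one-line variant of the data (the `one_line` string is made up for the example, with the style attributes shortened):

```python
import re

# Hypothetical single-line variant of the sample data (styles shortened).
one_line = (
    '<a class="model" style="font-weight:bold;">HoneyxLover</a>, '
    '<a class="model model_img"></a>, '
    '<a class="model" style="font-weight:bold;">RubinRosey</a>,'
)

# Greedy: .* consumes as much as possible, then backtracks to the LAST </a>.
greedy = re.findall(r'bold;">(.*)</a>', one_line)

# Non-greedy: .*? stops at the FIRST </a> after each match start.
lazy = re.findall(r'bold;">(.*?)</a>', one_line)

# Negated character class: a name cannot contain "<", so it cannot cross a tag.
negated = re.findall(r'bold;">([^<]+)</a>', one_line)

print(len(greedy))  # 1 (one long match spanning both names)
print(lazy)         # ['HoneyxLover', 'RubinRosey']
print(negated)      # ['HoneyxLover', 'RubinRosey']
```

The negated class `[^<]+` is often the more robust choice here, since it states directly that a name cannot contain a tag.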

And here is a version that uses the standard library's xml.etree.ElementTree module:

import xml.etree.ElementTree as ET

string = """<a class="model" href="#" style="color:#FF6EC7;font-family:'Verdana';font-weight:bold;">HoneyxLover</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/383/38344261/avatar.300x300.jpg');"></a>,
<a class="model" href="#" style="color:#FF1CAE;font-family:'Comic Sans MS', 'ChalkboardSE-Regular';font-weight:bold;">RubinRosey</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/228/22826417/avatar.300x300.jpg');"></a>,"""

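# Each comma-separated fragment must be well-formed XML for ET.fromstring to parse it.
split_list = string.strip(',').split(',\n')
for i in split_list:
    element = ET.fromstring(i.strip())
    # .get() avoids a KeyError if a link has no class attribute
    if element.get('class') == 'model':
        print(element.text)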
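
If the real data is not well-formed XML (for example unclosed tags or a bare `&` in a URL), `ET.fromstring` raises a `ParseError`. One possible alternative, offered only as a sketch, is the standard library's `html.parser.HTMLParser`, which tolerates messy markup and needs no splitting on commas. The class name `ModelNameParser` and the shortened sample string are made up for illustration:

```python
from html.parser import HTMLParser


class ModelNameParser(HTMLParser):
    """Collects the text of <a> tags whose class attribute is exactly "model"."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_model_link = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a" and dict(attrs).get("class") == "model":
            self._in_model_link = True

    def handle_data(self, data):
        # Only keep non-empty text found inside a matching link
        if self._in_model_link and data.strip():
            self.names.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_model_link = False


# Shortened version of the sample data (style attributes trimmed for readability).
sample = """<a class="model" href="#" style="font-weight:bold;">HoneyxLover</a>,
  <a class="model model_img" href="#" style="background-image:none;"></a>,
<a class="model" href="#" style="font-weight:bold;">RubinRosey</a>,
  <a class="model model_img" href="#" style="background-image:none;"></a>,"""

parser = ModelNameParser()
parser.feed(sample)
parser.close()
names = parser.names
print(names)  # ['HoneyxLover', 'RubinRosey']
```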
Suneesh Jacob