What regex could I use to extract words that are surrounded by specific characters?

Question

I need to extract some usernames from a list. I work in Jupyter Notebooks and use Python. I believe that regex would be the way to go, but feel free to recommend a different approach.

The names are surrounded by the same characters. The following characters precede the names:

bold;">

and the following characters come directly after the names:

</a>

Here is a small fraction of the data that I would like to extract names from:

<a class="model" href="#" style="color:#FF6EC7;font-family:'Verdana';font-weight:bold;">HoneyxLover</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/383/38344261/avatar.300x300.jpg');"></a>,
<a class="model" href="#" style="color:#FF1CAE;font-family:'Comic Sans MS', 'ChalkboardSE-Regular';font-weight:bold;">RubinRosey</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/228/22826417/avatar.300x300.jpg');"></a>,

[Have you tried using an HTML parser instead?](https://stackoverflow.com/a/1732454/3001761) — jonrsharpe, Aug 20 '21 at 19:35
The comment of @jonrsharpe is wise. If your source is in HTML it's better to use an HTML parser like BeautifulSoup. — Corralien, Aug 20 '21 at 19:45

score 2 · Answer 1 · answered Aug 20 '21 at 19:42

You can use BeautifulSoup instead of regex:

# Python env: pip install bs4
# Anaconda env: conda install bs4

from bs4 import BeautifulSoup

html = """<a class="model" href="#" style="color:#FF6EC7;font-family:'Verdana';font-weight:bold;">HoneyxLover</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/383/38344261/avatar.300x300.jpg');"></a>,
<a class="model" href="#" style="color:#FF1CAE;font-family:'Comic Sans MS', 'ChalkboardSE-Regular';font-weight:bold;">RubinRosey</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/228/22826417/avatar.300x300.jpg');"></a>,"""

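# Name the parser explicitly; without it bs4 guesses one and emits a warning
soup = BeautifulSoup(html, "html.parser")

# [class="model"] matches only when the class attribute is exactly "model",
# so the empty "model model_img" image links are skipped
for link in soup.select('a[class="model"]'):
    print(link.text)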

Output:

HoneyxLover
RubinRosey
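
If installing bs4 is not an option, much the same result should be possible with the standard library's html.parser module. This is only a minimal sketch, assuming the wanted links have a class attribute that is exactly "model"; the ModelNameParser class and its names attribute are made up for this illustration:

```python
from html.parser import HTMLParser


class ModelNameParser(HTMLParser):
    """Hypothetical helper: collects the text of <a class="model"> links."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_model_link = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; compare the whole class value
        if tag == "a" and dict(attrs).get("class") == "model":
            self._in_model_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_model_link = False

    def handle_data(self, data):
        # Keep only non-blank text found inside a matching link
        if self._in_model_link and data.strip():
            self.names.append(data.strip())


html = """<a class="model" href="#" style="color:#FF6EC7;font-family:'Verdana';font-weight:bold;">HoneyxLover</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/383/38344261/avatar.300x300.jpg');"></a>,
<a class="model" href="#" style="color:#FF1CAE;font-family:'Comic Sans MS', 'ChalkboardSE-Regular';font-weight:bold;">RubinRosey</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/228/22826417/avatar.300x300.jpg');"></a>,"""

parser = ModelNameParser()
parser.feed(html)
parser.close()
print(parser.names)  # ['HoneyxLover', 'RubinRosey']
```

It is more code than the BeautifulSoup version, but it has no third-party dependency and does not rely on the style attribute ending in bold;.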

Suneesh Jacob · Answer 2 · 2021-08-20T20:14:22.847

With the re module (regex):

import re
string = """<a class="model" href="#" style="color:#FF6EC7;font-family:'Verdana';font-weight:bold;">HoneyxLover</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/383/38344261/avatar.300x300.jpg');"></a>,
<a class="model" href="#" style="color:#FF1CAE;font-family:'Comic Sans MS', 'ChalkboardSE-Regular';font-weight:bold;">RubinRosey</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/228/22826417/avatar.300x300.jpg');"></a>,"""
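# Lazy (.*?) stops at the first </a>, so several links on one line are not merged into a single match
print(re.findall(r'bold;">(.*?)</a>', string))  # ['HoneyxLover', 'RubinRosey']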
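
The same idea can be wrapped up for any pair of surrounding strings. A rough sketch, assuming the delimiters should be matched literally; extract_between is a hypothetical helper name and the shortened sample string is only for illustration. re.escape guards against delimiters that contain regex metacharacters, and the lazy quantifier keeps two names on the same line apart:

```python
import re


def extract_between(text, prefix, suffix):
    """Return each shortest non-empty run of text between prefix and suffix."""
    # re.escape makes the delimiters literal; (.+?) is lazy, so it stops at
    # the first suffix after each prefix instead of the last one on the line
    pattern = re.escape(prefix) + r"(.+?)" + re.escape(suffix)
    return re.findall(pattern, text)


# Shortened sample with both links on ONE line, where a greedy (.*) would fail
sample = (
    '<a class="model" href="#" style="font-weight:bold;">HoneyxLover</a>, '
    '<a class="model" href="#" style="font-weight:bold;">RubinRosey</a>,'
)
names = extract_between(sample, 'bold;">', '</a>')
print(names)  # ['HoneyxLover', 'RubinRosey']
```

Note that . does not match a newline by default, so this will not pick up a name that is split across lines.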

With the xml module (this works only because each link in the sample happens to be well-formed XML):

import xml.etree.ElementTree as ET

string = """<a class="model" href="#" style="color:#FF6EC7;font-family:'Verdana';font-weight:bold;">HoneyxLover</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/383/38344261/avatar.300x300.jpg');"></a>,
<a class="model" href="#" style="color:#FF1CAE;font-family:'Comic Sans MS', 'ChalkboardSE-Regular';font-weight:bold;">RubinRosey</a>,
  <a class="model model_img" href="#" style="background-image:url('https://img.mfcimg.com/photos2/228/22826417/avatar.300x300.jpg');"></a>,"""

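# Drop the trailing comma, then split into one <a>...</a> element per entry
split_list = string.strip(',').split(',\n')
for i in split_list:
    element = ET.fromstring(i.strip())
    # .get() returns None instead of raising KeyError if an element has no class
    if element.get('class') == 'model':
        print(element.text)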
