Python Regex: How do I use a regular expression to read a text file and extract only the names from lines that contain two names?

Question

Suppose I only need the lines in a txt file that contain two names such as:

<td >Jacob</td> <td>273,844</td> <td >Emily</td> <td>223,690</td></tr>

And the txt file contains the text below:

<tr >

     <th style="text-align:right; background-color:white; color:black" scope="col">Rank</th>

     <th style="text-align:right; background-color:white; color:black" scope="col" abbr="male name">Name</th>

     <th style="text-align:right; background-color:white; color:black" scope="col" abbr="male number">Number</th>

     <th style="text-align:right; background-color:white; color:black"  scope="col" abbr="female name">Name</th>

     <th style="text-align:right; background-color:white; color:black"  abbr="female number">Number</th>

   </tr>

   </thead>

   <tbody>

<tr ><td>1</td>

  <td >Jacob</td> <td>273,844</td> <td >Emily</td> <td>223,690</td></tr>

<tr ><td>2</td>

  <td >Michael</td> <td>250,554</td> <td >Madison</td> <td>193,152</td></tr>

<tr ><td>3</td>

  <td >Joshua</td> <td>231,926</td> <td >Emma</td> <td>181,257</td></tr>

<tr ><td>4</td>

  <td >Matthew</td> <td>221,513</td> <td >Olivia</td> <td>156,000</td></tr>

<tr ><td>5</td>

Using the regex "^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*" how do I extract the names only using re.findall to compile a list?

Thank you in advance.

Does it need to be a regular expression? Couldn't something like `'Jacob' in line or 'Emily' in line`, where `line` is an individual line in the file, suffice? — dddJewelsbbb, Dec 04 '19 at 01:48
@dddJewelsbbb Yes this needs to be done in regular expression. Supposedly it could be done in one line of code. — Karilyn Lee, Dec 04 '19 at 02:38

Emma · Answer 1 · 2019-12-04T02:31:28.020

Method 1

You can pass your expression to `re.findall`, or use a slightly modified version that also allows leading whitespace (used with `re.MULTILINE`), such as:

^[ \t]*<td\s*>([^<\r\n]+)<\/td\s*>.*<td\s*>([^<\r\n]+)<\/td\s*>
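
An anchored pattern like this only finds the rows if `^` is allowed to match at the start of every line, which in Python means passing `re.MULTILINE`; without it, `^` matches only at the very beginning of the string, which may explain an empty result. The following is a minimal sketch under a few assumptions: `sample` is a hypothetical two-row excerpt of the file, the name group is the question's `[a-zA-Z]+`, and leading whitespace is written as `[ \t]*` because Python's `re` module has no `\h` escape.

```python
import re

# Hypothetical two-row excerpt of the file from the question.
sample = (
    "<tr ><td>1</td>\n"
    "  <td >Jacob</td> <td>273,844</td> <td >Emily</td> <td>223,690</td></tr>\n"
    "<tr ><td>2</td>\n"
    "  <td >Michael</td> <td>250,554</td> <td >Madison</td> <td>193,152</td></tr>\n"
)

# Anchored pattern; [ \t]* allows the indentation in front of the first <td>.
pattern = r'^[ \t]*<td\s*>([a-zA-Z]+)</td\s*>.*<td\s*>([a-zA-Z]+)</td\s*>'

# Without re.MULTILINE, ^ can only match at position 0, where the text is "<tr >".
without_flag = re.findall(pattern, sample)

# With re.MULTILINE, ^ matches at the start of each line.
with_flag = re.findall(pattern, sample, flags=re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # [('Jacob', 'Emily'), ('Michael', 'Madison')]
```

Dropping the `^` altogether, as in Test 1 below, avoids the need for the flag.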


Test 1

import re

string = '''
<tr >

     <th style="text-align:right; background-color:white; color:black" scope="col">Rank</th>

     <th style="text-align:right; background-color:white; color:black" scope="col" abbr="male name">Name</th>

     <th style="text-align:right; background-color:white; color:black" scope="col" abbr="male number">Number</th>

     <th style="text-align:right; background-color:white; color:black"  scope="col" abbr="female name">Name</th>

     <th style="text-align:right; background-color:white; color:black"  abbr="female number">Number</th>

   </tr>

   </thead>

   <tbody>

<tr ><td>1</td>

  <td >Jacob</td> <td>273,844</td> <td >Emily</td> <td>223,690</td></tr>

<tr ><td>2</td>

  <td >Michael</td> <td>250,554</td> <td >Madison</td> <td>193,152</td></tr>

<tr ><td>3</td>

  <td >Joshua</td> <td>231,926</td> <td >Emma</td> <td>181,257</td></tr>

<tr ><td>4</td>

  <td >Matthew</td> <td>221,513</td> <td >Olivia</td> <td>156,000</td></tr>

<tr ><td>5</td>
'''

print(re.findall(r'<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*', string))

Output 1

[('Jacob', 'Emily'), ('Michael', 'Madison'), ('Joshua', 'Emma'), ('Matthew', 'Olivia')]

If you wish to simplify, update, or explore the expression, it is explained in the top-right panel of regex101.com. You can also watch the matching steps in the regex101 debugger, which shows how a regex engine might consume sample input strings step by step.
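
To run the same expression against a file rather than an inline string, one possible sketch is below. The helper name `extract_names`, the precompiled `NAME_ROW`, and the UTF-8 encoding are assumptions made for illustration; the pattern is the one from Test 1 with the trailing `.*` dropped. Because `.` does not match a newline by default, each match stays within a single line of the file.

```python
import re
from itertools import chain

# Pattern from Test 1: two alphabetic <td> cells on the same line.
NAME_ROW = re.compile(r'<td\s*>([a-zA-Z]+)</td\s*>.*<td\s*>([a-zA-Z]+)</td\s*>')


def extract_names(path):
    """Return a flat list of names from every line that holds two names."""
    # Assumes the file is UTF-8 text; adjust the encoding if it is not.
    with open(path, encoding='utf-8') as f:
        pairs = NAME_ROW.findall(f.read())  # list of (first, second) tuples
    # Flatten [('Jacob', 'Emily'), ...] into ['Jacob', 'Emily', ...].
    return list(chain.from_iterable(pairs))
```

Calling `extract_names('/path/to/your/filename.txt')` on the file from the question should give the names in row order. If the tuples are acceptable as they are, the `re.findall` call on the file contents is enough on its own.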

Method 2

A better approach might be to parse the HTML with BeautifulSoup (bs4):

Test 2

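import re
from bs4 import BeautifulSoup

# Read and parse the file; the with-block closes it even if parsing fails.
with open('/path/to/your/filename.txt', 'r') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Keep the <td> cells whose text starts with a non-digit character,
# which in this table selects the names and skips ranks and counts.
names = []
for cell in soup.find_all('td'):
    if re.match(r'\D+', cell.text):
        names.append(cell.text)
print(names)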

Output 2

['Jacob', 'Emily', 'Michael', 'Madison', 'Joshua', 'Emma', 'Matthew', 'Olivia']
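
If installing bs4 is not an option, roughly the same idea can be sketched with the standard library's `html.parser` module. The class name `NameCellParser` and the helper `names_from_html` are hypothetical, and the sketch assumes that a name is any `<td>` cell whose text is purely alphabetic.

```python
from html.parser import HTMLParser


class NameCellParser(HTMLParser):
    """Collect the text of <td> cells that consist only of letters."""

    def __init__(self):
        super().__init__()
        self.in_td = False  # True while the parser is inside a <td> element
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        text = data.strip()
        # Ranks ("1") and counts ("273,844") are not purely alphabetic.
        if self.in_td and text.isalpha():
            self.names.append(text)


def names_from_html(html):
    """Return the alphabetic <td> cell texts found in an HTML string."""
    parser = NameCellParser()
    parser.feed(html)
    parser.close()
    return parser.names
```

For example, `names_from_html(open('/path/to/your/filename.txt').read())` would feed the whole file to the parser.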

I've tried that and it's only returning an empty list. — Karilyn Lee, Dec 04 '19 at 01:54
Is it possible to run this code on an input file? — Karilyn Lee, Dec 04 '19 at 01:57