
Suppose I only need the lines in a txt file that contain two names such as:

<td >Jacob</td> <td>273,844</td> <td >Emily</td> <td>223,690</td></tr>

And the txt file contains the text below:

<tr >

     <th style="text-align:right; background-color:white; color:black" scope="col">Rank</th>

     <th style="text-align:right; background-color:white; color:black" scope="col" abbr="male name">Name</th>

     <th style="text-align:right; background-color:white; color:black" scope="col" abbr="male number">Number</th>

     <th style="text-align:right; background-color:white; color:black"  scope="col" abbr="female name">Name</th>

     <th style="text-align:right; background-color:white; color:black"  abbr="female number">Number</th>

   </tr>

   </thead>

   <tbody>

<tr ><td>1</td>

  <td >Jacob</td> <td>273,844</td> <td >Emily</td> <td>223,690</td></tr>

<tr ><td>2</td>

  <td >Michael</td> <td>250,554</td> <td >Madison</td> <td>193,152</td></tr>

<tr ><td>3</td>

  <td >Joshua</td> <td>231,926</td> <td >Emma</td> <td>181,257</td></tr>

<tr ><td>4</td>

  <td >Matthew</td> <td>221,513</td> <td >Olivia</td> <td>156,000</td></tr>

<tr ><td>5</td>

Using the regex "^<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*" how do I extract the names only using re.findall to compile a list?

Thank you in advance.

Karilyn Lee
  • Does it need to be a regular expression? Couldn't something like `'Jacob' in line or 'Emily' in line`, where `line` is an individual line in the file, suffice? – dddJewelsbbb Dec 04 '19 at 01:48
  • @dddJewelsbbb Yes, this needs to be done with a regular expression. Supposedly it can be done in one line of code. – Karilyn Lee Dec 04 '19 at 02:38

1 Answer

Method 1

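I guess you can simply pass your expression to re.findall, or a slightly modified version of it. The lines with the names are indented, so allow leading spaces or tabs after ^, and compile with re.MULTILINE so that ^ matches at the start of every line rather than only at the start of the string. Keep [a-zA-Z]+ for both groups: a looser class such as [^<\r\n]+ would let the greedy .* hand the last cell on the line (the number) to the second group:

^[ \t]*<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>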
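Python's re module has no \h escape, so a minimal sketch of an anchored pattern of this kind could use [ \t]* for the leading indentation and pass re.MULTILINE. The names sample and pattern below are illustrative, and the sample is a shortened copy of the table rows from the question:

```python
import re

# Shortened sample in the same shape as the question's file (illustrative).
sample = (
    "<tr ><td>1</td>\n"
    "  <td >Jacob</td> <td>273,844</td> <td >Emily</td> <td>223,690</td></tr>\n"
    "<tr ><td>2</td>\n"
    "  <td >Michael</td> <td>250,554</td> <td >Madison</td> <td>193,152</td></tr>\n"
)

# [ \t]* allows leading indentation; re.MULTILINE lets ^ match at the
# start of every line instead of only at the start of the whole string.
pattern = re.compile(
    r"^[ \t]*<td\s*>([a-zA-Z]+)</td\s*>.*<td\s*>([a-zA-Z]+)</td\s*>",
    re.MULTILINE,
)

pairs = pattern.findall(sample)
print(pairs)
```

Because both groups only accept letters, the greedy .* has to backtrack past the trailing number cell until it reaches the second name.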


Test 1

import re

string = '''
<tr >

     <th style="text-align:right; background-color:white; color:black" scope="col">Rank</th>

     <th style="text-align:right; background-color:white; color:black" scope="col" abbr="male name">Name</th>

     <th style="text-align:right; background-color:white; color:black" scope="col" abbr="male number">Number</th>

     <th style="text-align:right; background-color:white; color:black"  scope="col" abbr="female name">Name</th>

     <th style="text-align:right; background-color:white; color:black"  abbr="female number">Number</th>

   </tr>

   </thead>

   <tbody>

<tr ><td>1</td>

  <td >Jacob</td> <td>273,844</td> <td >Emily</td> <td>223,690</td></tr>

<tr ><td>2</td>

  <td >Michael</td> <td>250,554</td> <td >Madison</td> <td>193,152</td></tr>

<tr ><td>3</td>

  <td >Joshua</td> <td>231,926</td> <td >Emma</td> <td>181,257</td></tr>

<tr ><td>4</td>

  <td >Matthew</td> <td>221,513</td> <td >Olivia</td> <td>156,000</td></tr>

<tr ><td>5</td>
'''

print(re.findall(r'<td\s*>([a-zA-Z]+)<\/td\s*>.*<td\s*>([a-zA-Z]+)<\/td\s*>.*', string))

Output 1

[('Jacob', 'Emily'), ('Michael', 'Madison'), ('Joshua', 'Emma'), ('Matthew', 'Olivia')]
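re.findall returns one tuple per matching line because the pattern has two groups. If a flat list of names is wanted, one possible sketch is to flatten the tuples with itertools.chain, or to split them into two lists with zip. The variable names here (html, pairs, names, male, female) are illustrative, and the sample is shortened:

```python
import re
from itertools import chain

# Shortened sample in the same shape as the question's file (illustrative).
html = '''<tr ><td>1</td>
  <td >Jacob</td> <td>273,844</td> <td >Emily</td> <td>223,690</td></tr>
<tr ><td>2</td>
  <td >Michael</td> <td>250,554</td> <td >Madison</td> <td>193,152</td></tr>
'''

# One (male, female) tuple per matching line.
pairs = re.findall(r'<td\s*>([a-zA-Z]+)</td\s*>.*<td\s*>([a-zA-Z]+)</td\s*>', html)

# Flatten the tuples into a single list, keeping row order.
names = list(chain.from_iterable(pairs))

# Alternatively, separate the two columns.
male, female = (list(column) for column in zip(*pairs))

print(names)
print(male, female)
```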

If you wish to simplify, update, or explore the expression, regex101.com explains it in its top-right panel. Its debugger also lets you watch or modify the matching steps, showing how a regex engine might consume a sample input string step by step.


Method 2

A better approach, though, might be to parse the HTML with bs4 (BeautifulSoup):

Test 2

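import re
from bs4 import BeautifulSoup

# Parse the file; the with block closes it even if parsing fails.
with open('/path/to/your/filename.txt', 'rb') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Keep only cells whose text starts with a non-digit (the names),
# skipping ranks and counts such as "1" or "273,844".
names = []
for cell in soup.find_all('td'):
    if re.match(r'\D+', cell.text):
        names.append(cell.text)
print(names)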

Output 2

['Jacob', 'Emily', 'Michael', 'Madison', 'Joshua', 'Emma', 'Matthew', 'Olivia']
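If installing bs4 is not an option, a similar result can probably be had with the standard library's html.parser. This is a swapped-in technique rather than part of the method above; the class name NameCellParser and the shortened sample are hypothetical, and the sketch assumes that names consist of letters only:

```python
from html.parser import HTMLParser


class NameCellParser(HTMLParser):
    """Collects the text of <td> cells that contain only letters."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        text = data.strip()
        # str.isalpha() is False for ranks and counts such as "1" or "273,844".
        if self.in_td and text.isalpha():
            self.names.append(text)


# Shortened sample in the same shape as the question's file (illustrative).
sample = (
    "<tr ><td>1</td>\n"
    "  <td >Jacob</td> <td>273,844</td> <td >Emily</td> <td>223,690</td></tr>\n"
)

parser = NameCellParser()
parser.feed(sample)
parser.close()
print(parser.names)
```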
Emma