python re.findall pattern for different number of matches

Question

<tr>
11:15
12:15
13:15
</tr>

<tr>
18:15
19:15
20:15
</tr>

In this case the output should be: [ (11:15, 12:15, 13:15), (18:15, 19:15, 20:15) ]

My pattern, (\d\d:\d\d)[\s\S]*?(\d\d:\d\d)[\s\S]*?(\d\d:\d\d)[\s\S]*?</tr>, works only if there are exactly 3 hours in each tr tag.

But it should also work when there are 1 to 3 hours (in the same format, \d\d:\d\d) in each tr tag. Here is another example, for which my pattern no longer works:

<tr>12:00 13:00</tr>
<tr>14:00 15:00 16:00</tr>
<tr>12:00</tr>

Output should be: [ (12:00, 13:00, ), (14:00, 15:00, 16:00), (12:00, , ) ]

One more thing: in the real file the hours aren't separated by whitespace alone, which is why I used [\s\S]*? or [\w\s<>="-/:;?|]*? between them. An hour is either in a simple span or in a longer form, as in the example below.

example:

<tr>
<span class="na">16:00</span>
<span>|</span><a href="http:/21.28.147.68/msi/default.aspx?event_id=52514&amp;typetran=1&amp;ReturnLink=http://www.kino.pl/kina/przedwiosnie/repertuar.php" class="toolBox" data-hasqtip="true" aria-describedby="qtip-0">20:45</td>
</tr>

Don't tell me you are [using regex to parse html](http://stackoverflow.com/a/1732454/5827958). – zondo, Feb 24 '16 at 21:12
`[\s\S]` matches anything, including newlines, which `.` doesn't match. – slugo, Feb 24 '16 at 21:16
For the real-life example, wouldn't it be enough to just do `re.findall('\d\d:\d\d', target_source)`? – Quinn, Feb 24 '16 at 22:11

alecxe · Accepted Answer · 2016-02-24T21:26:18.337

1

I would parse the HTML with an HTML parser, find all tr elements in the table, and split the contents of each row using str.split(), which handles both spaces and newlines. Example using the BeautifulSoup parser:

from bs4 import BeautifulSoup

data = """
<table>
    <tr>
    11:15
    12:15
    13:15
    </tr>

    <tr>
    18:15
    19:15
    20:15
    </tr>

    <tr>12:00 13:00</tr>
    <tr>14:00 15:00 16:00</tr>
    <tr>12:00</tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")

result = [row.text.split() for row in soup.table.find_all("tr")]
print(result)

Prints:

[['11:15', '12:15', '13:15'], 
 ['18:15', '19:15', '20:15'], 
 ['12:00', '13:00'], 
 ['14:00', '15:00', '16:00'], 
 ['12:00']]

"An hour is either in a simple span or in a longer form."

That makes it even easier: find every piece of text inside a tr that matches the time pattern, and strip it:

import re

result = [[elm.strip() for elm in row.find_all(string=re.compile(r"\d\d:\d\d"))]
          for row in soup.table.find_all("tr")]
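If installing BeautifulSoup isn't an option, roughly the same idea can be sketched with the standard library's html.parser. This is a swapped-in technique, not what the answer above uses. The RowTimes class and the snippet string are made up for illustration, and the snippet is a simplified stand-in for the question's real markup (shortened href, matching closing tag).

```python
import re
from html.parser import HTMLParser

TIME = re.compile(r"\d\d:\d\d")


class RowTimes(HTMLParser):
    """Collect HH:MM strings per <tr>, ignoring any tags nested inside."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of times per <tr>
        self.in_row = False  # True while we are between <tr> and </tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self.in_row = True

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_row = False

    def handle_data(self, data):
        # Text nodes arrive here regardless of which inner tag wraps them.
        if self.in_row:
            self.rows[-1].extend(TIME.findall(data))


# Hypothetical, simplified version of the question's real-life row.
snippet = """
<tr>
<span class="na">16:00</span>
<span>|</span><a href="#" class="toolBox">20:45</a>
</tr>
<tr><span>12:00</span></tr>
"""

parser = RowTimes()
parser.feed(snippet)
parser.close()
print(parser.rows)  # [['16:00', '20:45'], ['12:00']]
```

Because the times are read from text nodes only, it doesn't matter whether an hour sits in a span, a link, or directly in the row.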

edited Feb 24 '16 at 21:26

answered Feb 24 '16 at 21:13

alecxe


@kierrez could you please edit the question and insert this real html snippet into the question? Thanks. – alecxe Feb 24 '16 at 21:19
@kierrez If you're doing anything at all complicated, you should definitely be using an HTML parser. – Alyssa Haroldsen Feb 24 '16 at 21:22
@kierrez sorry for multiple updates, now you can check the updated code sample. – alecxe Feb 24 '16 at 21:25
wow thanks. I was doing this for 3 days... I never heard about HTML parser, but I will be using it from now on – kierrez Feb 24 '16 at 21:25

score 0 · Answer 2 · answered Feb 24 '16 at 21:19

If you'd prefer regex, you could use this:

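import re

found = []
# Capture each row lazily so that one match never runs past its own </tr>.
for row in re.findall(r'<tr\b[^>]*>(.*?)</tr>', data, re.DOTALL):
    found.append(re.findall(r'\d\d:\d\d', row))
# For the three single-line rows of the second example:
# found == [['12:00', '13:00'], ['14:00', '15:00', '16:00'], ['12:00']]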
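With re.DOTALL, whether the row pattern is greedy or lazy matters as soon as the input holds more than one row, because `.` can then cross newlines and row boundaries. A minimal sketch of the difference, using only `re` and a made-up three-row string (the names rows_html, greedy and lazy are just for illustration):

```python
import re

# Hypothetical input: the three single-line rows from the question.
rows_html = "<tr>12:00 13:00</tr>\n<tr>14:00 15:00 16:00</tr>\n<tr>12:00</tr>"

TIME = r"\d\d:\d\d"

# Greedy: .* runs to the LAST </tr>, so all rows collapse into one match.
greedy = [re.findall(TIME, chunk)
          for chunk in re.findall(r"<tr>(.*)</tr>", rows_html, re.DOTALL)]

# Lazy: .*? stops at the FIRST </tr>, giving one match per row.
lazy = [re.findall(TIME, chunk)
        for chunk in re.findall(r"<tr>(.*?)</tr>", rows_html, re.DOTALL)]

print(greedy)  # [['12:00', '13:00', '14:00', '15:00', '16:00', '12:00']]
print(lazy)    # [['12:00', '13:00'], ['14:00', '15:00', '16:00'], ['12:00']]
```

The lazy form keeps the per-row grouping even when rows span several lines or contain nested tags, as long as no tr is nested inside another.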

Quinn · Answer 3 · 2016-02-26T01:46:56.860

0

Try this solution using regex:

import re

input = """
<tr>
11:15
12:15
13:15
</tr>

<tr>
18:15
19:15
20:15
</tr>

<tr>12:00 13:00</tr>
<tr>14:00 15:00 16:00</tr>
<tr>12:00</tr>
"""

print([re.findall(r'(\d\d:\d\d)', tr) for tr in re.findall(r'<tr>([^<]*)</tr>', input)])

Output:

[['11:15', '12:15', '13:15'], 
 ['18:15', '19:15', '20:15'], 
 ['12:00', '13:00'], 
 ['14:00', '15:00', '16:00'], 
 ['12:00']]
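The question actually asks for fixed-width tuples, with empty strings where a row has fewer than three hours. Here is one possible way to get that shape from per-row lists like the ones above. The helper name rows_as_tuples and its width parameter are assumptions for illustration, not part of any answer here.

```python
import re


def rows_as_tuples(html, width=3):
    """Return one tuple per <tr>, padded with '' up to `width` entries."""
    result = []
    # Lazy match so each capture stops at its own closing </tr>.
    for row in re.findall(r"<tr\b[^>]*>(.*?)</tr>", html, re.DOTALL):
        times = re.findall(r"\d\d:\d\d", row)[:width]
        # Pad short rows so every tuple has exactly `width` items.
        result.append(tuple(times + [""] * (width - len(times))))
    return result


sample = "<tr>12:00 13:00</tr>\n<tr>14:00 15:00 16:00</tr>\n<tr>12:00</tr>"
print(rows_as_tuples(sample))
# [('12:00', '13:00', ''), ('14:00', '15:00', '16:00'), ('12:00', '', '')]
```

Rows with more than `width` hours are truncated in this sketch; drop the slice if you'd rather keep them all.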

edited Feb 26 '16 at 01:46

answered Feb 26 '16 at 01:38

Quinn

