RegEx - Match optional groups

Question

I know RegEx is not the best way to scrape HTML, but it is what I have to work with. I have something like:

<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a>  Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>

I need to match the Writing and Art parts. They are not guaranteed to be present, though, and there could be other parts such as Ink and Pencils.

How do I do this? I need to use pure RegEx, no additional Python libs.

It isn't "not the best way", it isn't a way. If I require you to hammer a nail with a noodle, the failure to accomplish it is my fault, not yours. — msw, Jan 09 '11 at 04:41
Yeah, right. Wouldn't do that if i wouldn't be somehow forced to do that way... unless you have a suggestion on how to read a html without additional libraries in Python... — Maurizio, Jan 10 '11 at 11:15
Sorry, didn't mean to sound harsh...I really don't need to read all the tags, just some specific ones, so I think this can be done... I could be wrong though... thanks! — Maurizio, Jan 10 '11 at 11:24
Would be sooooo good if, just for once, people weren't admonished for wanting to learn regular expressions. XML parsers are ridiculously heavy-weight for a lot of situations. Imagine admonishing any beginner from learning BASIC or C when they could learn Java or C# instead.. just sheer stupidity. — PP., Jan 10 '11 at 11:52

score 2 · Answer 1 · answered Jan 10 '11 at 11:49

Maybe there are two patterns to recognise.

1. Your keywords exist within a <td>...</td>.
2. Your keywords are followed by an <a>...</a> section.

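So, first extract everything within the <td>s (sketched in Python; `html` is assumed to hold the page source):

import re

for match in re.finditer(r"<td[^>]*>(.*?)</td[^>]*>", html):
    inner = match.group(1)  # the text between <td> and </td>
    ...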

The (.*?) means match non-greedily, i.e. match the minimum possible. Otherwise you would match everything from the first <td> to the last </td> (instead of the next </td>).
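
To make the difference concrete, here is a minimal sketch comparing the non-greedy and greedy forms of the same pattern. The two-cell `html` string is made-up sample data in the style of the question, not taken from the real page.

```python
import re

# Made-up sample: two table cells on a single line.
html = "<td> Art: A </td><td> Ink: B </td>"

# Non-greedy: each match stops at the next </td>.
lazy = re.findall(r"<td[^>]*>(.*?)</td[^>]*>", html)

# Greedy: the single match runs on to the last </td>.
greedy = re.findall(r"<td[^>]*>(.*)</td[^>]*>", html)

print(lazy)    # [' Art: A ', ' Ink: B ']
print(greedy)  # [' Art: A </td><td> Ink: B ']
```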

Then you can move on to processing the inner portion!

score 1 · Answer 2 · edited May 23 '17 at 11:44

import re
regex = re.compile(r"(\w+):")
regex.findall(your_string)  # returns a list of the matched labels, e.g. ['Writing', 'Art']

You can test it here

PS: I highly recommend you to go through this

answered Jan 09 '11 at 04:44 by Mahesh Velaga · last edited by Community

score 1 · Answer 3 · edited Jul 08 '16 at 04:55

I created this eventually:

(Art:|Pencils:|Ink:|Writing:){0,4}.<a href="creator\.php\?c=[^">]*?\"\>(?P<Name>.*?)\</a\>

That looks like it is working, though it could probably be polished a bit. I'm a beginner, after all.
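
One possible way to polish it, offered only as a sketch: capture the label in its own named group and iterate over all matches in a cell. The group name `Role` and the variable names are illustrative additions; the four labels are the ones listed in the pattern above.

```python
import re

# Hypothetical tidy-up of the pattern above: one label, then the link.
CREDIT = re.compile(
    r'(?P<Role>Art|Pencils|Ink|Writing):\s*'
    r'<a href="creator\.php\?c=[^"]*">(?P<Name>[^<]*)</a>'
)

# Sample cell copied from the question.
cell = ('<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a>  '
        'Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>')

# finditer yields one match per label/link pair found in the cell.
credits = [(m.group("Role"), m.group("Name")) for m in CREDIT.finditer(cell)]
print(credits)  # [('Writing', 'Carlo Chendi'), ('Art', 'Luciano Bottaro')]
```

Because each match carries its own label, a missing part (no Writing, say) simply produces one fewer pair, so nothing in the pattern has to be marked optional.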

answered Jan 10 '11 at 11:17 by Maurizio · last edited by zondo

score 0 · Answer 4 · answered Jan 09 '11 at 04:34

You can match optional things in regexes by putting a ? after the optional part. ? matches either 0 or 1 occurrences of a sub-expression.
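
A small sketch of the idea in Python, using simplified plain-text input (no HTML) that is made up for illustration. Each part sits in a non-capturing group followed by ?, and a capture group inside a skipped part comes back as None.

```python
import re

# Hypothetical, simplified input format: "Writing: name Art: name".
pattern = re.compile(r"(?:Writing: (?P<writing>\w+))?\s*(?:Art: (?P<art>\w+))?")

both = pattern.fullmatch("Writing: Chendi Art: Bottaro")
only_art = pattern.fullmatch("Art: Perego")

print(both.groupdict())      # {'writing': 'Chendi', 'art': 'Bottaro'}
print(only_art.groupdict())  # {'writing': None, 'art': 'Perego'}
```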

answered by Keith Irwin

score 0 · Accepted Answer · edited Jul 08 '16 at 04:55

Despite my previous answer, I changed my mind: I would rather NOT use alternation, but capture all the parts. This means that whatever is inside the TD tags has to be captured and properly classified. I need to make each capture group optional, so that whatever the layout is, I can still retrieve the content. It should work with input such as:

<td>   Art: <a href="creator.php?c=GPe">Giuseppe Perego</a> </td>
<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a>  Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>
<td>  Pencils: <a href="creator.php?c=JB">Jack Bradbury</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
<td>  Pencils: <a href="creator.php?c=JB">Jack Bradbury</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
<td> Writing: <a href="creator.php?c=DKi">Dick Kinney</a> Pencils: <a href="creator.php?c=TS">Tony Strobl</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
<td>  Pencils: <a href="creator.php?c=JB">Jack Bradbury</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
<td> Writing: <a href="creator.php?c=BKa">Bob Karp</a> Pencils: <a href="creator.php?c=AT">Al Taliaferro</a> Ink: <a href="creator.php?c=AH">Al Hubbard</a> </td>    
<td> Writing: <a href="creator.php?c=DKi">Dick Kinney</a> Pencils: <a href="creator.php?c=TS">Tony Strobl</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
<td> Writing: <a href="creator.php?c=VLo">Vic Lockman</a>  Art: <a href="creator.php?c=KWr">Kay Wright</a> </td>
<td> Writing: <a href="creator.php?c=MGa">Michele Gazzarri</a>  Art: <a href="creator.php?c=GPe">Giuseppe Perego</a> </td>

I created:

<td>\ {1,3}(?:(?:Writing: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>).*?)?(?:(?:Pencils: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>\ ))?(?:(?:Ink: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>))?(?:(?:Art: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>))?\ {1,3}</td>

And it looks like it is working!

I'd really appreciate it if someone could check and validate my effort.

as a second thought, i could have simply retrieved the names between tags and then strip them in Python...but i enjoyed! — Maurizio, Jan 11 '11 at 00:53
The difficulty for you here is dealing with multiple matches. Let's say you have both Writing and Art between `td`s... you will not know which match number to inspect. I would suggest a multiple step process. First, extract everything within the `td`s. Then, inside a loop, match globally (i.e. return one result at a time). But you seem to be picking up the syntax of regular expressions okay. — PP., Jan 11 '11 at 09:16
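
The multi-step process suggested in the comment above could look roughly like this in Python. It is a sketch, not a validated scraper: the pattern names `TD` and `CREDIT` are made up here, the two rows are copied from the sample data, and it assumes any word followed by a colon and a link is a credit label.

```python
import re

# Step 1 pattern: the contents of each table cell.
TD = re.compile(r"<td[^>]*>(.*?)</td>")
# Step 2 pattern: a label, a colon, then a link whose text is the name.
CREDIT = re.compile(r"(\w+):\s*<a\b[^>]*>(.*?)</a>")

html = '''
<td>   Art: <a href="creator.php?c=GPe">Giuseppe Perego</a> </td>
<td> Writing: <a href="creator.php?c=DKi">Dick Kinney</a> Pencils: <a href="creator.php?c=TS">Tony Strobl</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
'''

rows = []
for cell in TD.finditer(html):
    # Inside one cell, collect every label/name pair into a dict.
    pairs = CREDIT.findall(cell.group(1))
    rows.append({role: name.strip() for role, name in pairs})

for row in rows:
    print(row)
# {'Art': 'Giuseppe Perego'}
# {'Writing': 'Dick Kinney', 'Pencils': 'Tony Strobl', 'Ink': 'Steve Steere'}
```

Keying each row by label avoids having to know which group number holds which credit, and a label that is absent from a cell is simply absent from that row's dict.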
