2

I know RegEx is not the best way to scrape HTML, but this is what I have to work with... I have something like:

<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a>  Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>

And I need to match the Writing and Art parts. But they are not guaranteed to be there, and there could be other parts like Ink and Pencils...

How do I do this? I need to use pure RegEx, no additional Python libs.

zondo
Maurizio
  • It isn't "not the best way", it isn't a way. If I require you to hammer a nail with a noodle, the failure to accomplish it is my fault, not yours. – msw Jan 09 '11 at 04:41
  • Yeah, right. I wouldn't do it if I weren't somehow forced to do it this way... unless you have a suggestion on how to read HTML without additional libraries in Python... – Maurizio Jan 10 '11 at 11:15
  • Sorry, didn't mean to sound harsh...I really don't need to read all the tags, just some specific ones, so I think this can be done... I could be wrong though... thanks! – Maurizio Jan 10 '11 at 11:24
  • Would be sooooo good if, just for once, people weren't admonished for wanting to learn regular expressions. XML parsers are ridiculously heavy-weight for a lot of situations. Imagine admonishing any beginner from learning BASIC or C when they could learn Java or C# instead.. just sheer stupidity. – PP. Jan 10 '11 at 11:52

5 Answers

2

Maybe there are two patterns to recognise.

  1. your keywords exist within a <td>...</td>
  2. your keywords are followed by a <a>...</a> section

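So, first extract everything within the <td>s. A Python sketch, where html is assumed to hold the page source:

import re

# re.DOTALL lets a cell span several lines
for m in re.finditer(r"<td[^>]*>(.*?)</td[^>]*>", html, re.DOTALL):
    inner = m.group(1)
    ...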

The (.*?) means match non-greedily, i.e. match the minimum possible. Otherwise you would match everything from the first <td> to the last </td> (instead of the next </td>).
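To see the difference, here is a small sketch in Python. The two-cell string is made up for illustration; only the quantifier changes between the two patterns.

```python
import re

# Hypothetical two-cell row, just to show the effect of the quantifier.
html = "<td>first</td><td>second</td>"

# Greedy: .* runs to the LAST </td>, swallowing the cell boundary.
greedy = re.findall(r"<td[^>]*>(.*)</td[^>]*>", html)

# Non-greedy: .*? stops at the NEXT </td>, giving one result per cell.
lazy = re.findall(r"<td[^>]*>(.*?)</td[^>]*>", html)

print(greedy)  # ['first</td><td>second']
print(lazy)    # ['first', 'second']
```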

Then you can move on to processing the inner portion!
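A minimal sketch of both steps together, using only the standard re module. The names TD_RE, PAIR_RE and credits_per_cell are made up for this example, and the inner pattern assumes every keyword is a single word followed by a colon and an <a> element, as in the question's sample.

```python
import re

# Step 1: contents of each <td>...</td> cell (non-greedy, may span lines).
TD_RE = re.compile(r"<td[^>]*>(.*?)</td[^>]*>", re.DOTALL)

# Step 2: inside a cell, a "Label:" followed by an <a>...</a> element.
PAIR_RE = re.compile(r"(\w+):\s*<a\b[^>]*>(.*?)</a>", re.DOTALL)


def credits_per_cell(html):
    """Return one list of (label, name) pairs per <td> cell."""
    return [PAIR_RE.findall(inner) for inner in TD_RE.findall(html)]


sample = (
    '<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a>  '
    'Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>'
)
print(credits_per_cell(sample))
# [[('Writing', 'Carlo Chendi'), ('Art', 'Luciano Bottaro')]]
```

Because the label is captured next to the name, it does not matter which parts are present or in what order.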

PP.
1
import re
regex = re.compile(r"(\w+):")
regex.findall(yourString)  # returns a list of the matched labels

You can test it here

PS: I highly recommend you to go through this

Community
Mahesh Velaga
1

I created this eventually:

(Art:|Pencils:|Ink:|Writing:){0,4}.<a href="creator\.php\?c=[^">]*?\"\>(?P<Name>.*?)\</a\>

That looks like it is working... maybe it can be polished a bit. I'm a beginner, you know.
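One possible polish, offered as a sketch rather than a tested replacement: capture the label exactly once in a named group instead of repeating it with {0,4}, drop the escapes that are not needed, and stop the name at the next <. The group name Role and the variable names are made up for this example.

```python
import re

# Same four labels as above, but captured once in a named group.
CREDIT_RE = re.compile(
    r'(?P<Role>Art|Pencils|Ink|Writing):\s*'
    r'<a href="creator\.php\?c=[^">]*">(?P<Name>[^<]*)</a>'
)

line = (
    '<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a>  '
    'Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>'
)

# One dict entry per credit found in the line.
credits = {m.group("Role"): m.group("Name") for m in CREDIT_RE.finditer(line)}
print(credits)  # {'Writing': 'Carlo Chendi', 'Art': 'Luciano Bottaro'}
```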

zondo
Maurizio
0

You can match optional things in regexes by putting a ? after the optional part. ? will match either 0 or 1 occurrences of a sub-expression.
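For example, in Python (a simplified sketch with plain text instead of the real HTML; the pattern and sample strings are made up for illustration), an optional group that does not take part in the match comes back as None:

```python
import re

# The whole "Writing: name " part is optional; "Art: name" is required.
pattern = re.compile(r"(?:Writing: (\w+) )?Art: (\w+)")

both = pattern.match("Writing: Chendi Art: Bottaro").groups()
art_only = pattern.match("Art: Perego").groups()

print(both)      # ('Chendi', 'Bottaro')
print(art_only)  # (None, 'Perego')
```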

Keith Irwin
0

Despite my previous answer, I changed my mind: I would rather NOT use alternation, but capture all the parts. This means that whatever is inside the TD tags has to be captured and properly classified. I need to make each capture group optional, so that whatever the layout is, I can still retrieve the content. It should work with input like this:

<td>   Art: <a href="creator.php?c=GPe">Giuseppe Perego</a> </td>
<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a>  Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>
<td>  Pencils: <a href="creator.php?c=JB">Jack Bradbury</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
<td>  Pencils: <a href="creator.php?c=JB">Jack Bradbury</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
<td> Writing: <a href="creator.php?c=DKi">Dick Kinney</a> Pencils: <a href="creator.php?c=TS">Tony Strobl</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
<td>  Pencils: <a href="creator.php?c=JB">Jack Bradbury</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
<td> Writing: <a href="creator.php?c=BKa">Bob Karp</a> Pencils: <a href="creator.php?c=AT">Al Taliaferro</a> Ink: <a href="creator.php?c=AH">Al Hubbard</a> </td>    
<td> Writing: <a href="creator.php?c=DKi">Dick Kinney</a> Pencils: <a href="creator.php?c=TS">Tony Strobl</a> Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>
<td> Writing: <a href="creator.php?c=VLo">Vic Lockman</a>  Art: <a href="creator.php?c=KWr">Kay Wright</a> </td>
<td> Writing: <a href="creator.php?c=MGa">Michele Gazzarri</a>  Art: <a href="creator.php?c=GPe">Giuseppe Perego</a> </td>

I created:

<td>\ {1,3}(?:(?:Writing: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>).*?)?(?:(?:Pencils: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>\ ))?(?:(?:Ink: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>))?(?:(?:Art: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>))?\ {1,3}</td>

And it looks like it is working!

I'd really appreciate it if someone could check and validate my effort.
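One way to check it, as a rough sketch: run the pattern unchanged over a few of the rows above in Python and look at the four groups, which come out in the fixed order Writing, Pencils, Ink, Art. The names PATTERN, ROWS and results are made up for this example.

```python
import re

# The pattern from above, unchanged, split across lines for readability.
# Groups 1-4 are Writing, Pencils, Ink, Art; a missing part gives None.
PATTERN = re.compile(
    r'<td>\ {1,3}'
    r'(?:(?:Writing: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>).*?)?'
    r'(?:(?:Pencils: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>\ ))?'
    r'(?:(?:Ink: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>))?'
    r'(?:(?:Art: <a href="creator\.php\?c=[^>"]*?">(.*?)?</a>))?'
    r'\ {1,3}</td>'
)

ROWS = [
    '<td>   Art: <a href="creator.php?c=GPe">Giuseppe Perego</a> </td>',
    '<td> Writing: <a href="creator.php?c=CCh">Carlo Chendi</a>  '
    'Art: <a href="creator.php?c=LBo">Luciano Bottaro</a> </td>',
    '<td>  Pencils: <a href="creator.php?c=JB">Jack Bradbury</a> '
    'Ink: <a href="creator.php?c=SSt">Steve Steere</a> </td>',
]

results = [PATTERN.search(row).groups() for row in ROWS]
for groups in results:
    print(groups)
# (None, None, None, 'Giuseppe Perego')
# ('Carlo Chendi', None, None, 'Luciano Bottaro')
# (None, 'Jack Bradbury', 'Steve Steere', None)
```

Note that the pattern only matches when the parts appear in this order and with this spacing, so a row laid out differently would need another look.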

zondo
Maurizio
  • On second thought, I could have simply retrieved the names between the tags and then stripped them in Python... but I enjoyed it! – Maurizio Jan 11 '11 at 00:53
  • The difficulty for you here is dealing with multiple matches. Let's say you have both Writing and Art between `td`s... you will not know which match number to inspect. I would suggest a multiple step process. First, extract everything within the `td`s. Then, inside a loop, match globally (i.e. return one result at a time). But you seem to be picking up the syntax of regular expressions okay. – PP. Jan 11 '11 at 09:16