Python, Matching uneven length scraped lists

Question

Please prepare for a long read. I am at a standstill and don't know where to look for an answer or what else to try. Needless to say, I am fairly new to programming; I have been hacking away at this project for the past couple of weeks.

PROBLEM

I have a table with 25 rows and 2 columns. Each row is structured like one of the following:

Needed event

<td align=center>19/11/11<br>12:01:21 AM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=1><font color=#006633>player1</font></a> hospitalized <a href=profiles.php?XID=2><font color=#006633>player2</font></a></font></td>

NOT Needed event CASE A

<td align="center">19/11/11<br />12:58:03 AM</td>
<td align="center"><font color="#AA0000">Someone hospitalized <a href=profiles.php?XID=1><font color="#AA0000">player1</font></a></font></td>

NOT Needed event CASE B

<td align="center">19/11/11<br />12:58:03 AM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=3><font color=#006633>player3</font></a> attacked <a href=profiles.php?XID=1><font color=#006633>player1</font></a> and lost </font></td>

I have used regex to scrape the needed data. My problem is that the two lists are not evenly matched: the date and time don't always line up with the right event.

1st ATTEMPT at solving problem

import mechanize
import re

# br is a mechanize.Browser() that has already opened the page
htmlA1 = br.response().read()

patAttackDate = re.compile('<td align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+ \w+)')
patAttackName = re.compile('<font color=#006633>(\w+)</font></a> hospitalized ')
searchAttackDate = re.findall(patAttackDate, htmlA1)
searchAttackName = re.findall(patAttackName, htmlA1)

# pairs the two lists by position only
pairs = zip(searchAttackDate, searchAttackName)

for i in pairs:
    print (i)

But that gives me a list in which the event type is correct but the time is wrong.

for example:

(('19/11/11', '9:47:51 PM'), 'user1') <- mismatch 
(('19/11/11', '8:21:18 PM'), 'user1') <- mismatch
(('19/11/11', '7:33:00 PM'), 'user1') <- As a consequence of the below, the rest upwards are mismatched 
(('19/11/11', '7:32:38 PM'), 'user2') <- NOT a match, case B
(('19/11/11', '7:32:22 PM'), 'user2') <- match ok
(('19/11/11', '7:26:53 PM'), 'user2') <- match ok
(('19/11/11', '7:25:24 PM'), 'user3') <- match ok
(('19/11/11', '7:24:22 PM'), 'user3') <- match ok
(('19/11/11', '7:23:25 PM'), 'user3') <- match ok
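
The mismatch in the list above is consistent with the two patterns matching different numbers of rows: the date pattern matches every row, while the name pattern only matches "hospitalized" rows, and zip pairs the results purely by position. Here is a minimal sketch of that effect, assuming the raw page uses the unquoted markup for every row. The three-row sample and its names are made up for illustration, and the code is written for Python 3, whereas the thread itself uses Python 2.

```python
import re

# Hypothetical three-row sample: hospitalized, attacked-and-lost, hospitalized.
html = (
    "<td align=center>19/11/11<br>7:33:00 PM</td>\n"
    "<td align=center><font color=#006633><a href=profiles.php?XID=1>"
    "<font color=#006633>user1</font></a> hospitalized "
    "<a href=profiles.php?XID=9><font color=#006633>victim</font></a></font></td>\n"
    "<td align=center>19/11/11<br>7:32:38 PM</td>\n"
    "<td align=center><font color=#006633><a href=profiles.php?XID=3>"
    "<font color=#006633>user3</font></a> attacked "
    "<a href=profiles.php?XID=1><font color=#006633>user1</font></a> and lost </font></td>\n"
    "<td align=center>19/11/11<br>7:32:22 PM</td>\n"
    "<td align=center><font color=#006633><a href=profiles.php?XID=2>"
    "<font color=#006633>user2</font></a> hospitalized "
    "<a href=profiles.php?XID=9><font color=#006633>victim</font></a></font></td>\n"
)

dates = re.findall(r"<td align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+ \w+)", html)
names = re.findall(r"<font color=#006633>(\w+)</font></a> hospitalized ", html)

# Three dates but only two names: zip() pairs by position and silently
# drops the surplus, so the second name lands on the wrong timestamp.
pairs = list(zip(dates, names))

for pair in pairs:
    print(pair)
```

In this sample user2 ends up paired with the timestamp of the "attacked" row, which is the same kind of shift shown in the question's output.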

2nd ATTEMPT at solving problem

So I thought I would strip the newlines from the whole page and scrape just the table, but:

import mechanize
import re
from BeautifulSoup import BeautifulSoup

htmlA1 = br.response().read()

stripped = htmlA1.replace(">\n<","><") #Removed all '\n' from code

soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%') #this is the table I need to work with

patAttackDate = re.compile('<td align="center">(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)')
searchAttackDate = re.findall(patAttackDate, table3)
print searchAttackDate

this gives me an error:

return _compile(pattern, flags).findall(string)
TypeError: expected string or buffer

What am I missing?

Bonus question: Is there any way to account for XID being a dynamic variable but bypass it when using regex / beautifulsoup (or other scraping method)? As the project 'grows' I might need to include the XID portion of code but don't want to match to it. (not sure if this is clear)

Thank you for your time

EDIT 1: Added list example
EDIT 2: Made code separation more visible
EDIT 3: Added sample code for a given solution that doesn't seem to work

Test = '''<table><tr><td>date</td></tr></table>'''
soupTest = BeautifulSoup(Test)
test2 = soupTest.find('table')
patTest = re.compile('<td>(.*)</td>')
searchTest = patTest.findall(test2.getText())
print test2 # gives: <table><tr><td>date</td></tr></table> 
print type(test2) # gives: <class 'BeautifulSoup.Tag'>
print searchTest #gives: []

EDIT 4 - Solution

import re
import mechanize
from BeautifulSoup import BeautifulSoup

htmlA1 = br.response().read()
stripped = htmlA1.replace(">\n<","><") #stripped '\n' from html
soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%') #table I need to work with

print type(table3) # gives <class 'BeautifulSoup.Tag'>
strTable3 = str(table3) #convert table3 to string type so i can regex it

patFinal = re.compile(('(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)</td><td align="center">'
                      '<font color="#006633"><a href="profiles.php\?XID=(\d+)">'
                      '<font color="#006633">(\w+)</font></a> hospitalized <a'), re.IGNORECASE)
searchFinal = re.findall(patFinal, strTable3)

for i in searchFinal:
    print (i)

Sample output

('19/11/11', '1:08:07 AM', 'ID_user1', 'user1')
('19/11/11', '1:06:55 AM', 'ID_user1', 'user1')
('19/11/11', '1:05:46 AM', 'ID_user1', 'user1')
('19/11/11', '1:04:33 AM', 'ID_user1', 'user1')
('19/11/11', '1:03:32 AM', 'ID_user1', 'user1')
('19/11/11', '1:02:37 AM', 'ID_user1', 'user1')
('19/11/11', '1:00:43 AM', 'ID_user1', 'user1')
('19/11/11', '12:55:35 AM', 'ID_user2', 'user2')

EDIT 5 - A much simpler solution (on 1st attempt - without Beautifulsoup)

import re

reAttack = (r'<td\s+align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+\s+\w+)</td>\s*'
            '<td.*?' #accounts for the '\n'
            '<font\s+color=#006633>(\w+)</font></a>\s+hospitalized\s+')

for m in re.finditer(reAttack, htmlA1):
    print 'date: %s; time: %s; player: %s' % (m.group(1), m.group(2), m.group(3))

Sample Output

date: 19/11/11; time: 1:08:07 AM; player: user1
date: 19/11/11; time: 1:06:55 AM; player: user1
date: 19/11/11; time: 1:05:46 AM; player: user1
date: 19/11/11; time: 1:04:33 AM; player: user1
date: 19/11/11; time: 1:03:32 AM; player: user1
date: 19/11/11; time: 1:02:37 AM; player: user1
date: 19/11/11; time: 1:00:43 AM; player: user1
date: 19/11/11; time: 12:55:35 AM; player: user2

No wonder you're having problems. You're using regular expressions on HTML. — Ignacio Vazquez-Abrams, Nov 19 '11 at 21:24
Could you clarify the rules for the needed cases? I'm not sure direct regex is the best way to solve this in Python, and besides, I don't understand your exact date problem. — alonisser, Nov 19 '11 at 21:31
I'm just waiting to see how many people ignore the fact that you are using BeautifulSoup and just post this link. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — steveha, Nov 19 '11 at 21:33

score 3 · Answer 1 · answered Nov 19 '11 at 21:37

From your description, I haven't yet figured out exactly what you are trying to do. But I can tell you one thing right now: with regular expressions, Python raw strings are your friend.

Try using r'pattern' instead of just 'pattern' in your BeautifulSoup program.

Also, when you are working with regular expressions, sometimes it is valuable to start with simple patterns, verify that they work, then build them up. You have gone straight to complicated patterns, and I'm certain they don't work since you didn't use the raw strings and the backslashes won't be right.

steveha

@chown's answer is useful (complementary) too! – phlip Nov 19 '11 at 21:52
r'pattern' doesn't cut it. Same error. The patterns work, but even if they didn't, it would return [], not an error. – k3rb3r05 Nov 19 '11 at 22:29
I never claimed that raw strings would solve all your problems. But I do claim that the backslash stuff won't work the way you expect unless you use raw strings. You either need to double every backslash, or use raw strings. – steveha Nov 20 '11 at 01:56
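
To illustrate the raw-string point in general terms, here is a small sketch (Python 3; the sample text is made up). It uses '\b' because that is an escape Python does recognise, so the difference is visible. '\d' is not a recognised string escape, so its backslash survives even without the r prefix, which would explain why the question's patterns still match. Recent Python 3 releases warn about such unrecognised escapes.

```python
import re

plain = '\bcat\b'    # normal string: '\b' becomes a backspace character (0x08)
raw = r'\bcat\b'     # raw string: backslash and 'b' reach the regex engine

text = 'a cat sat'
plain_hits = re.findall(plain, text)   # searches for literal backspaces
raw_hits = re.findall(raw, text)       # searches for 'cat' as a whole word

print(len(plain), len(raw))
print(plain_hits, raw_hits)
```

Only the raw-string pattern finds the word, since the plain string no longer contains a word-boundary escape at all.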

chown · Answer 2 · 2011-11-20T00:26:17.823

1

The .findNext() method returns a BeautifulSoup.Tag object, which cannot be passed to re.findall. You therefore need to get a string out of the Tag first: .getText() returns only the text inside the tag, while .contents gives you the HTML nodes inside it. Also, re.compile returns a pattern object, and you can call findall directly on that object.

This:

soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%') #this is the table I need to work with

patAttackDate = re.compile('<td align="center">(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)')
searchAttackDate = re.findall(patAttackDate, table3)

Should be written like this (the last line is the only thing that needs changing):

soup = BeautifulSoup(stripped)

table = soup.find('table', width='90%')
table2 = table.findNext('table', width='90%')
table3 = table2.findNext('table', width='90%')

patAttackDate = re.compile('<td align="center">(\d+/\d+/\d+)<br />(\d+:\d+:\d+ \w+)')
searchAttackDate = patAttackDate.findall(table3.getText())

# or, to search the html inside table3 and not just the text
# searchAttackDate = patAttackDate.findall(str(table3.contents[0]))

BeautifulSoup Documentation

From the re docs:

re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object.

This:
result = re.match(pattern, string)

is equivalent to:
prog = re.compile(pattern)
result = prog.match(string)

edited Nov 20 '11 at 00:26

answered Nov 19 '11 at 21:35

chown

searchAttackDate = re.findall(patAttackDate, htmlA1), works fine... I have posted 2 different attempts at solving a problem and each one uses different variables. Looks like I need make the code separation more visible. sorry for the mixup – k3rb3r05 Nov 19 '11 at 22:46
@k3rb3r05 Oh, ok. But, in the bottom part of the question, the error: `return _compile(pattern, flags).findall(string) TypeError: expected string or buffer` is because you need to call `findall` like this: `patAttackDate.findall(table3)`. – chown Nov 19 '11 at 22:49
Same error. I am guessing it's the _table3_ (because it's a beautifulsoup by-product)that throws it out of whack. If I use htmlA1, it works fine but then I won't have HTML stripped from _newline_ – k3rb3r05 Nov 19 '11 at 23:03
@k3rb3r05 Hmm, if it still gives the same `TypeError` saying it is expecting a `string or buffer`, then that means `table3` is not a string. Might it be `None`? You can put something like `print type(table3)` right above that call to make sure. – chown Nov 19 '11 at 23:07
didn't think of that... gives me: – k3rb3r05 Nov 19 '11 at 23:13
@k3rb3r05 Ah! Then do `patAttackDate.findall(table3.getText())`. – chown Nov 19 '11 at 23:26
Not sure it works. Sure, it doesn't spit an error but it returns an empty string. Take a look at **EDIT 3** at my original post. – k3rb3r05 Nov 20 '11 at 00:18
@k3rb3r05 Sorry, I gave you the wrong method by accident. In your edit3, change `patTest.findall(test2.getText())` to `patTest.findall(str(test2.contents[0]))`. – chown Nov 20 '11 at 00:33
Your solution didn't work as intended. Broke the regex. But you pointed me in the right direction (hence the up-vote). So I converted table3 to a string (_strTable3 = str(table3)_) prior to applying regex on it. (edit 4 on the original post) – k3rb3r05 Nov 20 '11 at 13:53
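
The three outcomes discussed in this comment thread (the TypeError, the empty list, and the working str() conversion) can be reproduced without BeautifulSoup. In the sketch below, FakeTag is a hypothetical stand-in class, not the BeautifulSoup API: it only mimics the two behaviours that matter here, namely that str() yields the markup and getText() yields the text with the tags stripped. Written for Python 3.

```python
import re


class FakeTag:
    """Hypothetical stand-in for a parsed tag object; not the BeautifulSoup API."""

    def __init__(self, markup, text):
        self._markup = markup
        self._text = text

    def __str__(self):
        return self._markup      # full markup, tags included

    def getText(self):
        return self._text        # text only, tags stripped


tag = FakeTag("<table><tr><td>date</td></tr></table>", "date")
pat = re.compile(r"<td>(.*)</td>")

try:
    pat.findall(tag)             # not a string, so re refuses it
    raised = False
except TypeError:
    raised = True

from_text = pat.findall(tag.getText())   # no tags left for the pattern to match
from_markup = pat.findall(str(tag))      # markup is intact, so the pattern matches

print(raised, from_text, from_markup)
```

This mirrors EDIT 3 and EDIT 4 of the question: a pattern that contains tags needs the markup, so it has to run against str(tag) rather than the extracted text.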

score 1 · Accepted Answer · answered Nov 20 '11 at 05:54

This works for me:

reAttack = r'<td\s+align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+\s+\w+)</td>\s*<td.*?<font\s+color=#006633>(\w+)</font></a>\s+hospitalized\s+'

for m in re.finditer(reAttack, htmlA1):
  print 'date: %s; time: %s; player: %s' % (m.group(1), m.group(2), m.group(3))

live demo

Doing everything in one regex makes for a messier regex, but it's a lot easier than matching each TD separately and trying to sync them up afterward, as you're doing. The .*? near the middle of the regex works on the assumption that all the elements are separated by newlines, as in your original examples. If you can't assume that, you should replace the .*? with (?:(?!/?td>).)* to contain the match within the current TD element.
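
To make the difference between the two variants concrete, here is a sketch that runs both on a made-up sample with all newlines removed (Python 3; the rows and names are hypothetical). With no newlines to stop it, the lazy .*? can run from an "attacked" row into the next "hospitalized" row, while the (?:(?!/?td>).)* form cannot leave the current TD.

```python
import re

# Hypothetical sample on a single line: hospitalized, attacked, hospitalized.
html = (
    "<tr><td align=center>19/11/11<br>7:33:00 PM</td>"
    "<td align=center><font color=#006633><a href=profiles.php?XID=1>"
    "<font color=#006633>user1</font></a> hospitalized "
    "<a href=profiles.php?XID=9><font color=#006633>victim</font></a></font></td></tr>"
    "<tr><td align=center>19/11/11<br>7:32:38 PM</td>"
    "<td align=center><font color=#006633><a href=profiles.php?XID=3>"
    "<font color=#006633>user3</font></a> attacked "
    "<a href=profiles.php?XID=1><font color=#006633>user1</font></a> and lost </font></td></tr>"
    "<tr><td align=center>19/11/11<br>7:32:22 PM</td>"
    "<td align=center><font color=#006633><a href=profiles.php?XID=2>"
    "<font color=#006633>user2</font></a> hospitalized "
    "<a href=profiles.php?XID=9><font color=#006633>victim</font></a></font></td></tr>"
)

head = r"<td\s+align=center>(\d+/\d+/\d+)<br>(\d+:\d+:\d+\s+\w+)</td>\s*<td"
tail = r"<font\s+color=#006633>(\w+)</font></a>\s+hospitalized\s+"

loose = re.findall(head + r".*?" + tail, html)
# The tempered dot stops before any "td>" or "/td>", so it stays inside one cell.
tempered = re.findall(head + r"(?:(?!/?td>).)*" + tail, html)

print(loose)
print(tempered)
```

On this input the loose pattern attaches user2 to the timestamp of the "attacked" row, while the tempered pattern reports the timestamp of user2's own row.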

FYI, there were some inconsistencies in your sample data. Some attribute values were quoted while most were not, and you had a mix of <br> and <br /> tags. I normalized everything for my demo, but if that's representative of your real data, you'll need a much more complicated regex. Or you could switch to a pure DOM solution, which probably would have been easier in the first place. ;)
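
As a rough idea of what a parser-based alternative could look like, here is a sketch using the standard library's html.parser instead of regex (Python 3). This swaps in an event-driven parser rather than BeautifulSoup so that it has no third-party dependency. CellCollector, the sample rows, and the "exactly three words, not starting with Someone" filter are all assumptions made for illustration. The point is that quoted and unquoted attributes, and br versus br /, no longer matter.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, treating <br> as a space."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buf = None          # None while outside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buf = []
        elif tag == "br" and self._buf is not None:
            self._buf.append(" ")

    def handle_endtag(self, tag):
        if tag == "td" and self._buf is not None:
            self.cells.append("".join(self._buf).strip())
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


# Hypothetical sample mixing the quoted and unquoted styles from the question.
html = """
<tr><td align=center>19/11/11<br>12:01:21 AM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=1><font color=#006633>player1</font></a> hospitalized <a href=profiles.php?XID=2><font color=#006633>player2</font></a></font></td></tr>
<tr><td align="center">19/11/11<br />12:58:03 AM</td>
<td align="center"><font color="#AA0000">Someone hospitalized <a href=profiles.php?XID=1><font color="#AA0000">player1</font></a></font></td></tr>
<tr><td align="center">19/11/11<br />12:59:10 AM</td>
<td align=center><font color=#006633><a href=profiles.php?XID=3><font color=#006633>player3</font></a> attacked <a href=profiles.php?XID=1><font color=#006633>player1</font></a> and lost </font></td></tr>
"""

parser = CellCollector()
parser.feed(html)
parser.close()

hits = []
# Cells alternate: timestamp cell, then event cell.
for stamp, event in zip(parser.cells[0::2], parser.cells[1::2]):
    words = event.split()
    if len(words) == 3 and words[1] == "hospitalized" and words[0] != "Someone":
        date, time = stamp.split(" ", 1)
        hits.append((date, time, words[0]))

print(hits)
```

Because each timestamp is read from the same row as its event, there is nothing to re-synchronise afterward.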

A much simpler solution. Thanks. BTW, the quoted data and the different tags are result of using _BeautifulSoup_ in the second attempt. — k3rb3r05, Nov 20 '11 at 14:24
Answers the bonus question as well. I was under the impression that parsed data have to be in sequence (ie i couldn't get -date, time, user- without including userID in between) — k3rb3r05, Nov 20 '11 at 14:42

score 0 · Answer 4 · answered Nov 19 '11 at 21:49

For the BeautifulSoup solution you can use this (I haven't checked the regex; also, I'm sure @steveha is right about adding r''):

searchAttackDate = table3.findAll(patAttackDate)
for row in searchAttackDate:
   print row

alonisser

I am getting the following error: _searchAttackDate = table3.findall(patAttackDate) TypeError: 'NoneType' object is not callable_ – k3rb3r05 Nov 19 '11 at 23:00
