1

I used the following to fix the problems (for the remaining issues, I will rework my code). Sorry for the improper code formatting in my initial post.

import csv, re, mechanize  

htmlML = br.response().read() 

# escaping the ? (and the .) fixed the regex match; \d+ matches the numeric ID
patMemberName = re.compile(r'<a href=/foo\.php\?XID=(\d+) ><font color=#000000><b>(.*) </b>')
searchMemberName = re.findall(patMemberName,htmlML)

MembersCsv = 'path-to-csv' 
MemberWriter = csv.writer(open(MembersCsv, 'wb')) #adding b fixed the \n in csv

for i in searchMemberName:
    MemberWriter.writerow(i)
    print (i)
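
For anyone on Python 3, here is a minimal, self-contained sketch of the same flow. The sample HTML, the member names, and the temporary file path are made up for illustration (they stand in for `br.response().read()` and the real CSV path), and `open(..., 'w', newline='')` is the Python 3 counterpart of the `'wb'` fix above.

```python
import csv
import os
import re
import tempfile

# Hypothetical sample standing in for br.response().read()
html = (
    '<a href=/foo.php?XID=123 ><font color=#000000><b>alice </b>\n'
    '<a href=/foo.php?XID=456 ><font color=#000000><b>bob </b>\n'
)

# Raw string; the literal '.' and '?' are escaped, and (.*?) is non-greedy
pat = re.compile(r'<a href=/foo\.php\?XID=(\d+) ><font color=#000000><b>(.*?) </b>')

# With two capture groups, findall returns a list of (id, name) tuples
rows = pat.findall(html)

# Hypothetical output location, used only for this sketch
path = os.path.join(tempfile.mkdtemp(), 'members.csv')

# newline='' lets the csv module control line endings itself
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# Read the file back as bytes to inspect the exact line endings
with open(path, 'rb') as f:
    raw = f.read()
```

Because each match is already a tuple of two items, `writerow(i)` writes one ID column and one name column per row.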

Thank you for your time

k3rb3r05
  • 1
    In regex, a question mark means "zero or one of the previous character class" and a period means "any character". So, when you say .php?, you're not really looking for .php?. Try [.]php[?] – aleph_null Nov 15 '11 at 04:00
  • If you are using Python2 change `csv.writer(open(MembersCsv, 'w'))` to `csv.writer(open(MembersCsv, 'wb'))` (because Python2 wants csv files opened in binary). Regardless of Python version change `MemberWriter.writerow(i)` to `MemberWriter.writerow([i])` (because `writerow` wants a row of items--currently it's interpreting each character as an item). Finally, do you really need csv if you are going to have only one item per row? – Steven Rumbalski Nov 15 '11 at 04:23
  • 1
    Don't parse html with regexes. See http://stackoverflow.com/questions/2861/options-for-html-scraping. – Steven Rumbalski Nov 15 '11 at 04:27
  • 1
    Also, this question would have been better if it were split into two questions. – Steven Rumbalski Nov 15 '11 at 04:30
  • @aleph_null: thank you for pointing out the obvious... had to escape '?' and '='... will update question (with a new post) – k3rb3r05 Nov 15 '11 at 04:36
  • @Steven Rumbalski: It crossed my mind, but since the problem seemed interlinked, thought to keep it in one post. Will keep it in mind for further inquiries. – k3rb3r05 Nov 15 '11 at 04:40
  • also, MemberWriter.writerow([i]) spits out a "TypeError: 'builtin_function_or_method' object is not subscriptable". 'wb' or just 'w' doesn't seem to make a difference on the output – k3rb3r05 Nov 15 '11 at 04:50
  • don't escape the '=', but do escape the '.' – aleph_null Nov 15 '11 at 05:02
  • @k3rb3r05 If you are getting a `TypeError`, you didn't type the parentheses (`writerow[i]` vs `writerow([i])`). `'wb'` really does make a difference on Python 2.7 (see http://stackoverflow.com/questions/3191528/csv-in-python-adding-extra-carriage-return). If you still have extra carriage returns do `writerow([i.rstrip()])`. `rstrip` will get rid of trailing newlines in your data. – Steven Rumbalski Nov 15 '11 at 05:05
  • 1
    @k3rb3r05, StackOverflow has very good formatting capabilities; please [learn to use them](http://stackoverflow.com/editing-help). I tried to reformat the question myself, but I can't even tell what you were trying for. – Alan Moore Nov 15 '11 at 05:18
  • @StevenRumbalski, you are right. 'wb' does make a difference. I don't get the \n. – k3rb3r05 Nov 15 '11 at 06:50
  • @AlanMoore, I used tags. also tried backticks (as per instructions)... using tags stopped the formatter from asking me to "press ctrl+K" – k3rb3r05 Nov 15 '11 at 07:14
  • Did you try pressing ctrl+K as it asked? Or simply indenting the code blocks four spaces (which is what ctrl-K does)? Seriously, SO's formatting easily beats any other site's that I've seen, and it's really easy to use. Your `` tags just aren't cutting it. – Alan Moore Nov 15 '11 at 07:43
  • The ctrl+k happened when i was editing it, but didn't copy the whole code in there, just 1 line, so when I pressed 'apply' it still popped the error. So went back to tags. Only recently noticed the difference between my post and the answers I received, so went looking... Anyway, it should be fixed now. – k3rb3r05 Nov 15 '11 at 07:49
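
To illustrate the `writerow(i)` vs `writerow([i])` point from the comments, here is a small sketch (Python 3, writing to an in-memory buffer instead of a file; the value `'abc'` is a made-up example). It assumes each match is a single string, which is the case when the pattern has only one capture group.

```python
import csv
import io

name = 'abc'  # hypothetical single-string match

# Passing a bare string: csv treats it as a sequence of characters
buf_chars = io.StringIO()
csv.writer(buf_chars).writerow(name)
as_chars = buf_chars.getvalue()

# Wrapping it in a list: the whole string becomes one field
buf_field = io.StringIO()
csv.writer(buf_field).writerow([name])
as_field = buf_field.getvalue()

print(repr(as_chars))  # 'a,b,c\r\n'
print(repr(as_field))  # 'abc\r\n'
```

When the pattern has two or more groups, `re.findall` already returns tuples, so each tuple can go to `writerow` directly.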

2 Answers

0

Unfortunately, I can't find the proper escape sequence for Python right now. Generally, you would wrap an expression with meta-characters that should not be interpreted in "\Q...\E".

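Try wrapping the literal parts of your string in re.escape(string). Keep the (.*) group outside the call, otherwise its parentheses get escaped too and it no longer captures anything. So:

re.compile(re.escape('<font color=#000000><b>') + '(.*)' + re.escape('</b>'))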
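
As a rough illustration of how `re.escape` behaves here (the sample HTML string is made up): it escapes every metacharacter it is given, including the parentheses of a capture group, so the sketch below compares escaping the whole pattern with escaping only the literal fragments.

```python
import re

html = '<font color=#000000><b>user</b>'  # hypothetical sample input

# Escaping the whole pattern turns "(.*)" into literal text
whole = re.compile(re.escape('<font color=#000000><b>(.*)</b>'))

# Escaping only the literal fragments keeps (.*) as a capture group
parts = re.compile(
    re.escape('<font color=#000000><b>') + '(.*)' + re.escape('</b>')
)

whole_matches = whole.findall(html)
part_matches = parts.findall(html)

print(whole_matches)  # []
print(part_matches)   # ['user']
```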
fandingo
-1

For question 1), you have to escape the ? in the pattern.

import re

htmlML = '<a href=/foo.php?XID=123 ><font color=#000000><b>user</b>'
# raw string, with the literal . and ? escaped
patMemberID = re.compile(r'<a href=/foo\.php\?XID=(\d*) ><font color=#000000><b>user</b>')

searchMemberID = re.findall(patMemberID, htmlML)
print(len(searchMemberID))

for i in searchMemberID:
    print (i)

The 123 can then be extracted from the string.

Question 2a)

You can use (.*?) in place of a fixed string to match arbitrary text; the ? means a non-greedy match.
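
A small sketch of the difference between greedy and non-greedy matching, using a made-up input with two bold names on one line:

```python
import re

html = '<b>alice</b><b>bob</b>'  # hypothetical sample input

# Greedy: .* runs to the last </b> on the line
greedy = re.findall(r'<b>(.*)</b>', html)

# Non-greedy: .*? stops at the first </b> it can
lazy = re.findall(r'<b>(.*?)</b>', html)

print(greedy)  # ['alice</b><b>bob']
print(lazy)    # ['alice', 'bob']
```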

hzm