
I have, in Python:

links = re.match(r'''<A HREF="(\w+?\.htm)#\w*?">''', workbench)

'workbench' is a file read into memory with line separators replaced by spaces; one such file is at: http://pastebin.com/a0LHKXcS

There are some links that don't interest me; they all have a lowercase 'a' or 'href'. As far as I can tell, when matched against the file in the pastebin, I should be getting a lot of matches. But so far re.match() returns None rather than a populated MatchObject I can pull data from. I also tried on the command line, cutting the regular expression down to be more tolerant of differences, and even a match for just HREF didn't find anything.

How can I adjust the regular expression (or other factors) so the call gets a populated MatchObject?

Thanks

Christos Hayward

2 Answers


re.match only tries to match at the start of the string. Use re.search instead.
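
To illustrate the difference, here is a minimal sketch using the pattern from the question. The `workbench` string below is a small made-up stand-in for the real file (not the pastebin content), and `re.findall` is included on the assumption that all matching links are wanted rather than only the first.

```python
import re

# Hypothetical stand-in for the question's 'workbench' string;
# the real file is much longer.
workbench = (
    '<html><body> <a href="index.htm">Home</a> '
    '<A HREF="npnf214.htm#P5_18">Title</A> '
    '<A HREF="npnf2140.htm#P6_28">Preface</A> </body></html>'
)

# Same pattern as in the question, compiled once for reuse.
pattern = re.compile(r'<A HREF="(\w+?\.htm)#\w*?">')

# match() only succeeds if the pattern matches at position 0.
at_start = pattern.match(workbench)

# search() scans forward and returns the first match anywhere.
first = pattern.search(workbench)

# findall() returns the captured group for every match.
all_files = pattern.findall(workbench)

print(at_start)        # None, because the string starts with '<html>'
print(first.group(1))  # npnf214.htm
print(all_files)       # ['npnf214.htm', 'npnf2140.htm']
```

Because the pattern is case-sensitive, the lowercase `<a href=...>` link is skipped, which is the filtering the question asks for.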

Apart from that, lazyr is right: even though this regular expression happens to find the hits you want in this instance, you are in general much better off relying on an HTML parser such as BeautifulSoup.
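
If installing a third-party parser is not an option, the same idea can be sketched with the standard library's `html.parser` (Python 3). This is only an illustration under assumptions: the class name `FragmentLinkCollector` and the `sample` string are made up here, and because `HTMLParser` lowercases tag and attribute names, the sketch selects links by the presence of a `#` fragment rather than by the case of `A HREF`.

```python
from html.parser import HTMLParser


class FragmentLinkCollector(HTMLParser):
    """Collect href values of <a> tags that contain a '#' fragment."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and "#" in value:
                self.hrefs.append(value)


# Hypothetical sample input, not the pastebin file.
sample = (
    '<a href="index.htm">Home</a> '
    '<A HREF="npnf214.htm#P5_18">Title</A> '
    '<A HREF="npnf2140.htm#P6_28">Preface</A>'
)

collector = FragmentLinkCollector()
collector.feed(sample)
print(collector.hrefs)  # ['npnf214.htm#P5_18', 'npnf2140.htm#P6_28']
```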

Konrad Rudolph

Use BeautifulSoup.

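>>> import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup.BeautifulSoup(workbench)  # parse the HTML string from the question
>>> aa = soup.findAll("a", href=re.compile(r".*#.*"))  # anchors whose href contains '#'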
>>> for a in aa:
...   print a["href"]
... 
npnf214.htm#P5_18
npnf2140.htm#P6_28
npnf2141.htm#P30_306
npnf2142.htm#P257_10476
npnf2143.htm#P273_20869
npnf2144.htm#P322_41638
npnf2145.htm#P424_60362
npnf2146.htm#P453_82389
npnf2147.htm#P506_110748
npnf2148.htm#P514_110857
npnf2149.htm#P522_112870
npnf2110.htm#P538_115696
npnf2111.htm#P553_120011
npnf2112.htm#P561_131414
npnf2113.htm#P593_136014
npnf2114.htm#P681_155628
npnf2115.htm#P719_167167
npnf2116.htm#P743_173304
npnf2117.htm#P768_186497
npnf2118.htm#P839_201234
npnf2119.htm#P891_222702
npnf2120.htm#P941_235400
npnf2121.htm#P993_248248
npnf2122.htm#P1057_267070
npnf2123.htm#P1085_275404
npnf2124.htm#P1111_287892
npnf2125.htm#P1370_306192
>>> 
gsbabil