
I want to capture text from the page linked below and save it: http://forecast.weather.gov/product.php?site=NWS&issuedby=FWD&product=RR5&format=CI&version=44&glossary=0

I only need the text after .A, not the rest of the page. The page also has 50 different links at the top, and I want to get the data from all of them.

I have written the code below, but it returns nothing. How can I get specifically the part that I need?

import urllib
import re
htmlfile=urllib.urlopen("http://forecast.weather.gov/product.php?site=NWS&issuedby=FWD&product=RR5&format=CI&version=1&glossary=0")
htmltext=htmlfile.read()
regex='<pre class="glossaryProduct">(.+?)</pre>'
pattern=re.compile(regex)
out=re.findall(pattern, htmltext)
print (out)

I also tried the following, which returns the entire content of the page:

import urllib
file1 = urllib.urlopen('http://forecast.weather.gov/product.php?site=NWS&issuedby=FWD&product=RR5&format=txt&version=1&glossary=0')
s1 = file1.read()
print(s1)

Can you help me do this?

Brian Tompsett - 汤莱恩
Behi
  • Heed one of the commandments of modern programming: Do not regex [x/html content](http://stackoverflow.com/a/1732454/1422451) – Parfait Feb 27 '17 at 19:07

1 Answer

Your regex captures nothing because the content inside the <pre> tag starts with a newline, and by default . does not match newline characters. Passing the re.S (DOTALL) flag makes . match newlines as well, so change your compile line to:

pattern = re.compile(regex, re.S)

With that change, it should work.
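To make the difference concrete, here is a minimal sketch that runs the same pattern with and without re.S. The `sample_html` string is a made-up stand-in for the downloaded page, not real data from the site, and the final filtering step assumes the wanted records are lines that begin with .A. Note that in Python 3 the result of `read()` is bytes, so it would need to be decoded to `str` before a `str` pattern like this one is applied.

```python
import re

# Hypothetical stand-in for the downloaded page; the real content will differ.
sample_html = (
    '<html><body>\n'
    '<pre class="glossaryProduct">\n'
    'HEADER LINE (placeholder)\n'
    '.A STATION1 placeholder data\n'
    '.A STATION2 placeholder data\n'
    '</pre>\n'
    '</body></html>'
)

regex = r'<pre class="glossaryProduct">(.+?)</pre>'

# Without re.S, "." cannot match the newline right after the opening tag,
# so the pattern fails and findall returns an empty list.
without_flag = re.findall(regex, sample_html)

# With re.S (DOTALL), "." also matches newlines and the whole block is captured.
with_flag = re.findall(regex, sample_html, re.S)

# Assumption: the records of interest are the lines that start with ".A".
a_lines = [
    line
    for block in with_flag
    for line in block.splitlines()
    if line.startswith('.A')
]

print(without_flag)
print(a_lines)
```

The `a_lines` list can then be written to a file, one entry per line.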

Also you may want to look at:

https://regex101.com

It shows you exactly what your regex is doing. When I enabled the S flag in the flags box on the right side, the pattern started matching as it should:

Image of regex working with the S flag
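For the other part of the question (collecting all 50 pages), one possible approach is to generate the URLs by varying the version parameter. This is only a sketch: it assumes the 50 links correspond to version=1 through version=50, which the question does not confirm, and it assumes Python 3, where `urlopen` lives in `urllib.request`. The helper names `version_url` and `fetch` are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://forecast.weather.gov/product.php"

def version_url(version):
    """Build the product URL for one version (hypothetical helper)."""
    params = [
        ("site", "NWS"),
        ("issuedby", "FWD"),
        ("product", "RR5"),
        ("format", "CI"),
        ("version", version),
        ("glossary", 0),
    ]
    return BASE + "?" + urlencode(params)

def fetch(url):
    """Download one page and return it as text (not called in this sketch)."""
    with urlopen(url) as response:
        # read() returns bytes in Python 3, so decode before using str regexes.
        return response.read().decode("utf-8", errors="replace")

# Assumption: versions run from 1 to 50, matching the 50 links on the page.
urls = [version_url(v) for v in range(1, 51)]
print(urls[0])
```

Each page returned by `fetch(url)` could then be passed through the same re.S pattern to pull out the <pre> block.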

Andrei T