I'm just trying to get some data from a webpage like this one:

[ . . . ]

<p class="special-large">Lorem Ipsum 01</p>
<p class="special-large">Lorem Ipsum 02</p>
<p class="special-large">Lorem Ipsum 03</p>
<p class="special-large">Lorem Ipsum 04</p>
<p class="special-large">Lorem Ipsum 05</p>

[ . . . ]

I would like to end up with a Python list like this one:

myArrayWebPage = ["Lorem Ipsum 01","Lorem Ipsum 02","Lorem Ipsum 03","Lorem Ipsum 04","Lorem Ipsum 05"]

This is my Python script:

import urllib.request

urlAddress = "http:// ... /" # my url address
getPage = urllib.request.urlopen(urlAddress)
outputPage = getPage.read()
print(outputPage)

How can I get the array from "outputPage"?

Joe Hunter

1 Answer

This appears to do what you want:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> html = '''<p class="special-large">Lorem Ipsum 01</p>
<p class="special-large">Lorem Ipsum 02</p>
<p class="special-large">Lorem Ipsum 03</p>
<p class="special-large">Lorem Ipsum 04</p>
<p class="special-large">Lorem Ipsum 05</p>'''
>>> import re
>>> re.findall('<p class="special-large">([^<]+)</p>', html)
['Lorem Ipsum 01', 'Lorem Ipsum 02', 'Lorem Ipsum 03', 'Lorem Ipsum 04', 'Lorem Ipsum 05']
>>> 
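One detail the session above skips: in Python 3, `urlopen(...).read()` returns `bytes`, not `str`, so `outputPage` from the question has to be decoded before a string pattern can be applied to it. Here is a minimal sketch of that step. The helper name `extract_paragraphs` and the UTF-8 default are assumptions for illustration; the page's real encoding may differ.

```python
import re

# Same pattern as in the interactive session above.
PATTERN = re.compile(r'<p class="special-large">([^<]+)</p>')


def extract_paragraphs(raw_page, encoding="utf-8"):
    """Decode the bytes returned by read() and collect the paragraph texts.

    The utf-8 default is an assumption; use the encoding the page declares.
    """
    text = raw_page.decode(encoding)
    return PATTERN.findall(text)


# Stand-in for outputPage: read() would hand back bytes shaped like this.
sample = (
    b'<p class="special-large">Lorem Ipsum 01</p>\n'
    b'<p class="special-large">Lorem Ipsum 02</p>'
)
myArrayWebPage = extract_paragraphs(sample)
print(myArrayWebPage)
```

In the script from the question, this would be called as `extract_paragraphs(outputPage)`.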

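Please note that regular expressions are generally a poor fit for parsing HTML: the pattern above stops matching as soon as the tag gains another attribute or the paragraph contains nested markup. You should use a library like Beautiful Soup instead.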
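To show what a parser-based approach could look like without installing anything, here is a rough sketch using the standard library's `html.parser` as a stand-in for Beautiful Soup. The class name `SpecialLargeParser` is made up for this example, and the sketch assumes the target paragraphs are not nested inside one another.

```python
from html.parser import HTMLParser


class SpecialLargeParser(HTMLParser):
    """Collect the text of every <p> whose class list contains 'special-large'."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and "special-large" in classes:
            self._inside = True
            self.items.append("")

    def handle_data(self, data):
        # Text can arrive in several chunks, so accumulate it.
        if self._inside:
            self.items[-1] += data

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside = False


html = """<p class="special-large">Lorem Ipsum 01</p>
<p class="special-large">Lorem Ipsum 02</p>
<p class="special-large">Lorem Ipsum 03</p>
<p class="special-large">Lorem Ipsum 04</p>
<p class="special-large">Lorem Ipsum 05</p>"""

parser = SpecialLargeParser()
parser.feed(html)
parser.close()
print(parser.items)
```

Unlike the regex, this keeps working if the tag carries extra attributes or additional class names. With Beautiful Soup installed, the equivalent lookup would be along the lines of `soup.find_all("p", class_="special-large")`.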

Noctis Skytower
  • Thank you! Can I ask what you mean by "regular expressions"? – Joe Hunter Feb 24 '17 at 14:22
  • You can click on the term now, and a Wikipedia article will show up. Next time, try searching Google for a term you are not familiar with. – Noctis Skytower Feb 24 '17 at 14:36
  • @JoeHunter Please take this opportunity to read the wildly entertaining answers on why regexes are insufficient to parse HTML: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Marcus Vinícius Monteiro Feb 24 '17 at 14:52