
I am scraping a website and would like to get the content inside a specific tag. The tag I'd like to get the content inside is: <pre class="js-tab-content"></pre>

Here is my code:

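import re
import urllib.request

# url holds the address of the page being scraped (defined elsewhere)
request = urllib.request.Request(url=url)
response = urllib.request.urlopen(request)
content = response.read().decode()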

tab = re.findall(r'<pre class="js-tab-content">(.*?)</pre>', content)

print(tab)

When I print tab, I get an empty list: []

Here is the content I am searching in:

.... <pre class="js-tab-content"><i></i><span>Em</span>              <span>D</span>              <span>Em</span>             <span>D</span>

Lift Mac Cahir Og your face, brooding o'er the old disgrace 

     <span>Em</span>                  <span>D</span>                       <span>G</span>-<span>D</span>-<span>Em</span>     

That black Fitzwilliam stormed your place and drove you to the Fern.

<span>Em</span>              <span>D</span>           <span>Em</span>                         <span>D</span>

Gray said victory was sure, soon the firebrand he'd secure

<span>Em</span>                <span>D</span>          <span>G</span>-<span>D</span>-<span>Em</span>

Until he met at Glenmalure, Feach Mac Hugh O'Byrne 



Chorus:

<span>G</span>                                <span>D</span>

Curse and swear, Lord Kildare, Feach will do what Feach will dare

<span>G</span>                               <span>G</span>-<span>D</span>-<span>Em</span>

Now Fitzwilliam have a care, fallen is your star low

<span>G</span>                                       <span>D</span> 

Up with halbert, out with sword, on we go for by the Lord

<span>G</span>                               <span>G</span>-<span>D</span>-<span>Em</span>

Feach Mac Hugh has given his word: Follow me up to Carlow 



From Tassagart ____to Clonmore flows a stream of Saxon Gore

Great is Rory Og O'More at sending loons to Hades.

White is sick and Lane is fled, now for black Fitzwilliams head

We'll send it over, dripping red, to Liza and her ladies



See the swords of Glen Imayle flashing o'er the English Pale

See all the children of the Gael, beneath O'Byrne's banners

Rooster of the fighting stock, would you let an Saxon cock

Crow out upon an Irish rock, fly up and teach him manners

</pre> ....

I don't see why this returns an empty list instead of a list containing one string with the tag's content.

I have searched online for about half an hour and couldn't find anything that helped.

Sorry if this is a stupid question with an obvious answer!

Anyway, thanks in advance!

David
  • Don't use regular expressions to parse HTML. See here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454 – bgporter Feb 06 '16 at 15:23
  • OK, there are 2 obvious things here: 1) parsing HTML with regex is a bad idea, 2) `.` does not match a newline in Python regex by default (add `flags=re.S`). A less obvious thing: a lazy dot-matching pattern is known to slow down your app when matching huge chunks of text, so I'd recommend using BeautifulSoup or any other HTML parsing library for Python. – Wiktor Stribiżew Feb 06 '16 at 15:28
  • And... that fixed my problem. Wow, I didn't realize that! I think I see why regex is bad with HTML. I also know that there won't be other attributes in the tag or anything like that, and the tags inside don't matter. – David Feb 06 '16 at 15:30
  • HTML is not a [regular language](https://en.m.wikipedia.org/wiki/Regular_language); that is really what it comes down to. – OneCricketeer Feb 06 '16 at 15:35

2 Answers

5

Okay, to add to the comments, here is how you can use BeautifulSoup's HTML parser to extract the text of the pre element in this case:

from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "html.parser")
print(soup.find("pre", class_="js-tab-content").get_text())
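
If installing bs4 is not an option, roughly the same extraction can be sketched with the standard library's html.parser module. This is a swapped-in technique rather than part of the answer above: the class name PreTextExtractor and the sample string are made up for illustration, and the sample stands in for the downloaded content.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collects the text inside <pre class="js-tab-content">, dropping inner tags."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while between the target <pre> and its </pre>
        self.chunks = []     # text fragments found inside the target element

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "pre" and ("class", "js-tab-content") in attrs:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


# Hypothetical sample standing in for the downloaded `content`.
sample = (
    '<div><pre class="js-tab-content"><i></i><span>Em</span>  <span>D</span>\n'
    "Lift Mac Cahir Og your face\n"
    "</pre></div>"
)

extractor = PreTextExtractor()
extractor.feed(sample)
extractor.close()
pre_text = "".join(extractor.chunks)
print(pre_text)
```

Like get_text(), this keeps only the text and discards the span tags, so it fits when the chord markup itself is not needed.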
alecxe
  • Thank you for your help. I decided to use xfx's answer, as I will also be using re on text that doesn't contain HTML in the program. Thanks anyway! – David Feb 06 '16 at 15:37
2
tab = re.findall(r'<pre class="js-tab-content">(.*?)</pre>', content, re.S)

re.S is required for . to match newline characters.
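
A minimal sketch of the difference, using a short made-up string in place of the downloaded page (the variable names here are only for illustration):

```python
import re

# Hypothetical multi-line sample standing in for the downloaded page.
sample = '<pre class="js-tab-content"><span>Em</span>\nLift Mac Cahir Og your face\n</pre>'
pattern = r'<pre class="js-tab-content">(.*?)</pre>'

# Without a flag, "." stops at "\n", so the closing tag is never reached.
without_flag = re.findall(pattern, sample)

# re.S (an alias of re.DOTALL) lets "." match line breaks too.
with_flag = re.findall(pattern, sample, re.S)

print(without_flag)
print(with_flag)
```

The first call returns an empty list because the content spans several lines. The second returns one string holding everything between the tags, inner span tags included.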

xfx
  • Still, a bad idea to use `.*?`. You should unroll it, and you would not even need `re.S`. – Wiktor Stribiżew Feb 06 '16 at 15:49
  • @AndreaCorbellini, fixed. It's a force of habit: I prefer to use re.M to indicate that the purpose is matching across multiple lines. It's not necessary here. – xfx Feb 06 '16 at 15:58
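
One way the "unrolled" pattern mentioned in the comments might look is sketched below. It is an assumption about what the commenter had in mind, not their exact pattern. The lazy `.*?` is replaced by runs of non-`<` characters plus any `<` that does not start `</pre>`. Because `[^<]` already matches newlines, no re.S flag is needed. The sample string is made up.

```python
import re

# Hypothetical multi-line sample standing in for the downloaded page.
sample = '<pre class="js-tab-content"><span>Em</span>\nLift Mac Cahir Og your face\n</pre>'

# Unrolled form: [^<]* consumes plain text (newlines included), and
# (?:<(?!/pre>)[^<]*)* consumes each "<" that is not the start of "</pre>".
unrolled = re.compile(
    r'<pre class="js-tab-content">([^<]*(?:<(?!/pre>)[^<]*)*)</pre>'
)

tab = unrolled.findall(sample)
print(tab)
```

On this sample it captures the same text as the `.*?` version with re.S, while advancing in larger steps than a lazy dot does.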