I am trying to capture the value of an iframe's src attribute so that I can change it. I don't have direct access to the HTML; I get it from an API.

You can see some iframe examples below:

<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">
<iframe allowfullscreen="" frameborder="0" height="276" mozallowfullscreen="" scrolling="no" src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/%20f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490"></iframe>

There are many other kinds of iframes in the page; the only thing they have in common is this part of the src value: https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302

I wrote the following code to find the element:

# some code
import re
from bs4 import BeautifulSoup

# page_html holds the HTML string returned by the API
regex_page_embed = r"http.?://fast\.player\.liquidplatform\.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/*"
soup = BeautifulSoup(page_html, 'html.parser')
page_elements = list(soup.children)
for element in page_elements:
    # search the serialized markup of each top-level element
    s1 = re.search(regex_page_embed, str(element))
    if s1:
        print(s1)
        print(s1.group())

After that I have more code that actually changes the HTML through the API; I don't think it is necessary to include it here. But when I use:

print(s1)
print(s1.group())

I get the following result:

<_sre.SRE_Match object; span=(686, 771), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/
<_sre.SRE_Match object; span=(126, 211), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/
<_sre.SRE_Match object; span=(686, 771), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/
<_sre.SRE_Match object; span=(227, 312), match='https://fast.player.liquidplatform.com/pApiv2/emb>
https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/

I want to get the last part of the iframe src value. In the example below:

<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">

f2c5f6ca3a4610c55d70cb211ef9d977 is the part that I want.

print(s1) and print(s1.group()) don't show the last part of the src value. How can I get it?

fabiobh
  • In the regex, change the star at the end to `(.*?)(?=\")`. – Quixrick Mar 26 '19 at 18:49
  • Relevant read on parsing html content with regex: https://stackoverflow.com/a/1732454/9183344 – r.ook Mar 26 '19 at 19:34
  • I'd just use `bs4` to parse the iframe and then extract the `src` text content and go from there... – r.ook Mar 26 '19 at 19:36
  • I tried bs4 first, but the regex returned more results than bs4 did. I investigated and found that some iframes are inserted into the page with JavaScript document.write, so only the regex finds them; bs4 does not. – fabiobh Mar 26 '19 at 19:41
  • Ah right, since it's dynamic contents you should be using a different module like `selenium` or `requests-html`. I'm actually surprised you are able to get the iframe in the `bs4` extracted content at all. – r.ook Mar 26 '19 at 19:46
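
Whichever way a src value is obtained (bs4 or a regex), the "go from there" step in the comments can be sketched with the standard library alone. This is a minimal illustration, not code from the thread: `last_path_segment` is a hypothetical helper name, and it assumes the wanted ID is always the final path segment of the URL. It also decodes the stray `%20` seen in the second iframe example.

```python
from urllib.parse import unquote, urlparse


def last_path_segment(src):
    """Return the final path segment of a URL, percent-decoded and stripped."""
    path = urlparse(src).path
    # Ignore a trailing slash, then keep what follows the last remaining slash.
    segment = path.rstrip("/").rsplit("/", 1)[-1]
    # "%20abc" decodes to " abc"; strip() drops the leading space.
    return unquote(segment).strip()
```

Note that a URL consisting only of the shared prefix would return the prefix's own last segment, so callers may want to check the prefix first.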

2 Answers

Use r'<iframe src="[^"]*/([^"]+)"' as the pattern for your search.

Example:

>>> import re
>>> text = """<iframe src="https://fast.player.liquidplatform.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977" webkitallowfullscreen="" width="490">"""
>>> pat = r'<iframe src="[^"]*/([^"]+)"'
>>> search = re.search(pat, text)
>>> search[1]
'f2c5f6ca3a4610c55d70cb211ef9d977'
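
For context on why the pattern in the question stops short: its trailing `/*` means "zero or more slash characters", not "anything after the slash", so the match ends right after the prefix. The sketch below contrasts that with a capturing group. The names `PREFIX`, `original`, and `fixed` are chosen here for illustration and do not come from the question.

```python
import re

# Shared prefix from the question, without the trailing "/*".
PREFIX = (r"http.?://fast\.player\.liquidplatform\.com/pApiv2/embed/"
          r"e50a2b66dc19adc532f288eb4bf2d302")
src = ("https://fast.player.liquidplatform.com/pApiv2/embed/"
       "e50a2b66dc19adc532f288eb4bf2d302/f2c5f6ca3a4610c55d70cb211ef9d977")

# "/*" matches only slashes, so the match stops after the single "/".
original = re.search(PREFIX + r"/*", src)
# A group after the slash captures everything up to the next double quote
# (or to the end of the string when there is none).
fixed = re.search(PREFIX + r'/([^"]+)', src)

print(original.group())  # ends with ".../e50a2b66dc19adc532f288eb4bf2d302/"
print(fixed.group(1))    # f2c5f6ca3a4610c55d70cb211ef9d977
```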
Russ Brown
  • I have edited my question to include a second iframe example. I forgot to mention that the HTML contains other kinds of iframes. Your answer would be correct if every iframe looked like the first example, but my page has iframes that are completely different from the two examples I gave; the only common part is the src prefix. – fabiobh Mar 26 '19 at 19:27
A better regex, which captures the whole URL and allows any optional attributes between the <iframe tag name and the src attribute, is this:

<iframe .*?\bsrc="(https?://fast\.player\.liquidplatform\.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/[^"]+)

Match using this regex and read the URL from group 1.

Here is your updated Python code:

import re
from bs4 import BeautifulSoup

# page_html holds the HTML string returned by the API
regex_page_embed = r'<iframe .*?\bsrc="(https?://fast\.player\.liquidplatform\.com/pApiv2/embed/e50a2b66dc19adc532f288eb4bf2d302/[^"]+)'
soup = BeautifulSoup(page_html, 'html.parser')
page_elements = list(soup.children)
for element in page_elements:
    s1 = re.search(regex_page_embed, str(element))
    if s1:
        print(s1.group(1))  # extract the URL using the first group
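
Since the question asks only for the last part of the URL, and the comments note that some iframes exist only inside document.write strings, a variation is to scan the raw HTML text directly and capture just the trailing segment. This is a rough sketch under those assumptions rather than part of the answer above: `EMBED_ID_RE` and `find_embed_ids` are hypothetical names, and the character class that ends the capture (quotes, whitespace, angle brackets, backslash) is a guess at what can follow the ID.

```python
import re
from urllib.parse import unquote

# Shared prefix from the question; group 1 is the final path segment.
EMBED_ID_RE = re.compile(
    r"https?://fast\.player\.liquidplatform\.com/pApiv2/embed/"
    r"e50a2b66dc19adc532f288eb4bf2d302/"
    r"""([^"'\s<>\\]+)"""
)


def find_embed_ids(page_html):
    """Return the trailing ID of every matching embed URL in the raw HTML."""
    # finditer reports every occurrence, not only the first one per element.
    # unquote + strip turns "%20abc" into "abc".
    return [unquote(m.group(1)).strip() for m in EMBED_ID_RE.finditer(page_html)]
```

Calling `find_embed_ids(page_html)` on the two iframe examples from the question should yield the same ID twice, because the `%20` in the second URL is decoded and stripped.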
Pushpesh Kumar Rajwanshi