I have the following text, from which I want to extract links:

[{"file":"https:\/\/www.rapidvideo.com\/loadthumb.php?v=FFIMB47EWD","kind":"thumbnails"}], 
    "sources": [
        {"file":"https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
         "label":"Standard (288p)","res":"288"},
        {"file":"https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4"

I would like to extract only the links ending in .mp4.

My regex is as follows:

"file":"(https\:.*?\.mp4)"

However, the matches are wrong: the first link, which ends in .php, is matched as well. I am practising on Pythex.org. How do I avoid matching the first link? The HTML page I am trying to parse is https://www.rapidvideo.com/e/FFIMB47EWD

Echchama Nayak

1 Answer

Why use regular expressions at all? This looks like a JSON object or Python dict, so you could simply iterate over it and filter with str.endswith.

>>> sources = {
...     "sources": [
...         {"file": "https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",
...          "label": "Standard (288p)","res":"288"},
...         {"file": "https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4",
...          "label": "Standard (288p)","res":"288"}
...     ]
... }
>>> for item in sources['sources']:
...     if item['file'].endswith('.mp4'):
...         print(item['file'])
... 
https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4
https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4
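
If you do want to stay with a regex, here is a minimal sketch of one way to do it. As far as I can tell, the original pattern fails because the lazy .*? still starts at the first "file":"https: and, when everything is on one line, keeps going across the closing quote until it reaches the first .mp4. Replacing . with [^"] keeps each match inside a single quoted string. SAMPLE and find_mp4_links are names made up for this illustration, and SAMPLE is only a one-line approximation of the snippet in the question, not the real page source.

```python
import json
import re

# Hypothetical one-line sample modelled on the snippet in the question.
SAMPLE = (
    r'[{"file":"https:\/\/www.rapidvideo.com\/loadthumb.php?v=FFIMB47EWD","kind":"thumbnails"}], '
    r'"sources": [{"file":"https:\/\/www588.playercdn.net\/85\/1\/e_q8OBtv52BRyClYa_w0kw\/1496784287\/170512\/359E33j28Jo0ovY.mp4",'
    r'"label":"Standard (288p)","res":"288"},'
    r'{"file":"https:\/\/www726.playercdn.net\/86\/1\/q64Rsb8lG_CnxQAX6EZ2Sw\/1496784287\/170512\/371lbWrqzST1OOf.mp4"}]'
)

# [^"]* cannot cross a double quote, so a match can never start in the
# .php entry and end in a later .mp4 entry.
MP4_PATTERN = re.compile(r'"file":"(https:[^"]*\.mp4)"')


def find_mp4_links(text):
    """Return the .mp4 URLs found in text, with the JSON \\/ escapes undone."""
    links = []
    for raw in MP4_PATTERN.findall(text):
        # Decode the captured text as a JSON string so that \/ becomes /.
        links.append(json.loads('"' + raw + '"'))
    return links


if __name__ == '__main__':
    for link in find_mp4_links(SAMPLE):
        print(link)
```

The json.loads step is optional; it is only there because the URLs in the page are JSON-escaped, so the raw captures still contain \/ instead of /.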

EDIT:

It looks like that link is available in a video tag once the JavaScript has loaded. You could use a headless browser, but I just used Selenium to load the page fully and then saved the HTML.

Once you have the full page HTML, you can parse it with BeautifulSoup instead of regular expressions.

Related: Using regular expressions to parse HTML: why not?

from bs4 import BeautifulSoup
from selenium import webdriver


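def extract_mp4_link(page_html):
    # The video URL is the src attribute of the first <video> tag.
    soup = BeautifulSoup(page_html, 'lxml')  # needs the lxml package installed
    return soup.find('video')['src']


def get_page_html(url):
    # Load the page in a real browser so its JavaScript runs,
    # then return the rendered HTML.
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # quit() ends the whole browser session, even if loading failed.
        driver.quit()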


if __name__ == '__main__':
    page_url = 'https://www.rapidvideo.com/e/FFIMB47EWD'
    page_html = get_page_html(page_url)
    print(extract_mp4_link(page_html))
G_M