I am trying to extract the link that follows hls: in the HTML below using a regex. I tried (r"(?<=hls:\s\')(.*)"), but it returns only part of the link: https://mvd4.ddns.me:443/1vod5n/almajde-ben-zaher-1. Any suggestions?

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>RikTak Video Player - Version 1</title>
    <script src="https://cdn.radiantmediatechs.com/rmp/5.2.1/js/rmp.min.js"></script>
    <style>
        body {
            margin: 0;
        }
    </style>
</head>
<body>
<div id="rmpPlayer"></div>
<script>
    var bitrates = {
         hls: 'https://mvd4.ddns.me:443/1vod5n/almajde-ben-zaher-1.mp4/playlist.m3u8?wmsAuthSign=c2VydmVyX3RpbWU9MTAvMjQvMjAxOSA3OjUyOjA2IEFNJmhhc2hfdmFsdWU9WjIxaHNDcTZDMXEzTmM4ZTFTU0RIUT09JnZhbGlkbWludXRlcz02MA=='
    };

        var schedule = {
       preroll: [
            'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar'
            ],
        midroll: [

            [600,'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar'],
            [1200,'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar'],

            [1800,'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar']
            ],
        postroll: [
            'https://googleads.g.doubleclick.net/pagead/ads?ad_type=video_text_image&client=ca-video-pub-1231661633440980&description_url=https%3A%2F%2Fwww.farfeshplus.com&channel=7962520214&videoad_start_delay=0&hl=ar'
        ]
    };
        var settings = {
        licenseKey: 'Kl8lNHNrNzkyY3M5dj9yb201ZGFzaXMzMGRiMEElXyo=',
        bitrates: bitrates,
        delayToFade: 3000,
        width: 750,
        height: 440,
        skin: 's4',
        hlsJSMaxBufferSize: 0,
        hlsJSMaxBufferLength: 240,
        poster: 'https://www.farfeshplus.com/ramadanimages/1443.jpg',
        ads: true,
        adSchedule: schedule
    };
    var elementID = 'rmpPlayer';
    var rmp = new RadiantMP(elementID);
    rmp.init(settings);
</script>
</body>
</html>
Ibtsam Ch
  • I think it should work, right? https://regex101.com/r/FCQL63/1 – The fourth bird Oct 24 '19 at 10:57
  • Maybe you should show the code you tried. As the previous comment said, the regexp works fine. – Amessihel Oct 24 '19 at 11:01
  • Possible duplicate of [What is the best regular expression to check if a string is a valid URL?](https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url) – Kris Oct 24 '19 at 11:08

1 Answer

I would use Beautiful Soup to first parse and obtain the content for the <script> tag. Then, use regex to extract the link you want.

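import re
from bs4 import BeautifulSoup

# `page` is assumed to be the response for the player page (e.g. from requests.get)
soup = BeautifulSoup(page.content, 'html.parser')
# The first <script> only loads rmp.min.js, so select the inline one that defines bitrates
script = soup.find('script', string=re.compile(r'var bitrates'))
# Pass the script text (not the Tag) to re.search and capture what is between the quotes
m = re.search(r"hls:\s*'([^']+)'", script.string)
print(m.group(1))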

The problem with using regex alone is that you really need a parser here to handle arbitrarily nested HTML content. Regex was not designed for this task.
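
If installing Beautiful Soup is not an option, the same parse-first, regex-second idea can be sketched with the standard library alone. This is only a minimal illustration: the class InlineScriptCollector, the function extract_hls_url, and the sample markup with its example.com URL are hypothetical names made up for this sketch, not part of the page in the question.

```python
import re
from html.parser import HTMLParser


class InlineScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")  # one entry per <script>, even if empty

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts[-1] += data


def extract_hls_url(html):
    """Return the quoted value after `hls:` from the first script that has one, else None."""
    collector = InlineScriptCollector()
    collector.feed(html)
    collector.close()
    for text in collector.scripts:
        m = re.search(r"hls:\s*'([^']+)'", text)
        if m:
            return m.group(1)
    return None


# Trimmed stand-in for the page in the question (hypothetical URL and token)
sample = """
<html><head><script src="https://example.com/player.js"></script></head>
<body><script>
    var bitrates = {
         hls: 'https://example.com/video.mp4/playlist.m3u8?wmsAuthSign=abc=='
    };
</script></body></html>
"""

print(extract_hls_url(sample))
```

The capture group [^']+ stops at the closing quote, so the whole URL including its query string is returned regardless of what follows on the line.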

Tim Biegeleisen