Parse HTML that is returned as an external link

Question

Update: This is what I based the regex on:

http://www.phpliveregex.com/p/iLL

I cannot figure this out for the life of me. I would think it is simple, but apparently it is not.

In an app, I am trying to return a URL that is parsed on a separate page (to avoid cross-domain errors). The problem is that `#.m3u8` is appended to the end of the returned URL, e.g.:

http://ipaddress/stringHere/stringHere/link.m3u8#.m3u8

This is my code.

This page is where the parsing takes place.

<?php 
header("Access-Control-Allow-Origin: *");
header("Connection: keep-alive");
header("Content-Type: video/vnd.apple.mpegurl");
header("Accept-Ranges: bytes");

?>
<?php
ob_start(); // ensures anything dumped out will be caught
$html = file_get_contents("https://www.example.com/s/");

preg_match_all(
    '/((?:(?:f|ht){1}tp:\/\/)[-a-zA-Z0-9@:%_\+.~#?&\/\/=]+.m3u8)/',
    $html,
    $posts, // will contain the article data
    PREG_SET_ORDER // formats data into an array of posts
);

foreach ($posts as $post) {
    $link = $post[1];

    // clear out the output buffer
    while (ob_get_status()) {
        ob_end_clean();
    }

    // redirect to the extracted link
    header("Location: $link");
}

?>

This is the page that is called by the app and has the link returned.

<?php 
header("Access-Control-Allow-Origin: *");
header("Connection: keep-alive");
header("Content-Type: video/vnd.apple.mpegurl");
header("Accept-Ranges: bytes");
header("Cache-Control: private, max-age=0, must-revalidate");
?>
<html>
<body>
<?php
include ("parse.php");
?>
</body>
</html>

The result is:

http://ipAddress/hls/3691308f2a4c2f6983f2880d32e29c84-313391.m3u8#.m3u8

What I really need is for the result to have the IP address replaced with a specific one and the `#.m3u8` removed:

http://replacedAddress/hls/3691308f2a4c2f6983f2880d32e29c84-313391.m3u8

I hope someone can help. I'm pulling my hair out.

To match a literal dot outside of a character class, you need to escape it. So `.m3u8` must be written `\.m3u8` in the pattern. — Casimir et Hippolyte, Jan 31 '17 at 00:09
If you want to avoid the fragment part of the url, remove `#` from the character class. — Casimir et Hippolyte, Jan 31 '17 at 00:11
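The two regex suggestions above can be sketched as follows. This is only an illustration: the sample HTML and the address `192.0.2.10` are made up, since the real page content is not shown in the question. The pattern is the one from the question with `#` dropped from the character class and the dot before `m3u8` escaped.

```php
<?php
// Hypothetical sample standing in for the fetched page (the real HTML is not shown).
$html = '<a href="http://192.0.2.10/hls/abc-123.m3u8#.m3u8">stream</a>';

// Pattern from the question, with two changes:
//   1. "#" is removed from the character class, so a match cannot run into the fragment.
//   2. The dot before "m3u8" is escaped, so it matches only a literal dot.
$pattern = '/((?:f|ht)tp:\/\/[-a-zA-Z0-9@:%_+.~?&\/=]+\.m3u8)/';

$link = null;
if (preg_match($pattern, $html, $match)) {
    $link = $match[1]; // first capture group: the URL up to and including ".m3u8"
}

echo $link, PHP_EOL;
```

Because `#` can no longer be consumed, the match has to end at the first `.m3u8`, so the trailing `#.m3u8` should be left out of `$link`.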
A better way to extract urls from an html document is to search links with DOMDocument, then use `parse_url` to obtain the different parts of the url, replace the hostname and rebuild the url without the fragment part. — Casimir et Hippolyte, Jan 31 '17 at 00:16
Thank you so much for the comments. Could I trouble you for some example code? This is new to me. I have tried DOMDocument a few times, but that was just parsing strings within . — norcal johnny, Jan 31 '17 at 00:24
See the accepted answer [here](http://stackoverflow.com/questions/4423272/how-to-extract-links-and-titles-from-a-html-page), then look at `parse_url` in the PHP manual (play with it and try a few things; it isn't difficult). — Casimir et Hippolyte, Jan 31 '17 at 00:31
I will take a look at that now. I honestly do want to figure it out and God knows I have tried. I thank you and will look now. Cheers! — norcal johnny, Jan 31 '17 at 00:33
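For reference, the DOMDocument and `parse_url` approach suggested in the comments might look roughly like the sketch below. Several things here are assumptions rather than facts from the question: that the stream URL sits in the `href` of an `<a>` element, the sample HTML, the helper name `rewriteStreamUrl`, and the replacement host `replaced.example`. If the URL actually appears inside a script or another attribute, the lookup part would need to change.

```php
<?php
// Hypothetical helper: rebuilds $url with a new host and without the "#fragment" part.
function rewriteStreamUrl(string $url, string $newHost): ?string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // not an absolute URL that can be rebuilt
    }

    // Reassemble from the parsed parts; the "fragment" key is deliberately left out.
    $rebuilt = $parts['scheme'] . '://' . $newHost;
    if (isset($parts['port'])) {
        $rebuilt .= ':' . $parts['port']; // assumption: keep the original port, if any
    }
    $rebuilt .= $parts['path'] ?? '';
    if (isset($parts['query'])) {
        $rebuilt .= '?' . $parts['query'];
    }
    return $rebuilt;
}

// Hypothetical sample standing in for the fetched page.
$html = '<html><body><a href="http://192.0.2.10/hls/abc-123.m3u8#.m3u8">stream</a></body></html>';

libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    $path = parse_url($href, PHP_URL_PATH);
    // Keep only links whose path ends in ".m3u8".
    if (is_string($path) && substr($path, -5) === '.m3u8') {
        $rewritten = rewriteStreamUrl($href, 'replaced.example');
        if ($rewritten !== null) {
            $links[] = $rewritten;
        }
    }
}

print_r($links);
```

Rebuilding the URL from its parsed parts avoids having to strip `#.m3u8` with string tricks, because the fragment is simply never added back.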
