PHP preg_replace with src ../

Question

Original Code

<script language="javascript" src="/lta/vrl/scripts/vrlCommons.js"></script>
<script language="JavaScript" src="../scripts/formObjCommons.js"></script>

My Code

$url = "https://example.com";
$url2 = "https://example.com/lta/vrl";
$result = file_get_contents('https://example.com', false, $context);
$result = preg_replace('/src="(https:\/\/)?([^"]+)"/', "src=\"$url\\2\"", $result);

How can I get this result instead?

  <script language="javascript" src="$url/lta/vrl/scripts/vrlCommons.js"></script>
  <script language="JavaScript" src="$url2/scripts/formObjCommons.js"></script>

score 0 · Answer 1 · edited May 23 '17 at 11:52

If you're going to fetch arbitrary pages from the internet with file_get_contents and rewrite their contents to point back at your domain, essentially building a proxy browser, you should know that there are many malformed webpages out there. Do not attempt to parse HTML with regex, as explained here: RegEx match open tags except XHTML self-contained tags

What I would suggest instead is an HTML parsing engine that can compensate for noise in the HTML: it corrects malformed documents and stray angle brackets, converts problematic characters to entities, and finally lets you walk the document as a tree of nodes, much like JavaScript does in the browser.

A PHP library I swear by, and have used successfully on large projects, even with SEO-related content and long documents, without running into regex memory limits, is http://simplehtmldom.sourceforge.net/ Once downloaded, all you need to do is include simple_html_dom.php in your project. Then, to use the library with your code, you'd do:

$dom = str_get_html($result);

From there, use the DOM methods mentioned in the manual. First select all the elements you'd like to alter, or all elements with *. Then loop through them and check whether a src attribute is set. If it is, grab the value of the src, which is its URL, then replace its domain with your domain.

To do this, don't use a regular expression. There are many URL structures out there, and checking for them gets complicated: a leading // means "use the current scheme"; subdomains mean you don't know how many dots to search for; a forward slash may never appear, and you'll hit a ? starting the query string or a # starting the fragment instead. Or, to completely blow all your logic out of the water, you might encounter an @, which means a username, then a colon, then a password, all placed before the domain.

There's a REALLY simple way to do this in PHP, since there is a function designed specifically for replacing parts of URLs with new ones. The function is http_build_url. Sadly, though, it isn't widely supported (it is provided by the PECL http extension, not by core PHP), and it's probably not available on your server. There exists an alternative here which defines it for you if it doesn't exist. I don't know how reliable it is, but I see it relies on parse_url, the function I would have proposed otherwise. The idea is that you'd parse the URL, replace the host part, then reconstruct the URL manually. But I like http_build_url more, because the job becomes an easy one-liner.
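If http_build_url isn't available, the parse_url approach described above could look roughly like the sketch below. The function name replace_host is made up for this illustration, and the reassembly is a minimal version that assumes well-formed input; it is not a drop-in replacement for http_build_url.

```php
<?php
// Hypothetical helper (name invented for this example): swap the host of a
// URL with parse_url() and rebuild the string from its parts.
function replace_host(string $url, string $newHost): string
{
    $parts = parse_url($url);

    // Relative URLs ("/a.js", "../a.js") and unparseable input have no host
    // to swap, so they are returned unchanged.
    if ($parts === false || !isset($parts['host'])) {
        return $url;
    }

    // A missing scheme with a host present means a scheme-relative "//host" URL.
    $scheme = isset($parts['scheme']) ? $parts['scheme'] . '://' : '//';

    // Optional "user:pass@" section.
    $auth = '';
    if (isset($parts['user'])) {
        $auth = $parts['user'];
        if (isset($parts['pass'])) {
            $auth .= ':' . $parts['pass'];
        }
        $auth .= '@';
    }

    $port     = isset($parts['port']) ? ':' . $parts['port'] : '';
    $path     = $parts['path'] ?? '';
    $query    = isset($parts['query']) ? '?' . $parts['query'] : '';
    $fragment = isset($parts['fragment']) ? '#' . $parts['fragment'] : '';

    return $scheme . $auth . $newHost . $port . $path . $query . $fragment;
}
```

For example, replace_host('http://google.com/search?q=yay', 'example.com') gives http://example.com/search?q=yay, while a relative value such as ../scripts/formObjCommons.js is passed through untouched.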

To test the http_build_url function, you can try:

echo http_build_url('http://google.com/search?q=yay',array('host'=>'example.com'));

Once you've got that working, you'll know how to replace the URLs very easily. Then use the Simple HTML DOM parsing library I linked to above to update each src attribute to your new URL.

Once you've made your changes to the DOM document, you'd do:

$result=$dom->save();

Then you'll have the updated document loaded back into the $result string you were working on, ready to deliver to the user of what appears to be your proxy browser.
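For reference, here is a rough sketch of the same select, loop, rewrite, and save steps. To stay self-contained it swaps in PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of Simple HTML DOM, so the method names differ from that library's. The function name rewrite_src and the callback parameter are invented for this example; pass whatever URL-rewriting function you settled on.

```php
<?php
// Sketch only: uses the bundled DOM extension, not Simple HTML DOM.
// rewrite_src() and $rewrite are hypothetical names for this illustration.
function rewrite_src(string $html, callable $rewrite): string
{
    $dom = new DOMDocument();

    // Collect libxml warnings about malformed markup instead of printing
    // them; the parser repairs what it can.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select every element that carries a src attribute.
    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//*[@src]') as $node) {
        $node->setAttribute('src', $rewrite($node->getAttribute('src')));
    }

    // Serialize the modified tree back to an HTML string.
    return $dom->saveHTML();
}

// Usage: prefix root-relative paths with a base URL (assumed value).
$result = rewrite_src(
    '<html><body><script src="/lta/vrl/scripts/vrlCommons.js"></script></body></html>',
    function (string $src): string {
        return strpos($src, '/') === 0 && strpos($src, '//') !== 0
            ? 'https://example.com' . $src
            : $src;
    }
);
```

Keeping the URL logic in a callback means the DOM walk stays the same whether you swap hosts, resolve relative paths, or both.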
