Using PHP, is it possible to remove the http: protocol from an img src?

So the img src would be:

<img src="//www.example.com/image.jpg" />

instead of

<img src="http://www.example.com/image.jpg" />

Would str_replace be a good option here? I know I can define:

$contentImg = str_replace(array('http', 'https'), '', $filter);

I'm just not sure how to define $filter.

brandozz
  • $filter would be your src string. Where is that coming from? – jlindenbaum Jan 07 '15 at 22:31
  • Why do you want to do this? – Willem Van Onsem Jan 07 '15 at 22:32
  • Probably for protocol-relative linking. I've run into trouble with this on mixed http+https servers. – Ding Jan 07 '15 at 22:32
  • 1) Check the documentation for [`str_replace`](http://php.net/str_replace), 2) `$filter` would be whatever text you're trying to modify (i.e. your HTML), 3) using `str_replace` is simple, but it might be too simple (i.e. it will butcher URLs like `https://example.com/docs/http/tutorial.html`) – Mr. Llama Jan 07 '15 at 22:32
  • @CommuSoft Removing the protocol and leaving `//` tells the browser to request the static files using the same protocol as the source page. – Scopey Jan 07 '15 at 22:33
  • You can also use trim; see http://stackoverflow.com/questions/4357668/how-do-i-remove-http-https-and-slash-from-user-input-in-php – hous Jan 07 '15 at 22:45

2 Answers

Yes, str_replace works for this. The result is a protocol-relative link.

<?php echo str_replace(array('http:', 'https:'), '', 'http://www.google.com'); ?>

It outputs

//www.google.com

That works as expected. Alternatively, you can use preg_replace, which lets you match with regular expressions (regex). CommuSoft posted an answer with a good example.
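To make the difference between the two calls concrete, here is a minimal sketch. The sample URL is made up for illustration: it carries a second scheme inside its query string, which the plain string replacement also strips.

```php
<?php
// Hypothetical URL with a second scheme embedded in the query string.
$url = 'https://example.com/redirect?to=http://other.example/';

// str_replace removes every occurrence, including the embedded one.
$everywhere = str_replace(array('http:', 'https:'), '', $url);

// An anchored regex only touches a scheme at the very start of the string.
$leadingOnly = preg_replace('/^https?:/', '', $url);

echo $everywhere, "\n";  // //example.com/redirect?to=//other.example/
echo $leadingOnly, "\n"; // //example.com/redirect?to=http://other.example/
```

For a single, well-formed image URL both calls give the same result, so the choice mostly matters when the input may contain more than one URL.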

Ding
  • This answer is better than CommuSoft's because it uses str_replace, which is far faster than preg_replace. Always use str_replace instead of a regex if you can. – newz2000 Jan 07 '15 at 23:03
  • @newz2000: well, a regex can be matched in time linear in the input, whereas `str_replace` depends on the implementation: if it uses string matching, it has to perform two matches (http and https), so it doesn't scale well to more protocols. Another way is converting it to a regex... Furthermore, a small aspect that can become problematic is that it can replace `https:` in the middle of a (corrupt) URL. – Willem Van Onsem Jan 07 '15 at 23:17
  • @newz2000: I ran a few benchmarks. I don't know what your definition of *far faster* is, but this looks competitive, and (as predicted) once the number of search strings grows, a regex achieves better performance. – Willem Van Onsem Jan 07 '15 at 23:30
  • Keep in mind that this is an extremely simple text replacement. If things were more complex I might not have made the statement. str_replace will typically be 6-20 times faster in simple scenarios like this. – newz2000 Jan 08 '15 at 00:59
  • @newz2000: In a complex test case, the number of strings will also blow up (and as said before, `str_replace` also runs linearly in the number of those strings). Can you give a reasonable test case with a relevant number of strings (say `6+`)? And even for a single string one must run *Knuth's algorithm*. Knuth's algorithm is a special case of a regex; it is indeed better to use it, but the time complexity is the same, so for large inputs the running time scales the same way. – Willem Van Onsem Jan 08 '15 at 01:05

Assuming that $filter is populated correctly with the src string, you can also use a regular-expression replace:

$contentImg = preg_replace('/^https?:/', '', $filter);

Here '/^https?:/' is a regex:
- the ^ character matches the beginning of the string, so only a protocol at the very front is removed;
- the ? is a special character that makes the preceding s optional, so the pattern matches both http: and https:.

With regexes, some patterns can be written more compactly. Say (for the sake of argument) that you also wish to remove ftp and sftp; you can use:

'/^(https?|s?ftp):/'

Here | means "or", and the parentheses group the alternatives.
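As a small illustration of that pattern in use, consider the sketch below. The helper name strip_scheme and the sample URLs are made up for this example.

```php
<?php
// Hypothetical helper: strips a leading http, https, ftp or sftp scheme.
function strip_scheme($url)
{
    return preg_replace('/^(https?|s?ftp):/', '', $url);
}

echo strip_scheme('https://www.example.com/image.jpg'), "\n"; // //www.example.com/image.jpg
echo strip_scheme('sftp://files.example.com/a.txt'), "\n";    // //files.example.com/a.txt
echo strip_scheme('mailto:someone@example.com'), "\n";        // mailto:someone@example.com (unchanged)
```

Supporting another scheme only means adding one more alternative inside the parentheses.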

Note that your str_replace call also forgot to remove the colon (:).

However, I'm more worried that your $filter will contain the entire page source code. In that case, the replacement can do more harm than good, since any other text containing http: will be altered as well. To parse and process XML/HTML, it is better to use a DOM parser (in PHP, DOMDocument). This introduces some overhead, but as some software engineers argue: "Software engineering is engineering systems against fools, the universe currently produces more and more fools, the small bit of additional overhead is thus justifiable".

Example:

You should definitely use a DOM parser as argued above (since that approach is more robust):

$dom = new DOMDocument;
$dom->loadHTML($html); // $html holds the input document
foreach ($dom->getElementsByTagName('img') as $image) {
    $image->setAttribute('src', preg_replace('/^https?:/', '', $image->getAttribute('src')));
}
$html = $dom->saveHTML(); // $html now stores the new version

(Running this in php -a gives the expected output for your test example.)

Or in a post-processing step:

$html = get_the_content();
$dom = new DOMDocument;
$dom->loadHTML($html); //$html is the input of the document
foreach ($dom->getElementsByTagName('img') as $image) {
    $image->setAttribute('src',preg_replace('/^https?:/','',$image->getAttribute('src')));
}
$html = $dom->saveHTML();
echo $html;
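To show why the DOM route is safer than a replacement on the raw markup, here is a minimal sketch that wraps the same idea in a function and runs it on a fragment whose text also mentions http:. The function name and the sample markup are made up for illustration, and the DOM extension is assumed to be available.

```php
<?php
// Hypothetical helper; assumes PHP's DOM extension is installed.
function make_img_src_protocol_relative($html)
{
    $dom = new DOMDocument;
    // Real-world markup is often not perfectly valid, so collect libxml
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Only the src attribute of img elements is rewritten.
    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');
        $image->setAttribute('src', preg_replace('/^https?:/', '', $src));
    }
    return $dom->saveHTML();
}

// Made-up fragment: the paragraph text mentions "http:" and must stay intact.
$html = '<p>The protocol http: is explained here.</p>'
      . '<img src="http://www.example.com/image.jpg" />';
$result = make_img_src_protocol_relative($html);
echo $result;
```

Keep in mind that loadHTML and saveHTML, called like this, wrap a fragment in a doctype plus html and body elements, so the output is a full document, not the bare fragment.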

Performance:

The timings below were measured in the php -a interactive shell (10,000,000 calls per test, as in the loops shown):

$ php -a
php > $timea=microtime(true); for($i = 0; $i < 10000000; $i++) { str_replace(array('http:', 'https:'), '', 'http://www.google.com'); }; echo (microtime(true)-$timea);  echo "\n";
5.4192590713501
php > $timea=microtime(true); for($i = 0; $i < 10000000; $i++) { preg_replace('/^https?:/','', 'http://www.google.com'); }; echo (microtime(true)-$timea);  echo "\n";
5.986407995224
php > $timea=microtime(true); for($i = 0; $i < 10000000; $i++) { preg_replace('/https?:/','', 'http://www.google.com'); }; echo (microtime(true)-$timea);  echo "\n";
5.8694758415222
php > $timea=microtime(true); for($i = 0; $i < 10000000; $i++) { preg_replace('/(https?|s?ftp):/','', 'http://www.google.com'); }; echo (microtime(true)-$timea);  echo "\n";
6.0902049541473
php > $timea=microtime(true); for($i = 0; $i < 10000000; $i++) { str_replace(array('http:', 'https:','sftp:','ftp:'), '', 'http://www.google.com'); }; echo (microtime(true)-$timea);  echo "\n";
7.2881300449371

Thus:

str_replace:           5.4193 s     0.00000054193 s/call
preg_replace (with ^): 5.9864 s     0.00000059864 s/call
preg_replace (no ^):   5.8695 s     0.00000058695 s/call

For more possible parts (including sftp and ftp):

str_replace:           7.2881 s     0.00000072881 s/call
preg_replace (no ^):   6.0902 s     0.00000060902 s/call
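If you want to repeat such a measurement as a script instead of in the interactive shell, a rough sketch could look like this. The helper time_calls and the iteration count are assumptions made for this example, and the closure call adds the same overhead to both measurements.

```php
<?php
// Hypothetical helper: runs $fn $iterations times and returns elapsed seconds.
function time_calls(callable $fn, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

$url = 'http://www.google.com';
$iterations = 100000; // kept small for a quick run; raise it for steadier numbers

$timings = array(
    'str_replace' => time_calls(function () use ($url) {
        str_replace(array('http:', 'https:'), '', $url);
    }, $iterations),
    'preg_replace' => time_calls(function () use ($url) {
        preg_replace('/^https?:/', '', $url);
    }, $iterations),
);

// Print total time and time per call for each variant.
foreach ($timings as $name => $seconds) {
    printf("%-13s %.4f s  %.10f s/call\n", $name, $seconds, $seconds / $iterations);
}
```

The absolute numbers will differ per machine and PHP version, so only the relative difference between the two rows is meaningful.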
Willem Van Onsem
  • Using the DOMParser I have foreach($html->find('img[src]') as $element). With this, can I then remove the http: and https: using a regex? – brandozz Jan 07 '15 at 23:30
  • @brandozz: Yes, but note that you will need to **set the attribute** (using the right calls to the dom parser). Will update answer. – Willem Van Onsem Jan 07 '15 at 23:31
  • the DomParser works when I'm pulling the html from another document. I need to remove the protocol from the images on the same page as the script. Sorry if I didn't make that clear – brandozz Jan 08 '15 at 00:14
  • @brandozz: is that script the terminating script? What you could do is write a `.htaccess` handler that post-processes the generated document. – Willem Van Onsem Jan 08 '15 at 00:15
  • CommuSoft - actually I'm using WordPress and I think I may have found the solution: $content = get_the_content(); $content = str_replace(array('http:', 'https:'), '', $content); echo $content – brandozz Jan 08 '15 at 01:41
  • @brandozz: this will also replace `http:` in other attributes and in the text itself. See the updated example for how to process this more safely. – Willem Van Onsem Jan 08 '15 at 01:44
  • CommuSoft - posted my lame solution over at wordpress stackexchange and got some more feedback: http://wordpress.stackexchange.com/questions/174228/remove-the-http-protocol-from-images/174229#174229 – brandozz Jan 08 '15 at 03:44
  • wouldn't preg be resource heavy? – Nabeel Khan May 06 '16 at 16:23
  • @NabeelKhan: You mean like memory? – Willem Van Onsem May 06 '16 at 18:50
  • memory would be an issue when the string is long; however, wouldn't it affect the processor too? – Nabeel Khan May 06 '16 at 20:10
  • @NabeelKhan: The length of the string does not make much difference since it is stored only once. As far as I know PHP matches regexes in linear time, which is the same as for Knuth's string-matching algorithm, so processing speeds are probably comparable. I also ran benchmark tests, as you can find in the answer. The comparison is done in nanoseconds, so I guess performance is not really an issue at all, whether you do it with `str_replace` or `preg_replace`. – Willem Van Onsem May 06 '16 at 20:36
  • Sorry *Knuth* got a bit too much credit. Evidently I mean the [*Knuth-Morris-Pratt algorithm*](https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm). – Willem Van Onsem May 06 '16 at 20:55
  • length of the string will affect the memory :) that's what I meant – Nabeel Khan May 07 '16 at 00:08
  • str_replace can handle strings up to 10 times the maximum size that preg_replace can handle on the same server – Nabeel Khan May 07 '16 at 00:08
  • @NabeelKhan: do you have a link to some benchmarks (with source code). Furthermore note that we are talking about URLs. Do you know a practical url of 80k+ characters? – Willem Van Onsem May 07 '16 at 07:42
  • Finally note that I was talking computationally complexity-wise. In other words [*big oh*](https://en.wikipedia.org/wiki/Big_O_notation). – Willem Van Onsem May 07 '16 at 07:49
  • @WillemVanOnsem add a 0 to the 8000 on line 16 to see the change. https://github.com/nabtron/library-php/blob/master/preg_replace_faster_than_str_replace.php – Nabeel Khan May 07 '16 at 10:25
  • and the URL might be small, but filtering URLs in the website output can be an issue – Nabeel Khan May 07 '16 at 10:26
  • and in this test preg_replace takes less time (provided that there are fewer characters) – Nabeel Khan May 07 '16 at 10:27
  • @NabeelKhan: The filtering is done using a DOM parser, which is a linear-time push-down automaton. Furthermore, you have to test multiple times: if I run the test 800 times (and only time the matching, not the string construction), I get `12.071273 sec` for `preg` and `3.628261 sec` for `str`, which looks acceptable to me. Finally, you have to add a start anchor since you only want to replace the protocol, not a password in the URL (which is separated by a colon as well) ;). – Willem Van Onsem May 07 '16 at 10:38
  • @NabeelKhan: please run this testbench http://pastebin.com/2KnjByFV It is the practical case since you first need to collect urls with the DOM parser. I generated a `80k+` string and using `preg` took 1.641976 seconds whereas `str` took 2.627117 seconds. – Willem Van Onsem May 07 '16 at 10:40
  • The conclusion is, I think, that for all practical cases the time difference is negligible. It takes milliseconds to process such requests. Therefore I would use a regex, since it is easily extensible to `ftp`, `sftp`, `ssh`,... – Willem Van Onsem May 07 '16 at 10:44
  • @WillemVanOnsem include "collecting urls from DOM" in the test too then to see how much load it "practically" adds :) , maybe your definition for "practical" differs than mine. i'm not sure if you've worked on sites with millions hits per min – Nabeel Khan May 07 '16 at 10:48
  • @NabeelKhan: that's my entire point and why I added *computational complexity*-wise: it means that when you call it with a string of almost infinite length, the difference only scales linearly. That's what computational complexity is all about. :) Nevertheless, one has to use a parser because HTML is a *context-free* language, so processing it with a regular expression or string replacement could corrupt the HTML file's content. – Willem Van Onsem May 07 '16 at 10:51
  • Mind however that computational complexity is important. Google asks about it in every job interview since they work with (very) big data. Finally, note that if you implement a server stack for 1M+ requests per second, I would definitely not use an interpreted language like PHP, but a compiled language like C++ or a JIT-compiled language like C#, since a lot of time is wasted interpreting the language. – Willem Van Onsem May 07 '16 at 10:53
  • btw, I just noticed that the results are opposite in php 7.0 and php 5.x: in php 5.x str_replace is faster with the same code of mine, while in 7.0 preg_replace is! I've made a post about it: https://nabtron.com/preg_replace-vs-str_replace/ with exact code I used and – Nabeel Khan May 07 '16 at 11:23
  • @NabeelKhan: I think (as said before) that in order to do fair comparisons, you have to repeat the experiment in a `for` loop (since otherwise it is very sensitive to aspects like *cache faults*). Furthermore, constructing the string should *not* be part of the timings. Nevertheless an interesting conclusion. – Willem Van Onsem May 07 '16 at 11:46
  • Furthermore, you should be careful using `str_replace` or `preg_replace` on raw HTML code. Take for instance an article that says: "The protocol http: this is a protocol". You do not want to alter the semantics of the article itself, I guess? That's why you first need to load it into a parse tree to make sure you only replace the correct parts. – Willem Van Onsem May 07 '16 at 11:51
  • i understand, but that part is common for both str and preg,only the replace line is diff, but anyway i will still carry out the tests and see – Nabeel Khan May 07 '16 at 11:52
  • @NabeelKhan: yes but it could result in a page swap, cache faults, etc. Furthermore they will make the relative difference less impressive since the common part will generate a constant factor for both results. – Willem Van Onsem May 07 '16 at 11:55
  • i made the test like this, to perform the "practical" test on a setup where text string is first generated dynamically (like most php systems) and then at the end running preg or str on it – Nabeel Khan May 07 '16 at 11:57