
Is there a way to use preg_replace() to append the string "utm=some&medium=stuff" to every URL found in $html_text?

$html_text = 'Lorem ipsum <a href="http://www.me.com">dolor sit</a> amet, 
              <a href="http://www.me.com/page.php?id=10">consectetur</a> elit.';

So the result should be

href="http://www.me.com" becomes
href="http://www.me.com?utm=some&medium=stuff"

href="http://www.me.com/page.php?id=10" becomes
href="http://www.me.com/page.php?id=10&utm=some&medium=stuff"

So, if the URL already contains a question mark (the second URL), an ampersand "&" should be added in front of "utm=some..." instead of a question mark "?".

Ideally, it would only alter URLs on the domain me.com.

X-Pippes
Maria

5 Answers


This is a little bit tricky, but the following code should work if your URLs are all enclosed in quotation marks (single or double). It will also handle fragment identifiers (like #section-2).

$url_modifier = 'utm=some&medium=stuff';
$url_modifier_domain = preg_quote('www.me.com', '#');

$html_text = preg_replace_callback(
              '#((?:https?:)?//'.$url_modifier_domain.'(/[^\'"\#]*)?)(?=[\'"\#])#i',
              // "use" imports $url_modifier, so this also works inside a function
              function($matches) use ($url_modifier) {
                // no path at all: add "/" before the query string
                if (!isset($matches[2])) return $matches[1]."/?$url_modifier";
                $q = strpos($matches[2],'?');
                // no query string yet
                if ($q===false) return $matches[1]."?$url_modifier";
                // URL already ends in "?"
                if ($q==strlen($matches[2])-1) return $matches[1].$url_modifier;
                // existing query string: append with "&"
                return $matches[1]."&$url_modifier";
              },
              $html_text);

Input:

<a href="http://www.me.com">Lorem</a>
<a href="http://www.me.com/">ipsum</a>
<a href="http://www.me.com/#section-2">dolor</a>
<a href="http://www.me.com/path-to-somewhere/file.php">sit</a>
<a href="http://www.me.com/?">amet</a>,
<a href="http://www.me.com/?foo=bar">consectetur</a>
<a href="http://www.me.com/?foo=bar#section-3">elit</a>.

Output:

<a href="http://www.me.com/?utm=some&medium=stuff">Lorem</a>
<a href="http://www.me.com/?utm=some&medium=stuff">ipsum</a>
<a href="http://www.me.com/?utm=some&medium=stuff#section-2">dolor</a>
<a href="http://www.me.com/path-to-somewhere/file.php?utm=some&medium=stuff">sit</a>
<a href="http://www.me.com/?utm=some&medium=stuff">amet</a>,
<a href="http://www.me.com/?foo=bar&utm=some&medium=stuff">consectetur</a>
<a href="http://www.me.com/?foo=bar&utm=some&medium=stuff#section-3">elit</a>.
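The separator logic in the callback can also be read on its own, without the regex. The following is a minimal sketch for a single URL string; the helper name append_query() is hypothetical and not part of the answer, and unlike the callback it does not insert a "/" when the URL has no path.

```php
<?php
// Hypothetical helper (not from the answer above): append extra query
// parameters to a single URL, keeping any #fragment at the end.
function append_query(string $url, string $extra): string
{
    // Split off the fragment, if any, so the query goes in front of it.
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }

    $queryPos = strpos($url, '?');
    if ($queryPos === false) {
        $separator = '?';   // no query string yet
    } elseif ($queryPos === strlen($url) - 1 || substr($url, -1) === '&') {
        $separator = '';    // URL already ends in "?" or "&"
    } else {
        $separator = '&';   // extend the existing query string
    }

    return $url . $separator . $extra . $fragment;
}

echo append_query('http://www.me.com/?foo=bar#section-3', 'utm=some&medium=stuff'), "\n";
```

Keeping this decision in a small function makes it possible to test the edge cases (trailing "?", fragment, no query) separately from the pattern that finds the URLs.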
r3mainer

You can achieve this with preg_replace, using two patterns and two replacements:

<?php
$add = "utm=some&medium=stuff";
$patterns = array(
                '/(https?:\/\/(?:www\.)?me\.com(?=[^"]*\?)[^"]*)/',  # positive lookahead: this URL already contains a ?
                '/(https?:\/\/(?:www\.)?me\.com(?![^"]*\?)[^"]*)/'   # negative lookahead: this URL contains no ?
        );
$replacements = array(
                    '$1&'.$add, # replacement when the first pattern matches
                    '$1?'.$add  # replacement when the second pattern matches
            );
$str = 'Lorem ipsum <a href="http://www.me.com">dolor sit</a> amet, <a href="http://www.me.com/page.php?id=10">consectetur</a> elit.';
$str = preg_replace($patterns, $replacements, $str);
echo $str;

/* Output:
Lorem ipsum <a href="http://www.me.com?utm=some&medium=stuff">dolor sit</a> amet, <a href="http://www.me.com/page.php?id=10&utm=some&medium=stuff">consectetur</a> elit.
*/
?>

I liked the other answers' DOM-based solutions, so I timed each snippet on the following input:

<a href="http://www.me.com">Lorem</a>
<a href="http://www.me.com/">ipsum</a>
<a href="http://www.me.com/#section-2">dolor</a>
<a href="http://www.me.com/path-to-somewhere/file.php">sit</a>
<a href="http://www.me.com/?">amet</a>,
<a href="http://www.me.com/?foo=bar">consectetur</a>
<a href="http://www.me.com/?foo=bar#section-3">elit</a>.

With microtime:

$ts = microtime(true);
// codes
printf("%.10f\n", microtime(true) - $ts);

The results are below (in seconds, since microtime(true) returns seconds):

@squeamish ossifrage:  0.0001089573
@Cobra_Fast:           0.0003509521
@Emissary:             0.0094890594
@Me:                   0.0000669956

The interesting part for me: the regex-based solutions were the fastest.
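A single microtime() measurement of a snippet this small is dominated by noise, so the numbers are easier to compare when each snippet is run many times and averaged. The following is a minimal sketch of that idea; the helper name average_seconds(), the iteration count and the sample regex are assumptions for illustration, and hrtime() requires PHP 7.3 or later.

```php
<?php
// Hypothetical timing helper: run a callable many times and return the
// average wall-clock time per call in seconds.
function average_seconds(callable $fn, int $iterations = 10000): float
{
    $start = hrtime(true);   // monotonic clock, in nanoseconds (PHP 7.3+)
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e9 / $iterations;
}

$html = '<a href="http://www.me.com/?foo=bar">consectetur</a>';

// Time one illustrative regex replacement; swap in the snippet under test.
$avg = average_seconds(function () use ($html) {
    preg_replace('#(//www\.me\.com/\?[^"]*)#', '$1&utm=some&medium=stuff', $html);
});

printf("%.10f s per call\n", $avg);
```

Each candidate snippet would be wrapped in its own closure and passed to the same helper, so all of them are measured under the same conditions.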

revo
  • Your solution is indeed very slick! But there seems to be a bug in the domain checking pattern. This does not work: `'/(https?:\/\/(?:www)?me.com(?=.*?\?)[^"]*)/',` `'/(https?:\/\/(?:www)?me.com(?!.*?\?)[^"]*)/'` This works: `'/(https?:\/\/(?=.*?\?)[^"]*)/',` `'/(https?:\/\/(?!.*?\?)[^"]*)/'` – Maria Oct 27 '13 at 23:14
  • @Maria oh yes. I just forgot to put a backslash before the dot: `me\.com`. I updated the answer. – revo Oct 28 '13 at 08:26

This is a trivial task using DOMDocument:

$html_text = 'Lorem ipsum <a href="http://www.me.com">dolor sit</a> amet, <a href="http://www.me.com/page.php?id=10">consectetur</a> elit.';

$html = new DOMDocument();
$html->loadHtml($html_text);

foreach ($html->getElementsByTagName('a') as $element)
{
    $href = $element->getAttribute('href');
    if (!empty($href)) // only edit the attribute if it's set
    {
        // check if we need to append with ? or &
        if (strpos($href, '?') === false)
            $href .= '?';
        else
            $href .= '&';

        // append querystring
        $href .= 'utm=some&medium=stuff';

        // set attribute
        $element->setAttribute('href', $href);
    }
}

// output altered code
echo $html->C14N();

Fiddle: http://phpfiddle.org/lite/code/wvq-ujk
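The loop above tags every link, whereas the question asks to touch only URLs on me.com. One way to add that restriction is to compare the host returned by parse_url() against a list of accepted hosts. The following is a minimal sketch; the helper name tag_own_links() and the host list are assumptions, and the fragment handling is an addition that the answer's code does not have.

```php
<?php
// Hypothetical helper (not part of the answer above): tag only links whose
// host is listed in $ownHosts, keeping any #fragment at the end of the URL.
function tag_own_links(DOMDocument $doc, array $ownHosts, string $extra): void
{
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);   // null for relative URLs
        if (!is_string($host) || !in_array(strtolower($host), $ownHosts, true)) {
            continue;                             // foreign or relative link
        }
        // Split off the fragment so the query string goes in front of it.
        $parts = explode('#', $href, 2);
        $parts[0] .= (strpos($parts[0], '?') === false ? '?' : '&') . $extra;
        $a->setAttribute('href', implode('#', $parts));
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<a href="http://www.me.com/page.php?id=10#top">a</a> <a href="http://notme.com/">b</a>');
tag_own_links($doc, array('me.com', 'www.me.com'), 'utm=some&medium=stuff');

foreach ($doc->getElementsByTagName('a') as $a) {
    echo $a->getAttribute('href'), "\n";
}
```

Comparing the parsed host, rather than searching the whole URL for "me.com", avoids accidentally matching hosts such as notme.com.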

Cobra_Fast

If you'd like to abstract all the nasty parsing away from your script, you can always use a DOM parser, of which there are many available. For this example I've opted for Simple HTML DOM, as it's the only one I'm actually familiar with (it's admittedly not the most efficient library, but you aren't doing anything intensive).

include 'simple_html_dom.php';
$html = str_get_html($htmlString);

foreach($html->find('a') as $a){
    $url = strtolower($a->href);
    if( strpos($url, 'http://me.com')     === 0 ||
        strpos($url, 'http://www.me.com') === 0 ||
        strpos($url, 'http://') !== 0 // local url
    ){
        $url = explode('?', $url, 2);
        if(count($url)<2) $qry = array();
        else parse_str($url[1], $qry);
        $qry = array_merge($qry, array(
            'utm'    => 'some',
            'medium' => 'stuff'
        ));
        $parts = array();
        foreach($qry as $key => $val)
            $parts[] = "{$key}={$val}";
        $a->href = sprintf("%s?%s", $url[0], implode('&', $parts));
    }
}

echo $html;

In this example I've assumed that me.com is your website and that local paths should also qualify. I am also assuming that query strings are likely to be simple key=value pairs. In its current form, if a URL already has one of your query parameters, it is overwritten. If you'd like to retain the existing values, you will need to swap the order of the arguments passed to array_merge.
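The effect of the argument order in array_merge can be seen in isolation. The following is a minimal sketch using the same parameters as the example; it uses http_build_query() in place of the manual join, which is a substitution on my part and additionally URL-encodes keys and values.

```php
<?php
// Query string already present on the link, as in the last input line.
parse_str('id=10&utm=bla', $existing);
$tracking = array('utm' => 'some', 'medium' => 'stuff');

// For duplicate string keys, the later argument wins.
$overwrite = array_merge($existing, $tracking);   // tracking values replace existing ones
$keep      = array_merge($tracking, $existing);   // existing values are retained

// The separator is passed explicitly so the result does not depend on php.ini.
echo http_build_query($overwrite, '', '&'), "\n"; // id=10&utm=some&medium=stuff
echo http_build_query($keep, '', '&'), "\n";      // utm=bla&medium=stuff&id=10
```

Note that swapping the arguments also changes the order in which the parameters appear in the resulting query string.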

input:

<a href="http://me.com/">test</a> 
<a href="http://WWW.me.com/">test</a> 
<a href="local.me.com.php">test</a> 
<a href="http://notme.com">test</a> 
http://me.com/not-a-link
<a href="http://me.com/?id=10&utm=bla">test</a>

output:

<a href="http://me.com/?utm=some&medium=stuff">test</a> 
<a href="http://www.me.com/?utm=some&medium=stuff">test</a> 
<a href="local.me.com.php?utm=some&medium=stuff">test</a> 
<a href="http://notme.com">test</a> 
http://me.com/not-a-link 
<a href="http://me.com/?id=10&utm=some&medium=stuff">test</a>
Emissary

If you have problems with DOMDocument and UTF-8, try the following:

$html_text = '<p>This is a text with special chars ÄÖÜ <a 
href="http://example.com/This-is-my-Page" 
target="_self">here</a>.</p>';
$html_text .= '<p>continue</p>';

$html = new DOMDocument('1.0', 'utf-8');

// Set charset-header for DOMDocument
$html_prepared = '<html>'
  . '<head>'
  . '<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
  . '</head>'
  . '<body>'
  . '<div>' . $html_text . '</div>'
  . '</body>'
  . '</html>';


$html->loadHtml($html_prepared);


foreach ($html->getElementsByTagName('a') as $element)
{
    $href = $element->getAttribute('href');
    if (!empty($href)) // only edit the attribute if it's set
    {
        // check if we need to append with ? or &
        if (strpos($href, '?') === false)
            $href .= '?';
        else
            $href .= '&';

        // append querystring
        $href .= 'utm=some&medium=stuff';

        // set attribute
        $element->setAttribute('href', $href);
    }
}


// 1) Remove doctype-declaration
$html->removeChild($html->firstChild);
// 2) Remove head
$html->firstChild->removeChild($html->firstChild->firstChild);
// 3) Only keep body's first Child
$html->replaceChild($html->firstChild->firstChild->firstChild, $html->firstChild);

print $html->saveHTML();