I need to shorten a given text (which may come in different encodings) to a fixed length, e.g. 140 characters, without touching the links it contains.

Example:

Lorem ipsum dolor sit amet: http://bit.ly/111111 Consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat. http://bit.ly/222222 Sed diam voluptua. At vero eos et accusam et justo duo dolores. http://bit.ly/111111

Should end up as:

Lorem ipsum dolor sit amet: http://bit.ly/111111 Consetetur sadipscing elitr, sed diam nonumy... http://bit.ly/222222 http://bit.ly/111111

My actual code with examples is here: http://phpfiddle.org/lite/code/er7-sty

function shortenMessage($message,$limit=140,$encoding='utf-8') {
  if (mb_strlen($message,$encoding) <= $limit) return $message;
  echo '<pre><h3>Original message:<br />'.$message.'<hr>';
  # search positions of links
  $reg_exUrl = "/(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
  preg_match_all ($reg_exUrl, $message, $links,PREG_OFFSET_CAPTURE);
  echo 'Links found:<br />';
  var_dump($links[0]);
  echo '<hr>';
  $position = array();
  $len = 0;
  # search utf-8 position of links
  foreach ($links[0] as $values) {
    $url = $values[0];
    $offset = $values[1];
    #$pos = mb_strpos($message, $url, $offset, $encoding); # doesnt work
    $pos = mb_strpos($message, $url, 0, $encoding);
    $position[$pos] = $url;
    # delete url from string
    $message = str_replace($url, '', $message);
    $len += mb_strlen($url,$encoding); # sum the length of the urls to subtract from $maxlenght
  }
  echo 'UTF-8 Positions:<br />';
  var_dump($position);
  echo '<hr>';
  # shorten text
  $maxlenght = $limit - $len - 7; # 7 is a security buffer
  while ($maxlenght < 0) { # too many urls? then cut some...
    array_shift($position);
    $len -= mb_strlen($position[0],$encoding);
    $maxlenght = $limit - $len - 6;
  }
  echo 'UTF-8 Positions shortened:<br />';  
  var_dump($position);
  echo '<hr>';
  $message = mb_substr($message,0,$maxlenght,$encoding).'... ';
  echo 'Shortened message without urls:<br />'; 
  var_dump($message);
  echo '<hr>';
  # re-insert urls at right positions
  $addpos = 0;
  foreach ($position as $pos => $url) {
    $pos += $addpos;
    if ($pos < mb_strlen($message,$encoding)) {
      $message = mb_substr($message,0,$pos,$encoding).$url.mb_substr($message,$pos,mb_strlen($message),$encoding);
    } else {
      $message .= ' '.$url;
    }
    $addpos += mb_strlen($url,$encoding);
  }
  echo 'Shortened message:<br />';
  var_dump($message); 
  echo '<hr>';
  return $message;
}

It works when every link in the text is different, but fails when the same link appears more than once.

I've already tried using the position from preg_match_all as the offset for mb_strpos, but I think this fails because of the "preg-match-utf8-problem".

I've already seen "Shortening text tweet-like without cutting links inside", but the answers there ignore the encoding and deal with HTML tags...

Petra
  • What's the "preg-match-utf8-problem"? – deceze Oct 29 '13 at 13:34
  • See here: http://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php – Petra Oct 29 '13 at 13:53
  • Not sure what the "problem" is there. The offset that `preg_match` captures is the, well, *offset*. A *string offset* is consistently defined in PHP as the *nth byte*. The OP expects it to mean the *nth character*, but that's a wrong expectation. – deceze Oct 29 '13 at 14:09
  • Ok - but that didn't help me to solve the problem right now. Or how do I calculate the character offset needed for mb_substr from the byte-offset? – Petra Oct 29 '13 at 14:56
  • That was answered there too: http://stackoverflow.com/a/1725329/476 – deceze Oct 29 '13 at 15:03
  • Sorry - I didn't get that. Could you provide an example? In my case, e.g., preg_match gives me a position of 158 for a link, while mb_strpos with utf-8 gives me 138. So using 158 as the offset for mb_strpos would not work, as mb_strpos uses a character count - not a byte count. – Petra Oct 29 '13 at 15:15
  • Exactly. And the above answer explains how to convert a *byte offset* into a *character offset*. – deceze Oct 29 '13 at 19:40
  • Did you check my answers? – kwelsan Oct 30 '13 at 10:16
  • Yes - but they don't fit the question - even after the edit. – Petra Oct 31 '13 at 17:15
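
Following up on the byte-versus-character discussion in the comments above, here is a minimal sketch of one way to turn the byte offset reported by PREG_OFFSET_CAPTURE into a character offset that the mb_* functions understand. It assumes the message is UTF-8. The helper name byteOffsetToCharOffset, the simplified URL pattern and the sample string are made up for illustration, and this is not necessarily how the linked answer does it.

```php
<?php
// Hypothetical helper (not from the original post): convert a byte offset,
// as reported by PREG_OFFSET_CAPTURE, into a character offset.
function byteOffsetToCharOffset($string, $byteOffset, $encoding = 'UTF-8') {
    // substr() counts bytes, so this is exactly the text before the match;
    // mb_strlen() then counts the characters in that prefix.
    return mb_strlen(substr($string, 0, $byteOffset), $encoding);
}

// Sample text with multibyte characters and a duplicated link.
$message = 'Grüße über http://bit.ly/111111 und http://bit.ly/111111';
preg_match_all('~https?://\S+~', $message, $links, PREG_OFFSET_CAPTURE);

foreach ($links[0] as $match) {
    list($url, $byteOffset) = $match;
    $charOffset = byteOffsetToCharOffset($message, $byteOffset);
    // Reading from the converted offset with mb_substr() yields the URL
    // again, which also holds for the second (duplicate) occurrence.
    echo $byteOffset, ' -> ', $charOffset, ': ',
         mb_substr($message, $charOffset, mb_strlen($url, 'UTF-8'), 'UTF-8'), "\n";
}
```

Because every match carries its own offset, duplicate links no longer need to be looked up again with mb_strpos.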

2 Answers

0

I think I've found a solution; perhaps it helps someone. When a link is used more than once, I take the last position that mb_strpos found for that link, plus one, as the search offset, so I never have to deal with byte counts...

function shortenMessage($message,$limit=140,$encoding='utf-8') {
    if (mb_strlen($message,$encoding) <= $limit) return $message;
    # search positions of links
    $reg_exUrl = "/(http|https)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
    preg_match_all ($reg_exUrl, $message, $links);
    $position = array();
    $len = 0;
    # get position of links depending on encoding
    foreach ($links[0] as $url) {
        $offset = 0;
        $keys = array_keys($position, $url);
        if ($keys) { # url was already used - take offset in advance
            $lastpos = end($keys);
            $offset = $lastpos + 1;
        }
        $pos = mb_strpos($message, $url, $offset, $encoding);
        $position[$pos] = $url;
    }
    # delete urls from string 
    foreach ($position as $url) {
        $message = str_replace($url, '', $message);
        $len += mb_strlen($url,$encoding); # sum the length of the urls to subtract from $maxlenght
    }
    # shorten text
    $maxlenght = $limit - $len - 7; # 7 is a security buffer
    while ($maxlenght < 0) { # too many urls? then cut some... 
        $key = min(array_keys($position)); 
        $len -= mb_strlen($position[$key],$encoding);
        $maxlenght = $limit - $len - 6;
        unset($position[$key]);
    }
    $message = mb_substr($message,0,$maxlenght,$encoding).'... ';

    # re-insert urls at right positions
    $lasturl = '';
    foreach ($position as $pos => $url) {
        if ($pos < mb_strlen($message,$encoding)) {
            $message = mb_substr($message,0,$pos,$encoding).$url.mb_substr($message,$pos,mb_strlen($message,$encoding),$encoding);
        } elseif ($url != $lasturl) { # avoid adding the same url at the end
            $message .= ' '.$url;
        }
        $lasturl = $url;
    }
    return $message;
}
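
For comparison, a rough sketch of a different approach that avoids position bookkeeping altogether: split the message into text and URL parts with preg_split, then trim the text parts from the end until the whole thing fits. This is not the code above. The function name shortenKeepingLinks and the simplified URL pattern are assumptions for illustration, and the sketch assumes UTF-8 input.

```php
<?php
// Hypothetical alternative: trim text parts from the end, never the URLs.
function shortenKeepingLinks($message, $limit = 140, $encoding = 'UTF-8') {
    $excess = mb_strlen($message, $encoding) - $limit;
    if ($excess <= 0) {
        return $message;
    }
    // With one capture group and PREG_SPLIT_DELIM_CAPTURE the result
    // alternates: even indexes are text, odd indexes are URLs.
    $parts = preg_split('~(https?://\S+)~u', $message, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $message; // e.g. input was not valid UTF-8
    }
    $last = count($parts) - 1;
    $marker = '... ';

    for ($i = $last; $i >= 0 && $excess > 0; $i -= 2) {
        $len = mb_strlen($parts[$i], $encoding);
        $keep = $len - $excess - mb_strlen($marker, $encoding);
        if ($keep > 0) {
            // Cutting this part is enough: keep its beginning, add the marker.
            $new = rtrim(mb_substr($parts[$i], 0, $keep, $encoding)) . $marker;
        } else {
            // Drop the part; leave one space between two neighbouring URLs.
            $new = ($i > 0 && $i < $last && $len > 0) ? ' ' : '';
        }
        $excess -= $len - mb_strlen($new, $encoding);
        $parts[$i] = $new;
    }
    // If $excess is still positive here, the URLs alone exceed the limit.
    return rtrim(implode('', $parts));
}

$msg = 'Lorem ipsum dolor sit amet: http://bit.ly/111111 Consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat. http://bit.ly/222222 Sed diam voluptua. At vero eos et accusam et justo duo dolores. http://bit.ly/111111';
$short = shortenKeepingLinks($msg, 140);
echo $short, "\n";
```

Unlike the function above, this sketch may cut in the middle of a word and does not drop URLs when they alone exceed the limit; both would need extra handling.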
Petra
-1

Try this code:

$string = 'Lorem ipsum dolor sit amet: http://bit.ly/111111 Consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat. http://bit.ly/222222 Sed diam voluptua. At vero eos et accusam et justo duo dolores. http://bit.ly/222222';

$regex = '/https?\:\/\/[^\" ]+/i';
preg_match_all($regex, $string, $matches);
print_r($matches[0]);

UPDATED ANSWER

<?php
$string = 'Lorem ipsum dolor sit amet: http://bit.ly/111111 Consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat. http://bit.ly/222222 Sed diam voluptua. At vero eos et accusam et justo duo dolores. http://bit.ly/222222';
echo "Original String";
echo "<hr>";
echo $string;
$matched_string = preg_split('/https?\:\/\/[^\" ]+/i', $string);
echo "<br />";
echo "<br />";
echo "<br />";
echo "<br />";
echo "Shorten String";
echo "<hr>";
preg_match_all('/(https?\:\/\/[^\" ]+)/i', $string, $matched_url);
$urls = $matched_url[0];
$formatted_str = '';
for($i=0; $i< count($urls); $i++){
    if(strlen($matched_string[$i]) > 40){
        $formatted_str .= substr($matched_string[$i], 0, 40).'...'.$urls[$i];
    } else {
        $formatted_str .= $matched_string[$i].$urls[$i];
    }
}
echo $formatted_str;
?> 

ANOTHER SOLUTION [USED CSS TO SHORTEN TEXT LENGTH]

<?php
$string = 'Lorem ipsum dolor sit amet: http://bit.ly/111111 Consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat. http://bit.ly/222222 Sed diam voluptua. At vero eos et accusam et justo duo dolores. http://bit.ly/222222';
echo "Original String";
echo "<hr>";
echo $string;
echo "<br />";
echo "<br />";
echo "<br />";
echo "<br />";
echo "Shorten String";
echo "<hr>";
$formatted_str = preg_replace('/(https?\:\/\/[^\" ]+)/i', "</span><span>$1</span></div><div><span class=\"shorten\">", $string);
?> 
<html>
    <head>
        <style type="text/css">
            .shorten{
                background-color: #f00;
                text-overflow: ellipsis;
                width:300px;
                overflow: hidden;
                white-space:nowrap;
                float: left;
            }
            span{float: left}
        </style>
    </head>
<body>
    <div><span class="shorten"><?php echo $formatted_str; ?></span></div>
</body>
</html>
kwelsan
  • And this does what exactly? – deceze Oct 29 '13 at 13:33
  • by using the above code she will get all the urls from the string and after that she can use them in anyway... – kwelsan Oct 29 '13 at 13:35
  • I think the "use them in anyway" part is the part that's the problem here. – deceze Oct 29 '13 at 13:36
  • I think she can loop the matched urls to display... or if she wants to show on page than use CSS with style "text-overflow:ellipsis" that will automatically truncate the string.. – kwelsan Oct 29 '13 at 13:39