Regex to disallow too many sequential, non-whitespace characters - while allowing links

Question

I'm looking to use a regex to replace sequential runs of non-whitespace characters (say more than 35) with only the first 35 characters. I would like to allow strings with "http" in them to remain as they are (so as not to break links).

The strings will be from user input, and if somebody types 50 'x' characters in a row it may go outside of my <DIV> container and disrupt the layout. The runs might come at the beginning of a line or in the middle of one.

For example, I would like to disallow these types of input:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

12345 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

but not these:

http://somesite.com/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

12345 http://somesite.com/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

I got the idea of using a negative lookaround from this question.

I'm getting mixed results with this regex:

$comment = preg_replace('/^(((?!http).){25})(((?!http).)*)$/imUs', '$1', $comment);

That regex is preserving links, but it is also trimming acceptable text down to 25 characters.

text text text text text text text text text text text text text text text text text text text text text text text text

is becoming

text text text text tex

From reading regexes in other questions, I have a feeling that this can be done with a more elegant regex than the one I show above. Thanks for any suggestions.
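The truncation described above can be reproduced in isolation. One likely reading of the pattern (an interpretation, not something stated in the thread) is that with the `s` modifier `.` also matches whitespace, and with `^`, `$` and the `m` modifier the match spans a whole line. The pattern therefore keeps the first 25 characters of any line that does not contain "http", rather than counting non-whitespace runs. The variable names below are illustrative only.

```php
<?php
// The pattern from the question, unchanged.
$pattern = '/^(((?!http).){25})(((?!http).)*)$/imUs';

// Ordinary prose with no long runs, and a link with a long tail.
$text = rtrim(str_repeat('text ', 24));
$link = 'http://somesite.com/' . str_repeat('x', 55);

$trimmedText = preg_replace($pattern, '$1', $text);
$trimmedLink = preg_replace($pattern, '$1', $link);

// The prose line is cut to its first 25 characters, spaces included.
var_dump($trimmedText === substr($text, 0, 25));

// The link is left alone because the lookahead fails at its first character.
var_dump($trimmedLink === $link);
```

Both `var_dump` calls print `bool(true)`, which matches the behaviour reported in the question: links survive, but ordinary text is trimmed.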

Compeek · Accepted Answer · 2011-04-22T04:08:55.380

I came up with this, and some quick testing seems to show it working for me, but let me know if it works correctly for you.

$comment = preg_replace('/(^|\s)((?!http)[^\s"]{25})[^\s"]+/i', '$1$2', $comment);

Obviously replace the 25 with whatever your max length should be.
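As a quick check, the pattern can be run against the four sample inputs from the question. This is a minimal sketch: the wrapper function `truncateRuns` and its `$max` parameter are hypothetical names introduced here, and the sample strings are rebuilt with `str_repeat` rather than copied character for character.

```php
<?php
// Hypothetical helper wrapping the pattern from the answer; $max is the run limit.
function truncateRuns(string $comment, int $max = 25): string
{
    $pattern = '/(^|\s)((?!http)[^\s"]{' . $max . '})[^\s"]+/i';
    // preg_replace returns null on error; fall back to the untouched input.
    return preg_replace($pattern, '$1$2', $comment) ?? $comment;
}

$samples = [
    str_repeat('x', 65),                                 // long run at line start
    '12345 ' . str_repeat('x', 66),                      // long run mid-line
    'http://somesite.com/' . str_repeat('x', 55),        // link at line start
    '12345 http://somesite.com/' . str_repeat('x', 57),  // link mid-line
];

foreach ($samples as $sample) {
    echo truncateRuns($sample), PHP_EOL;
}
```

The first two samples come back with the run cut to 25 characters, and the two link samples come back unchanged, because the `(?!http)` lookahead rejects any run that starts with "http".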

edited Apr 22 '11 at 04:08

answered Apr 22 '11 at 03:26


Pretty close. It does truncate strings with "http" in them though. A link becomes: `ddddddddd – broncozr Apr 22 '11 at 03:48
Wait, could you clarify that? I didn't realize you had HTML tags in there since you didn't have them in the original post. I think that's going to make it much more difficult. :\ – Compeek Apr 22 '11 at 03:50
Sorry about that. I'm most interested in eliminating runs of 25+ that are non-whitespace. Since the `href="http://example.com"` is usually the part of an HTML link that has no whitespace I was focusing on that. I wouldn't touch the `` at the end. – broncozr Apr 22 '11 at 04:02
Okay, I made two small modifications in my answer (added a double-quote character in two places), and I think it stops breaking the links in the a tags. Let me know if it's still not working properly. – Compeek Apr 22 '11 at 04:10
Technically the updated pattern has some flaws. It's just that without getting ridiculously complicated to work with HTML tags, it doesn't technically do exactly what you're looking for. It should still give you the output you need, I think. – Compeek Apr 22 '11 at 04:14
I've noticed that the general consensus on SO is that regex+html is a major pain and is to be avoided if possible. Your updated code works perfectly for what I need. Thanks a lot! – broncozr Apr 22 '11 at 04:25
I've noticed the same consensus other places as well. RegEx was just never meant to work with XML, but since it's such a generic tool, it's often the first solution that comes to mind. Anyway, I'm happy to hear it's working for you now. Glad to help! – Compeek Apr 22 '11 at 04:29
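
The flaws mentioned in the comments are not spelled out in the thread. One that can be observed is that the lookahead only checks for "http" at the very start of a run, so a link preceded by punctuation (for example `see:http://...`) would still be truncated. A minimal sketch of an alternative, assuming plain-text input without HTML tags, uses `preg_replace_callback` to inspect each long run as a whole. The function name `truncateLongRuns` and its `$max` parameter are hypothetical, and this is not the approach taken in the accepted answer.

```php
<?php
// Hypothetical alternative: cut any whitespace-free run longer than $max,
// unless the run contains "http" anywhere (case-insensitive).
// Assumes plain text; HTML tags and attributes are not handled specially.
function truncateLongRuns(string $comment, int $max = 25): string
{
    // Match only runs that exceed the limit, e.g. \S{26,} for $max = 25.
    $pattern = '/\S{' . ($max + 1) . ',}/';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($max): string {
            $run = $m[0];
            if (stripos($run, 'http') !== false) {
                return $run; // leave anything that looks like a link intact
            }
            // substr counts bytes; multibyte input would need mb_substr
            // and the /u modifier on the pattern.
            return substr($run, 0, $max);
        },
        $comment
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $comment;
}

echo truncateLongRuns('see:http://example.com/' . str_repeat('x', 50)), PHP_EOL;
echo truncateLongRuns('12345 ' . str_repeat('x', 50)), PHP_EOL;
```

The trade-off is that any long run containing the letters "http" is exempt, which is closer to the wording of the question but also easier for a user to exploit deliberately.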
