
I'm trying to extract one or more URLs from a plain-text string in PHP. Here are some examples:

"mydomain.com has hit the headlines again"

extract "http://www.mydomain.com"

"this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net"

extract "http://www.domain.com", "http://www.anotherdomain.co.uk", "http://www.thirddomain.net"

There are two special cases I need to handle. I'm thinking regex, but I don't fully understand regular expressions:
1) all symbols like '(' or ')' and spaces (excluding hyphens) need to be removed
2) the word "dot" needs to be replaced with the symbol ".", so "dot com" would become ".com"

P.S. I'm aware of PHP validation/regex for URLs, but I can't work out how to use it to achieve the end goal.

Thanks

thatguy

1 Answer


In this case it will be hard to get 100% correct results. Depending on the input, you could try restricting matches to the most popular top-level domains (add more to the list):

(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b

You may need to remove the word boundary (\b) to get different results.

You can test it here:

http://bit.ly/dlrgzQ
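To show how the pattern above could produce the output the question asks for, here is a minimal sketch. The function name `extract_urls` is hypothetical, and rebuilding every match as `http://www.<host>` is an assumption based on the question's examples, not something the pattern does on its own.

```php
<?php
// Hypothetical helper: extract_urls() is an illustration, not part of the original answer.
function extract_urls(string $text): array
{
    // The pattern from the answer, wrapped in '#' delimiters so '/' needs no escaping.
    $pattern = '#(?:https?://)?[a-zA-Z0-9\-\.]+\.(?:com|org|net|biz|edu|uk|ly|gov)\b#';
    preg_match_all($pattern, $text, $matches);

    $urls = [];
    foreach ($matches[0] as $match) {
        // Assumption: strip any scheme and leading "www." so every result is rebuilt the same way.
        $host = preg_replace('#^(?:https?://)?(?:www\.)?#i', '', $match);
        $urls[] = 'http://www.' . strtolower($host);
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($urls));
}

print_r(extract_urls('this is 1 domain.com and this is anotherdomain.co.uk but sometimes http://thirddomain.net'));
```

For the second example string in the question, this should yield `http://www.domain.com`, `http://www.anotherdomain.co.uk` and `http://www.thirddomain.net`. Anything ending in a TLD outside the list is silently skipped, which is the trade-off of this approach.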

EDIT: About your cases: 1) Remove from what? 2) This could be done in PHP like this:

 $result = preg_replace('/\s+dot\s+(?=(com|org|net|biz|edu|and_ect))/', '.', $input);
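A runnable variant of that replacement might look like the sketch below. The function name `replace_spelled_dots` is hypothetical, the TLD list is an assumption to be extended as needed, and the `i` flag and the trailing `\b` are additions (so that "Dot" matches and "dot communities" does not).

```php
<?php
// Hypothetical helper name; the TLD list is an assumption, extend it as needed.
function replace_spelled_dots(string $input): string
{
    // Replace " dot " only when a listed TLD follows as a whole word (lookahead keeps the TLD in place).
    return preg_replace('/\s+dot\s+(?=(?:com|org|net|biz|edu)\b)/i', '.', $input);
}

echo replace_spelled_dots('visit mydomain dot com today'), "\n";  // visit mydomain.com today
echo replace_spelled_dots('connect the dot to the line'), "\n";   // unchanged: "to" is not a listed TLD
```

Running this step before URL extraction lets the extraction pattern see `mydomain.com` as an ordinary domain.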

But I have a few important notes:

  • These regexes are meant as guidance, not as production code.
  • Working with such loose rules on free text is wacky, to say the least, and adding more special cases will only make it loonier. Consider this: even Stack Overflow doesn't do that. It auto-links

http://example.org

but not

example.org

  • It would be easier if you said what you are trying to achieve. If you want to process text that will later be published somewhere on the web, this is a very bad idea! You should not roll this yourself (as you said, you don't fully understand regex), because it would just open a can of XSS worms. Better to consider some kind of markup language such as Markdown or BBCode.

Also take a look at: http://htmlpurifier.org/
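To make the XSS concern concrete, here is a minimal sketch of the escape-everything approach, linking only explicit `http(s)://` URLs in the way described above for Stack Overflow. The function name `linkify` and the URL pattern are assumptions for illustration, and this is not a substitute for a vetted library such as HTML Purifier.

```php
<?php
// Hypothetical helper: escapes all text and links only explicit http(s) URLs.
function linkify(string $text): string
{
    // Split into alternating pieces: plain text at even indexes, captured URLs at odd indexes.
    $parts = preg_split('#(\bhttps?://[^\s<>"\']+)#i', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        // Escape every piece so user-supplied markup never reaches the page as HTML.
        $safe = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $out .= ($i % 2 === 1) ? '<a href="' . $safe . '">' . $safe . '</a>' : $safe;
    }
    return $out;
}

echo linkify('<script>alert(1)</script> see http://example.org now'), "\n";
```

Here the script tag should come out as inert escaped text, and only the explicit URL becomes a link. A bare `example.org` is left as plain text.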

Ernest
    +1, but you might want to add `[a-z]{2}` as an alternative top-level domain to allow country-code and special domains like `amazon.de`, `apple.tv`, etc. (and drop `uk` and `ly` from the list), if you want to match domains like these. – Tim Pietzcker Nov 06 '10 at 10:56