
So, I've been researching this for a while now and I couldn't find anything about detecting a URL in a string. The problem is that most results are about detecting whether a string IS a URL, not whether it contains one. The two results that look best to me are

Regex to find urls in string in Python and Detecting a (naughty or nice) URL or link in a text string

but the first requires http://, which is not something spammers would use (:P), and the second one isn't a regex, and with my limited knowledge I don't know how to translate either of them. Something I have considered doing is using something dull like

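spamlist = [".com", ".co.uk", "etc"]
for word in string.split():  # split into words; iterating the string directly yields characters
    # flag the word if it contains any of the listed suffixes
    if any(suffix in word for suffix in spamlist):
        Do().stuff()  # placeholder for the real handling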

But that would honestly do more harm than good, and I am 100% sure there is a better way using regex or something else!

So if anyone knows anything that could help me, I'd be very grateful! I've only been doing Python for 1-2 months, and not very intensively, but I feel like I'm making great progress and this one thing is all that's in the way, really.

EDIT: Sorry for not specifying earlier: I am looking to use this locally, not in anything website-based (Apache) or similar. I'm mostly trying to clean out any links from files I've got hanging around.

  • Did you consider more advanced methods of detecting spam? Like using an existing mature solution like SpamAssassin? – ivan_pozdeev Sep 19 '14 at 12:27
  • As @ivan_pozdeev mentioned, don't try to re-invent the wheel... this stuff is really tricky, especially because what counts as a URL without http:// is so permissive – user3012759 Sep 19 '14 at 12:29
  • The solution in [Detecting a (naughty or nice) URL...](http://stackoverflow.com/questions/700163) *is* a regex btw. – ivan_pozdeev Sep 19 '14 at 12:30
  • @ivan_pozdeev this may sound dumb then, but when I tried plugging it into re.findall() it didn't work. Did I do something terribly wrong then? EDIT: And I have looked into SpamAssassin, but it does not appear to serve my non-website purpose. Sorry for not specifying that; will edit now – user3817979 Sep 19 '14 at 12:33
  • You probably didn't use a [raw string](https://docs.python.org/2/reference/lexical_analysis.html#string-literals) for the regex or escape backslashes in it. – ivan_pozdeev Sep 19 '14 at 12:40
  • @user3817979 [Send simple text (not email) to SpamAssassin](http://stackoverflow.com/questions/4199860/send-simple-text-not-email-to-spamassassin) suggests that SpamAssassin is indeed not tailored to process anything other than e-mail. That's just what came to my mind first. Looking through Wikipedia, I ran into [CRM114](https://en.wikipedia.org/wiki/CRM114_%28program%29), which is a further advancement on the aging Bayesian method. – ivan_pozdeev Sep 19 '14 at 13:06

1 Answer


As I said in the comments,

  • The solution in Detecting a (naughty or nice) URL or link in a text string is a regex. When using it in Python, you should probably make it a raw string or escape the backslashes in it

  • You really shouldn't reinvent the square wheel here, especially since spam filtering is an arms race domain (couldn't remember the exact English phrase for this)
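
To illustrate the raw-string point, here is a minimal sketch. The pattern is a deliberately simplified one written for this example, not the regex from the linked answer, and the names `URL_PATTERN`, `find_urls` and `strip_urls` are hypothetical. It matches host-like tokens with or without `http://`, which also means it will produce false positives on things like `file.txt`.

```python
import re

# Simplified, illustrative pattern (an assumption, NOT the regex from the
# linked answer). The r"..." prefix keeps backslashes such as \. and \s
# intact, so they reach the regex engine unchanged.
URL_PATTERN = re.compile(
    r"(?:https?://|www\.)?"        # optional scheme or "www." prefix
    r"(?:[a-z0-9-]+\.)+[a-z]{2,}"  # host: dot-separated labels, alphabetic TLD
    r"(?:/\S*)?",                  # optional path, up to the next whitespace
    re.IGNORECASE,
)


def find_urls(text):
    """Return every URL-like substring found in text."""
    return URL_PATTERN.findall(text)


def strip_urls(text):
    """Return text with every URL-like substring removed."""
    return URL_PATTERN.sub("", text)


sample = "Buy now at spam.co.uk/deal or www.example.com, or http://example.org"
print(find_urls(sample))
# ['spam.co.uk/deal', 'www.example.com', 'http://example.org']
```

For cleaning links out of local files, `strip_urls` could be applied to each file's contents after reading it. A pattern this loose trades precision for recall, so it is worth checking its matches on real data before deleting anything.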

ivan_pozdeev