
I've been working on a regex censor for quite some time and can't seem to find a decent way of censoring web addresses (and attempts to circumvent the censor).

Here's what I have so far, ignoring escape sequences:

([a-zA-Z0-9_-]+[\\W[_]]*)+(\\.|[\\W]?|dot|\\(\\.\\)|[\\(]?dot[\\)]?)+([\\w]{2,6})((\\.|[\\W]?|dot|\\(\\.\\)|[\\(]?dot[\\)]?)([\\w]{1,4}))*

I'm not sure what is causing the problem, but it censors the words "com" and "come" and pretty much anything that is 3+ letters long.
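
A quick way to see the over-matching, sketched under the assumption that the pattern is compiled with java.util.regex exactly as written above. The class and method names are hypothetical. The likely cause is the `[\\W]?` alternative in the separator group: it can match the empty string, so no separator is actually required, and any run of three or more word characters can be split into "one or more characters, empty separator, two or more characters".

```java
import java.util.regex.Pattern;

final class OvermatchDemo {
    // The pattern from the question, copied verbatim as a Java string literal.
    static final Pattern ORIGINAL = Pattern.compile(
        "([a-zA-Z0-9_-]+[\\W[_]]*)+(\\.|[\\W]?|dot|\\(\\.\\)|[\\(]?dot[\\)]?)+"
        + "([\\w]{2,6})((\\.|[\\W]?|dot|\\(\\.\\)|[\\(]?dot[\\)]?)([\\w]{1,4}))*");

    // True if the pattern finds a match anywhere in the text.
    static boolean flagged(String text) {
        return ORIGINAL.matcher(text).find();
    }

    public static void main(String[] args) {
        // Plain words with no separator at all.
        for (String word : new String[] {"hi", "com", "come"}) {
            System.out.println(word + " -> " + flagged(word));
        }
    }
}
```

Here "com" and "come" are flagged while the two-letter "hi" is not, which is consistent with the "3+ letters" behaviour described above.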

Problem: I want to know how to censor website links, including malformed links that are attempts to circumvent the censor. Examples:

Google.com

goo gle .com

g o o g l e . c o m

go o gl e % com

go og le (.) c om

One more thing: is there a way to whitelist certain links? Thank you.
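
One possible direction, as a rough sketch rather than a finished censor: require a real separator before a known suffix, allow spaces inside the name and the suffix, and skip hits that end with a whitelisted domain written out cleanly. Everything here is an assumption for illustration: the class name `LinkCensor`, the three-entry suffix list, the separator rules, and the expectation that whitelist entries are lowercase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkCensor {
    // Hypothetical, deliberately short suffix list; extend it as needed.
    static final List<String> SUFFIXES = Arrays.asList("com", "net", "org");

    // A separator is mandatory: "(.)", "dot" or "(dot)", or any single symbol such as "." or "%".
    static final String SEPARATOR = "(?:\\(\\s*\\.\\s*\\)|\\(?\\s*dot\\s*\\)?|[^\\w\\s])";

    static final Pattern LINK = buildPattern();

    static Pattern buildPattern() {
        StringBuilder suffixes = new StringBuilder();
        for (String suffix : SUFFIXES) {
            if (suffixes.length() > 0) suffixes.append('|');
            // Allow whitespace between the letters, so "c o m" still counts as "com".
            suffixes.append(String.join("\\s*", suffix.split("")));
        }
        // Name (letters and digits, optionally spaced or hyphenated), separator, suffix, word boundary.
        String regex = "[a-z0-9](?:[\\s-]*[a-z0-9])*\\s*" + SEPARATOR
                + "\\s*(?:" + suffixes + ")\\b";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // A hit is allowed only if it ends with a whitelisted domain that starts the hit
    // or follows whitespace, so "notgoogle.com" does not pass for "google.com".
    static boolean isWhitelisted(String hit, Set<String> whitelist) {
        String lower = hit.toLowerCase(Locale.ROOT);
        for (String allowed : whitelist) {
            int start = lower.length() - allowed.length();
            if (lower.endsWith(allowed)
                    && (start == 0 || Character.isWhitespace(lower.charAt(start - 1)))) {
                return true;
            }
        }
        return false;
    }

    // Replaces every non-whitelisted hit with asterisks of the same length.
    static String censor(String text, Set<String> whitelist) {
        StringBuilder out = new StringBuilder();
        Matcher m = LINK.matcher(text);
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            String hit = m.group();
            if (isWhitelisted(hit, whitelist)) {
                out.append(hit);
            } else {
                for (int i = 0; i < hit.length(); i++) out.append('*');
            }
            last = m.end();
        }
        return out.append(text, last, text.length()).toString();
    }
}
```

With an empty whitelist, each of the five examples above is replaced entirely by asterisks, while "come" on its own is left alone because no separator precedes it. The trade-off is that allowing spaces inside the name makes the match absorb preceding words (in "visit notgoogle.com" the word "visit" is censored too), and innocent text where a symbol happens to precede a listed suffix will also be flagged.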

Agentleader1
  • possible duplicate of [What is a good regular expression to match a URL?](http://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url) – dilix Jun 23 '15 at 15:47
  • Mandatory quote: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." -- Jamie Zawinski – technophobia Jun 23 '15 at 15:47
  • @dilix that's not the same question. – kervin Jun 23 '15 at 16:01
  • @kervin Yes, it's not exactly the same question, my mistake. The link might still help the asker adjust the regex to its purpose because, as I understand it, the main problem is distinguishing 'com' from 'come'. – dilix Jun 23 '15 at 16:04
  • @dilix This is not the same question, as I want a regex for URLs AND for attempts to circumvent the censor. People will try to get around it, so I want a robust way to detect and censor circumvention attempts, not only _valid_ URLs. – Agentleader1 Jun 23 '15 at 17:29

1 Answer


You could use a simple function such as this:

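private String hideLink(String link) {
    String[] split = link.split("\\.");
    // Expect at least three dot-separated parts, e.g. "www.google.com".
    if (split.length < 3) {
        return link;
    }
    StringBuilder output = new StringBuilder(split[0]).append('.');
    // Mask every character of the second part (the domain name).
    for (int i = 0; i < split[1].length(); i++) {
        output.append('*');
    }
    // Keep all remaining parts (the suffix) unchanged.
    for (int i = 2; i < split.length; i++) {
        output.append('.').append(split[i]);
    }
    return output.toString();
}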

calling

hideLink("www.google.com");

returns www.******.com

calling

hideLink("www.msn.net");

returns www.***.net

calling

hideLink("http://abc.12345.org");

returns http://abc.*****.org

etc...
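
A variant that does not depend on a leading prefix, offered only as a sketch: it masks whichever label sits immediately before the last dot. The names `HideLinkVariant` and `hideDomain` are hypothetical, and the sketch assumes the input is a bare host, optionally preceded by a scheme, with no path or query and a single-part suffix (so "google.co.uk" would not be handled correctly).

```java
final class HideLinkVariant {
    // Masks the label immediately before the last dot-separated part.
    static String hideDomain(String link) {
        int lastDot = link.lastIndexOf('.');
        if (lastDot <= 0) {
            return link; // no dot, or a leading dot: nothing to mask
        }
        // The label starts right after the previous dot or slash, whichever comes later.
        int start = Math.max(link.lastIndexOf('.', lastDot - 1),
                             link.lastIndexOf('/', lastDot - 1)) + 1;
        StringBuilder out = new StringBuilder(link.substring(0, start));
        for (int i = start; i < lastDot; i++) {
            out.append('*');
        }
        return out.append(link.substring(lastDot)).toString();
    }

    public static void main(String[] args) {
        System.out.println(hideDomain("google.com"));
        System.out.println(hideDomain("http://abc.12345.org"));
    }
}
```

Under those assumptions, hideDomain("google.com") gives ******.com and hideDomain("http://abc.12345.org") gives http://abc.*****.org.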

Galax
  • Thanks for the help, but there are too many other domain suffixes for me to list them all like that. And someone advertising won't always start with a www. ;) – Agentleader1 Jun 23 '15 at 17:27
  • There is another way to do it that would capture them all, I will write it up for you. – Galax Jun 23 '15 at 17:28
  • Edited the post; hopefully you can get something you need from it. – Galax Jun 23 '15 at 17:50