
I want to remove all special characters, as well as some restricted words, from input text.

The items to remove are supplied dynamically: the user decides what to exclude, which is why I have not hard-coded a regex. In production, restricted_words_list (see my code) will come from the database; for demonstration purposes I keep it in a static String array so I can confirm whether my code works.

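import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestKeyword {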

    private static final String[] restricted_words_list={"@","of","an","^","#","<",">","(",")"};

    private static final Pattern restrictedReplacer;

    private static Set<String> restrictedWords = null;

    static {

        StringBuilder strb= new StringBuilder();

        for(String str:restricted_words_list){
            strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
        }

        strb.setLength(strb.length()-1);
        restrictedReplacer = Pattern.compile(strb.toString(),Pattern.CASE_INSENSITIVE);

        strb = new StringBuilder();    
    }

    public static void main(String[] args)
    {
        String inputText = "abcd abc@ cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
        System.out.println("inputText : " + inputText);
        String modifiedText = restrictedWordCheck(inputText);
        System.out.println("Modified Text : " + modifiedText);

    }

    public static String restrictedWordCheck(String input){
        Matcher m = restrictedReplacer.matcher(input);
        StringBuffer strb = new StringBuffer(input.length());//ensuring capacity

        while(m.find()){
            if(restrictedWords==null)restrictedWords = new HashSet<String>();
            restrictedWords.add(m.group());  //m.group() returns what was matched
            m.appendReplacement(strb,""); //this writes out what came in between matching words

            for(int i=m.start();i<m.end();i++)
                strb.append("");
        }
        m.appendTail(strb);
        return strb.toString();
    }
}

The output is :

inputText : abcd abc@ cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D

Modified Text : abcd abc@ cbda ssef jjj the gg wh&at gggg ss%ss ### (()) DhD

Here the words of and an are removed, but only some of the special characters are removed, not all of the ones I specified in restricted_words_list.
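A likely explanation for the missing characters is how \b behaves next to non-word characters. The sketch below (the class and method names are hypothetical, not part of the original code) applies the same \b + Pattern.quote(...) + \b construction to a few fragments of the input:

```java
import java.util.regex.Pattern;

class BoundaryDemo {
    // Returns true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Same construction as in the static initializer of TestKeyword
        String at = "\\b" + Pattern.quote("@") + "\\b";
        String hash = "\\b" + Pattern.quote("#") + "\\b";

        // '@' is followed by a space: no word boundary after it
        System.out.println(found(at, "abc@ cbda"));  // false
        // '@' sits between two letters: boundary on both sides
        System.out.println(found(at, "abc@cbda"));   // true
        // '#' sits between two letters
        System.out.println(found(hash, "t#he"));     // true
        // '#' only has non-word neighbours
        System.out.println(found(hash, "###"));      // false
    }
}
```

This is consistent with the output above: t#he, g^g and ggg<g are cleaned, while abc@, ### and (()) are left alone.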


I have since found a better solution:

    String inputText = title; // input text
    // Stop words are loaded dynamically from the database
    // (getWordStopper() runs a query and returns the list of words)
    List<String> restricted_words_list = catalogueService.getWordStopper();
    StringBuilder finalResult = new StringBuilder();
    List<String> stopperCleanText = new ArrayList<String>();

    String[] afterTextSplit = inputText.split("\\s"); // split on whitespace

    for (int i = 0; i < afterTextSplit.length; i++) {
        stopperCleanText.add(afterTextSplit[i]); // collect the tokens in a mutable list
    }

    stopperCleanText.removeAll(restricted_words_list); // drop every token that equals a stop word

    for (String token : stopperCleanText)
    {
        finalResult.append(token).append(";"); // terminate each remaining token with a semicolon
    }

    return finalResult.toString();
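For reference, here is a self-contained sketch of the same split-and-removeAll idea. The class name, method name and the static stop-word list are assumptions made for illustration (the real list would come from the database), and unlike the snippet above it joins tokens with ";" instead of appending a trailing one:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TokenFilter {
    // Removes every whitespace-separated token that equals a stop word
    // and joins the remaining tokens with ';'.
    static String removeStopTokens(String input, List<String> stopWords) {
        List<String> tokens =
                new ArrayList<>(Arrays.asList(input.trim().split("\\s+")));
        tokens.removeAll(stopWords);
        return String.join(";", tokens);
    }

    public static void main(String[] args) {
        // Stand-in for the list that would be loaded from the database
        List<String> stopWords = Arrays.asList("of", "an", "@", "#");
        System.out.println(removeStopTokens("ssef of jjj t#he an # abc@", stopWords));
        // prints: ssef;jjj;t#he;abc@
    }
}
```

Note that this approach only drops tokens that match a stop word exactly; a restricted character embedded in a longer token (t#he, abc@) stays in place.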
Alejandro Galera
Rajesh Hatwar
  • It is doing exactly what you're asking it to do.. what would be your expected `Modified Text`? – Octoshape Nov 26 '13 at 12:48
  • No, it is not. See: input abc@ gives output abc@ even though restricted_words_list contains '@', and if I give only special characters it does not work either. Please check by executing the code if possible. – Rajesh Hatwar Nov 26 '13 at 13:17
  • Should the `###` be removed as well? Or only single instances of the `restricted_words_list`? – Octoshape Nov 26 '13 at 13:26
  • Whatever restricted_words_list contains should not be present in the result, i.e. in the modified text (whatever input it takes). – Rajesh Hatwar Nov 26 '13 at 13:30
  • Try my solution from below, I put an answer there. – Octoshape Nov 26 '13 at 15:11

5 Answers

1
public String replaceAll(String regex,
                         String replacement)

Replaces each substring of this string (which matches the given regular expression) with the given replacement.

Parameters:

  • regex - the regular expression to which this string is to be matched
  • replacement - the string to be substituted for each match.

So you just need to pass an empty String as the replacement parameter.
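A minimal sketch of that call, assuming the token to remove should be taken literally: because the first argument of replaceAll is a regular expression, characters such as ^ or ( need escaping, which Pattern.quote handles. The class and method names here are hypothetical.

```java
import java.util.regex.Pattern;

class ReplaceAllDemo {
    // Removes every literal occurrence of token from text.
    static String strip(String text, String token) {
        // Pattern.quote turns the token into a literal regex,
        // so metacharacters like '^' or '(' are safe to pass in
        return text.replaceAll(Pattern.quote(token), "");
    }

    public static void main(String[] args) {
        System.out.println(strip("g^g (()) abc@", "^")); // prints: gg (()) abc@
        System.out.println(strip("g^g (()) abc@", "(")); // prints: g^g )) abc@
    }
}
```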

Rann Lifshitz
RiadSaadi
0

You could use a regex directly to replace those special characters with an empty string. See the question "Java; String replace (using regular expressions)?" and this tutorial: http://www.vogella.com/articles/JavaRegularExpressions/article.html

Community
David Lau
  • Sorry, let me clarify. The words I need to exclude arrive dynamically, i.e. the user decides what to exclude, which is why I did not hard-code a regex. restricted_words_list (see my code) will come from the database; I kept it static only to check whether the code works. – Rajesh Hatwar Nov 26 '13 at 13:13
0

You can also do it like this:

    String inputText = "abcd abc@ cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
    // Remove everything except ASCII letters (both cases), digits and spaces
    String regx="[^a-zA-Z0-9 ]";
    String textWithoutSpecialChar=inputText.replaceAll(regx,"");
    System.out.println("Without Special Char:"+textWithoutSpecialChar);

    // Your restricted words; \\b keeps "of" and "an" from matching inside longer words
    String yourSetofString="\\b(of|an)\\b";
    String op=textWithoutSpecialChar.replaceAll(yourSetofString,"");
    System.out.println("output : "+op);

Output:

Without Special Char:abcd abc cbda ssef of jjj the gg an what gggg ssss   DhD

output : abcd abc cbda ssef  jjj the gg  what gggg ssss   DhD
SeeTheC
  • The words I need to exclude arrive dynamically, i.e. the user decides what to exclude, which is why I did not hard-code a regex. restricted_words_list (see my code) will come from the database; I kept it static only to check whether the code works. – Rajesh Hatwar Nov 26 '13 at 13:11
  • Not a problem. Build the regex dynamically from the words you read from the DB, then use that regex for the replacement. – SeeTheC Nov 27 '13 at 04:42
0
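import java.util.regex.Pattern;

String s = "abcd abc@ cbda ssef of jjj t#he g^g an wh&at ggg (blah) and | then";

// " of " and " an " carry their surrounding spaces so that only whole words match
String[] words = new String[]{ " of ", "|", "(", " an ", "#", "@", "&", "^", ")" };
StringBuilder sb = new StringBuilder();
for( String w : words ) {
    if( sb.length() > 0 ) {
        sb.append( "|" ); // separator between alternatives, no trailing "|"
    }
    // Pattern.quote escapes every token, whatever its length or content
    sb.append( Pattern.quote( w ) );
}
System.out.println( s.replaceAll( sb.toString(), "" ) );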
stan
  • The words I need to exclude arrive dynamically, i.e. the user decides what to exclude, which is why I did not hard-code a regex. restricted_words_list (see my code) will come from the database; I kept it static only to check whether the code works. – Rajesh Hatwar Nov 26 '13 at 13:11
0

You should change your loop

for(String str:restricted_words_list){
        strb.append("\\b").append(Pattern.quote(str)).append("\\b|");
}

to this:

for(String str:restricted_words_list){
        strb.append("\\b*").append(Pattern.quote(str)).append("\\b*|");
}

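This is because \b is a word boundary: it matches only between a word character (letter, digit or underscore) and a non-word character. With your loop, a special character from restricted_words_list is therefore matched only when it has a word character directly before and after it. In abc@ the @ is followed by a space, so there is no boundary after it and it is not replaced. If you add * (which means 0 or more occurrences) to the \\b on either side, the boundary becomes optional and things like abc@ are matched as well. Be aware that an optional boundary no longer constrains the match at all, so words such as of and an will then also be removed from inside longer words.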
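One way to keep whole-word matching for words while still removing symbols everywhere is to add \b only on the sides of a token that start or end with a word character. This is a sketch under that assumption, not the code from the question; the class and method names are hypothetical, and the static list stands in for whatever the database returns:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RestrictedFilter {
    // Same notion of "word character" that \b uses by default.
    static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Builds one alternation from a dynamic token list.
    static Pattern buildPattern(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // an empty token would match everywhere
            }
            if (sb.length() > 0) {
                sb.append('|');
            }
            // Anchor with \b only where the token itself has a word character
            if (isWordChar(token.charAt(0))) {
                sb.append("\\b");
            }
            sb.append(Pattern.quote(token));
            if (isWordChar(token.charAt(token.length() - 1))) {
                sb.append("\\b");
            }
        }
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    static String remove(String input, List<String> tokens) {
        return buildPattern(tokens).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        List<String> restricted =
                Arrays.asList("@", "of", "an", "^", "#", "<", ">", "(", ")");
        String input =
                "abcd abc@ cbda ssef of jjj t#he g^g an wh&at ggg<g ss%ss ### (()) D^h^D";
        System.out.println(remove(input, restricted));
    }
}
```

With this construction, of and an are removed only as whole words, while @, #, ^, <, > and the parentheses are removed wherever they occur. Characters that are not in the list, such as & and %, are left untouched, and the spaces around removed tokens remain.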

Octoshape