
I am collecting data from Twitter and processing it, but I have a problem: the text is dirty.

Example:

String dirtyText="this*is#a*&very_dirty&String";

Another example:

String dirtyText="All f dis happnd bcause u gave ur time, talent n passion.";

Please, I want the solution to be as simple as possible.

3 Answers


This is not an easy problem to solve. "All f dis happnd" could be cleaned to produce either "All *of* this happened" or "All *if* this happened". For the first example, you can simply replace all non-alphabetic characters with spaces. See this question for how to do that.
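The replacement suggested for the first example could be sketched roughly as follows. This is only a minimal sketch: the class name `NonAlphaToSpaces` and the helper `clean` are made up for illustration, and the pattern assumes that only ASCII letters should be kept.

```java
public class NonAlphaToSpaces {
    // Hypothetical helper: replaces every run of non-letter characters
    // with a single space, then trims spaces left at either end.
    static String clean(String dirty) {
        return dirty.replaceAll("[^A-Za-z]+", " ").trim();
    }

    public static void main(String[] args) {
        String dirtyText = "this*is#a*&very_dirty&String";
        System.out.println(clean(dirtyText)); // prints: this is a very dirty String
    }
}
```

Replacing whole runs (`+`) rather than single characters avoids double spaces where several symbols sit next to each other, as in `*&`.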

Otherwise, I think you would need a natural language processor, or at the very least a spell checker. Guessing what a tweet should say in correct English is an extremely complex problem. Take a look at Jazzy, an open-source spell checker.
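If a full spell checker is more than you need, a much cruder stand-in is a hand-built lookup table of known abbreviations. The sketch below is not a spell checker and does not use Jazzy. The class `SlangNormalizer`, the method `normalize` and every table entry are assumptions made for illustration. Note that the table has to pick one expansion per abbreviation, so it cannot resolve the "of" versus "if" ambiguity described above; mapping "f" to "of" is just a guess.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlangNormalizer {
    // Hypothetical lookup table; a real one would need far more entries.
    private static final Map<String, String> SLANG = new HashMap<>();
    static {
        SLANG.put("f", "of"); // a guess: could equally mean "if"
        SLANG.put("dis", "this");
        SLANG.put("happnd", "happened");
        SLANG.put("bcause", "because");
        SLANG.put("u", "you");
        SLANG.put("ur", "your");
        SLANG.put("n", "and");
    }

    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // Replaces each word found in the table; everything else
    // (unknown words, punctuation, spacing) is copied unchanged.
    static String normalize(String text) {
        Matcher m = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            String fixed = SLANG.getOrDefault(word.toLowerCase(), word);
            m.appendReplacement(out, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirtyText = "All f dis happnd bcause u gave ur time, talent n passion.";
        System.out.println(normalize(dirtyText));
        // prints: All of this happened because you gave your time, talent and passion.
    }
}
```

Any abbreviation missing from the table passes through untouched, which is why this only works for a small, known vocabulary.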

Samuel

public class CleaningDirtText {
    private static final String dirtyText = "this*is#a*&very_dirty&String";

    public static void main(String[] args) {
        System.out.println(dirtyText);
        /*
         * Lowercase the text, trim leading and trailing spaces, and split it into
         * a String array of words. split() breaks text apart on a delimiter; here
         * the delimiter is any run of non-word characters (\W), digits (\d) or
         * underscores. The underscore is listed explicitly because \W treats it
         * as a word character.
         */
        String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d_]+");
        for (String clean : words) {
            System.out.print(clean + " ");
        }
    }
}

  • Why are you posting your [answer](https://stackoverflow.com/a/68839324/6670491) twice? Can you [edit](https://stackoverflow.com/posts/68839282/edit) and explain why your code can solve OP's issue? – HardcoreGamer Aug 19 '21 at 06:08

public class CleaningDirtText {
    private static final String dirtyText = "this*is#a*&very_dirty&String";

    public static void main(String[] args) {
        /*
         * Lowercase the text, trim leading and trailing spaces, and split it into
         * a String array of words. split() breaks text apart on a delimiter; here
         * the delimiter is any run of non-word characters (\W), digits (\d) or
         * underscores. The underscore is listed explicitly because \W treats it
         * as a word character.
         */
        System.out.println(dirtyText);
        String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d_]+");
        // Print the words run together, with no separator.
        for (int i = 0; i < words.length; i++) {
            System.out.print(words[i]);
        }
        System.out.println("\nSee the cleaned text:");
        // Print the words separated by single spaces.
        for (String clean : words) {
            System.out.print(clean + " ");
        }
    }
}