
I have a string like:

TEST FURNITURE-34_TEST>

I need to remove all the junk characters from the above string, so my expected output is:

TEST FURNITURE-34_TEST

I have tried the code below:

public static String removeUnPrintableChars(String str) {
    if (str != null) {
        str = str.replaceAll("[^\\x00-\\x7F]", "");
        str = str.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");
        str = str.replaceAll("\\p{C}", "");
        str = str.replaceAll("\\P{Print}", "");
                    
        str = str.substring(0, Math.min(256, str.length()));
        str = str.trim();
        if (str.isEmpty()) {
            str = null;
        }
    }
    return str;
}

But it does nothing. Instead of finding each character and replacing it with an empty string, can anyone please help me with a generic solution to remove these kinds of characters from the string?

GURU Shreyansh
kaviya .P
  • What's your definition of a junk vs. non-junk character? – tgdavies Jul 19 '21 at 10:22
  • Seems like you are looking for the [String.trim](https://docs.oracle.com/javase/7/docs/api/java/lang/String.html#trim()) method. – Victor Gubin Jul 19 '21 at 10:26
  • At least for your example, you could cut off the `String` after the desired substring. – deHaar Jul 19 '21 at 10:28
  • There appears to be some seriously broken HTML encoder somewhere in the environment that produced that input. At the very least, it got applied multiple times to already encoded input. – Hulk Jul 19 '21 at 10:28
  • Side note: please reconsider the method name. None of the characters you want to remove are 'unprintable'. They are just various layers of encoded representations of the ampersand `&` character, i.e. `&amp;` and `&#38;`; see also [this question](https://stackoverflow.com/questions/2141799/amp-or-38-what-should-be-used-for-ampersand-if-we-are-using-utf8-in-xht) – Hulk Jul 19 '21 at 10:35
  • You need to describe the distinction between what you are trying to preserve and what you are trying to remove. It may be enough to split the String on the ampersand character and declare everything in the first token to be good and everything else bad. – vsfDawg Jul 19 '21 at 10:38
  • Actually, the given input probably started out as a single `>` character at the end, which got escaped to `&gt;`, and things escalated from there. If I counted correctly, it got encoded 48 times ^^ – Hulk Jul 19 '21 at 10:43
  • As a quick and dirty 'solution', you could decode this in a loop until the input no longer changes. – Hulk Jul 19 '21 at 10:45
  • Please clarify if you would want the likely original string `TEST FURNITURE-34_TEST>` in this case. – Hulk Jul 19 '21 at 10:48
  • @Hulk Thanks for your reply. I want only up to TEST FURNITURE-34_TEST. All the encoded characters need to be removed. – kaviya .P Jul 19 '21 at 10:51
  • Do you know whether they can only appear at the end? For instance, could you have a string "FOO&ampBAR" and want "FOOBAR"? Might you have a string "FOOamp;amp;" without the "&"? – tgdavies Jul 19 '21 at 10:54
  • @tgdavies Not only at the end. As you said, it may contain "FOO&ampBAR". – kaviya .P Jul 19 '21 at 10:57
  • `strg = strg.replaceAll("&|amp;", " ").replaceAll("\\s+", " ").split("#")[0].trim();` – DevilsHnd - 退職した Jul 19 '21 at 11:25
  • One could answer the question as stated by just taking the substring of the input from `0 to n`, where `n` is the first occurrence of `&`. You need more detail, including an exact description/list (without using regex) of the junk characters. – WJS Jul 19 '21 at 12:57
  • @DevilsHnd Thanks for the solution. But it returns null when I have a string like "#38TEST FURNITURE-34_TEST>" – kaviya .P Jul 19 '21 at 13:05
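
The "decode in a loop until the input no longer changes" idea from the comments could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, the entity table is a hand-picked assumption covering just `&amp;`, `&#38;`, `&gt;` and `&lt;`, and the sample is a shortened form of the string quoted in the first answer. A real HTML unescaping library would handle far more entities.

```java
// Hypothetical helper, not from the thread: repeatedly decodes a small,
// hand-picked set of HTML entities until the string stops changing.
class EntityLoopDecoder {

    // Applies each replacement once. A single pass may peel more than one
    // encoding layer (e.g. "&amp;gt;" becomes ">"), which is fine here
    // because the caller loops until nothing changes anyway.
    static String decodeOnce(String s) {
        return s.replace("&amp;", "&")
                .replace("&#38;", "&")
                .replace("&gt;", ">")
                .replace("&lt;", "<");
    }

    // Every successful replacement shortens the string, so the loop ends.
    static String decodeFully(String s) {
        String previous;
        do {
            previous = s;
            s = decodeOnce(s);
        } while (!s.equals(previous));
        return s;
    }

    public static void main(String[] args) {
        // Shortened, assumed-representative version of the multiply-encoded input.
        String input = "TEST FURNITURE-34_TEST&amp;amp;amp;#38;amp;#38;gt;";
        System.out.println(decodeFully(input)); // prints TEST FURNITURE-34_TEST>
    }
}
```

Note that this recovers the likely original string, including the trailing `>`. If that character is unwanted too, it would still have to be stripped in a separate step.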

3 Answers


A simple way is to split the string at the first ampersand:

public class Trim {
    public static void main(String[] args) {
        String myString = "TEST FURNITURE-34_TEST&"
                + "amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;#38;amp;amp;"
                + "#38;amp;#38;gt;";
        // Everything before the first '&' is the part to keep.
        String[] parts = myString.split("&");
        System.out.println(parts[0]);
    }
}

Link to original thread: How to split a string in Java

  • See @kaviya .P's last comment -- they could have &amp in the middle of their string. – tgdavies Jul 19 '21 at 11:13
  • My bad. If it's the same regex, you could just use String.replaceAll("amp;", ""), though it's probably not the cleanest way of solving this. Seems like the issue has more to do with whatever is generating the string. – CharlieWhisky Jul 19 '21 at 11:27

The sample strings you are presenting (within your post and in comments) are rather ridiculous and in my opinion, whatever is generating them should be burned....twice.

Try the following method on your string(s). Add whatever you like to have removed from the input string by adding it to the 2D removableItems String Array. This 2D array contains preparation strings for the String#replaceAll() method. The first element of each row contains a Regular Expression (regex) of a particular string item to replace and the second element of each row contains the string item to replace the found items with.

public static String cleanString(String inputString) {
    String[][] removableItems = {
        {"(&?amp;)+", " "},
        {"(#38);?", ""},
        {"gt;", ""}, {"lt;", ""}
    };

    String desiredString = inputString;
    for (int i = 0; i < removableItems.length; i++) {
        desiredString = desiredString.replaceAll(removableItems[i][0],
                                                 removableItems[i][1]).trim();
    }
    return desiredString;
}
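
As a quick check of how these replacement rules behave, here is a small, hedged demo. The class name `CleanStringDemo` is hypothetical, and the sample is a shortened form of the string from this thread, assumed to be representative of the real input.

```java
class CleanStringDemo {

    // Same replacement rules as the cleanString method above:
    // each row is {regex, replacement}.
    static String cleanString(String inputString) {
        String[][] removableItems = {
            {"(&?amp;)+", " "},
            {"(#38);?", ""},
            {"gt;", ""}, {"lt;", ""}
        };
        String desiredString = inputString;
        for (String[] item : removableItems) {
            desiredString = desiredString.replaceAll(item[0], item[1]).trim();
        }
        return desiredString;
    }

    public static void main(String[] args) {
        // Shortened, assumed-representative sample input.
        String sample = "TEST FURNITURE-34_TEST&amp;amp;amp;#38;amp;amp;#38;amp;#38;gt;";
        System.out.println(cleanString(sample));        // prints TEST FURNITURE-34_TEST
        System.out.println(cleanString("FOO&amp;BAR")); // prints FOO BAR
    }
}
```

Because runs of `amp;` are replaced with a space rather than an empty string, an entity in the middle of a string leaves a space behind ("FOO BAR", not "FOOBAR"). Change that replacement to `""` if the two halves should be joined.
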
DevilsHnd - 退職した

You can use this method. It relies on word boundaries (`\b`) to find the encoded fragments, each of which ends in a semicolon.

public static String removeUnPrintableChars(String str) {
    if (str != null) {
        str = str.replaceAll("(\\b&?\\w+;#?)", "");
    }

    return str;
}
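
To illustrate what this word-boundary pattern does and does not match, here is a hedged sketch. The class name `BoundaryRegexDemo` and the method name `strip` are hypothetical, and the sample is a shortened form of the string from this thread.

```java
import java.util.regex.Pattern;

class BoundaryRegexDemo {

    // Same pattern as in the answer above: an optional '&', a run of word
    // characters, a ';', and an optional trailing '#', starting at a word boundary.
    private static final Pattern ENTITY_FRAGMENT = Pattern.compile("\\b&?\\w+;#?");

    static String strip(String s) {
        return s == null ? null : ENTITY_FRAGMENT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // Shortened, assumed-representative sample input.
        String sample = "TEST FURNITURE-34_TEST&amp;amp;amp;#38;amp;amp;#38;amp;#38;gt;";
        System.out.println(strip(sample));         // prints TEST FURNITURE-34_TEST
        System.out.println(strip("FOO&amp;BAR"));  // prints FOOBAR
        // Caveat: any ordinary word followed by ';' is removed as well.
        System.out.println(strip("price; total")); // prints " total" (without the quotes)
    }
}
```

The last line shows the trade-off: the pattern is generic, but it cannot tell an encoded fragment from legitimate text that happens to end in a semicolon.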