I'm having trouble deleting the subsequence \u000 from my string.

First, I read a byte[] from my file into a string with String str = new String(bytes, "UTF8");. The resulting str equals \u0004Word, which means 4Word, where 4 is the length of the word Word. Now I need to convert it to a regular 4Word. replaceAll("\u000", ""), replaceAll("\\\\u000", ""), etc. don't work. How can I do that?

void FillingStorage() throws Exception{
    Path path = Paths.get(System.getProperty("db.file"));//that's my file
    byte[] data = Files.readAllBytes(path);
    String str = new String(data, "UTF8");
    System.out.println(str);
    String res = str.replaceAll("I don't know what to write here cos nothing I've tried works");
    return;
}

UPDATE! First, I fill my HashMap with Key -> Value and Key1 -> Value1. Then I write it to a file as bytes. When I convert it back to a string and print it, I see Key Value Key1 Value1 instead of 3Key 5Value 4Key1 6Value1. Surprisingly, if you inspect the string that I print, you will see something like \u0003Key \u0005Value etc., so it looks like my string contains these numbers but Java can't print them.

This is how I write my bytes to the file:

DataOutputStream stream = new DataOutputStream(new FileOutputStream(System.getProperty("db.file"), true));
for (Map.Entry<String, String> entry : storage.entrySet()) {
    byte[] bytesKey = entry.getKey().getBytes(StandardCharsets.UTF_8);
    stream.write((int)bytesKey.length);//it disappears!
    stream.write(bytesKey);
    byte[] bytesVal = entry.getValue().getBytes(StandardCharsets.UTF_8);
    stream.write((Integer)bytesVal.length);//disappears too!
    stream.write(bytesVal);
}
stream.close();
Maxim Gotovchits
  • What do you see when you print `str`? I am asking because I doubt that there is `\u000` in it, since you claim that `replaceALL("\\\\u000", "")` doesn't work. Or maybe you forgot to store the result of `replaceAll` in the `str` reference (strings are immutable, so the original string is not changed by `replaceAll`; a new string is created and returned). – Pshemo Oct 09 '14 at 17:43
  • could you paste your replaceAll code line? – Debasis Oct 09 '14 at 17:44
  • I see ` Words` with 1 space before the word. – Maxim Gotovchits Oct 09 '14 at 17:45
  • '\u000' is not the same as "\\\\u0000", the former is a single character and the latter is a String with more than a single character in it. – Luiggi Mendoza Oct 09 '14 at 17:45
  • `'\u000'` is an illegal Unicode escape. There have to be four digits after the `\u`, like `'\u0000'` or `'\u0004'`. – David Conrad Oct 09 '14 at 17:49
  • @DavidConrad yes, there's \u0004. – Maxim Gotovchits Oct 09 '14 at 17:51
  • Unrelated, but you should use `new String(data, StandardCharsets.UTF_8)` instead to avoid the `UnsupportedEncodingException` which can't actually happen with UTF-8. – David Conrad Oct 09 '14 at 17:52
  • Do you want to get only the ASCII text and remove everything else? You might look at this question: http://stackoverflow.com/questions/8519669/replace-non-ascii-character-from-string – Debasis Oct 09 '14 at 17:55
  • Can the string be over 127 characters long? If there was an extraneous character `\u0080` or greater at the beginning of the string, it would cause problems interpreting the data as UTF-8. You need to remove the length before you convert it to a string. – David Conrad Oct 09 '14 at 17:55
  • @DavidConrad I get the same string =( – Maxim Gotovchits Oct 09 '14 at 17:55
  • Of course you get the same string, but the overload of the constructor that takes a `Charset` doesn't throw an `UnsupportedEncodingException`. – David Conrad Oct 09 '14 at 17:56
  • @DavidConrad I found out the problem. Question is updated. – Maxim Gotovchits Oct 09 '14 at 18:21
  • @Debasis Question is updated – Maxim Gotovchits Oct 09 '14 at 18:23
  • That's weird, you [write](http://docs.oracle.com/javase/7/docs/api/java/io/DataOutputStream.html#write(int)) the length out before the key or the value, and it disappears, but when you read the data back in, there are these weird extra bytes whose values happen to correspond exactly to the lengths of the strings. – David Conrad Oct 09 '14 at 18:31
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/62774/discussion-between-maxim-gotovchits-and-david-conrad). – Maxim Gotovchits Oct 09 '14 at 18:35

1 Answer

First of all, your requirement does not call for regular expressions, so you should have used replace() instead.
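
As a minimal sketch of that point, assuming the string really does begin with the single control character U+0004 (the class and method names here are hypothetical, chosen only for illustration), a literal replace() is enough:

```java
class LiteralReplaceDemo {
    // Hypothetical helper: removes every occurrence of the control
    // character U+0004 using literal (non-regex) replacement.
    static String stripLengthMarker(String s) {
        return s.replace("\u0004", "");
    }

    public static void main(String[] args) {
        String str = "\u0004Word";           // one control character followed by "Word"
        String res = stripLengthMarker(str); // strings are immutable, so keep the result
        System.out.println(res);             // prints Word
    }
}
```

Note that this only removes the one length value 4; a different word length would produce a different control character, so this is not a general fix.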

Second, \uxxxx is Unicode escape syntax in Java source code, so it is not clear that you actually have the characters \ u 0 0 0 in your string; it is much more likely that your byte array simply starts with a single byte equal to 4, which is the string length.
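
The difference between the two cases can be checked directly. The following is a small illustration with hypothetical names, not code from the question: one string holds a single character with code 4, the other holds the six visible characters of the escape sequence.

```java
class EscapeDemo {
    // Hypothetical helper: numeric code of the first character.
    static int firstCharCode(String s) {
        return s.charAt(0);
    }

    public static void main(String[] args) {
        String oneChar = "\u0004Word";  // escape is translated to one control character
        String literal = "\\u0004Word"; // backslash, 'u', '0', '0', '0', '4', then "Word"

        System.out.println(oneChar.length());        // prints 5
        System.out.println(firstCharCode(oneChar));  // prints 4
        System.out.println(literal.length());        // prints 10
    }
}
```

Printing the length and the first character code of your own str would show which of the two situations you are in.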

In that case you should simply discard the initial byte from the array when converting to String, using the constructor which accepts offset and len arguments.
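
A rough sketch of this, assuming the file layout from the question (one length byte followed by that many UTF-8 bytes, repeated). The class and method names are hypothetical. decodeSingle handles an array holding exactly one record; decodeAll walks through several records in a row.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LengthPrefixedReader {
    // One record only: skip the leading length byte via offset and length.
    static String decodeSingle(byte[] data) {
        return new String(data, 1, data.length - 1, StandardCharsets.UTF_8);
    }

    // Several records: read a length byte, then exactly that many bytes.
    static List<String> decodeAll(byte[] data) throws IOException {
        List<String> fields = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int len;
            while ((len = in.read()) != -1) {  // read() returns -1 at end of input
                byte[] buf = new byte[len];
                in.readFully(buf);             // throws EOFException if the data is truncated
                fields.add(new String(buf, StandardCharsets.UTF_8));
            }
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {3, 'K', 'e', 'y', 5, 'V', 'a', 'l', 'u', 'e'};
        System.out.println(decodeAll(data)); // prints [Key, Value]
    }
}
```

Because write(int) stores only the low eight bits, this layout only works for keys and values of at most 255 bytes; writeInt and readInt would be one way to lift that limit.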

If you happen to indeed have all those chars in the string, again simply using substring to get rid of the initial 6 characters should be all you need.
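
For that less likely case, a short sketch (the helper name is hypothetical) that drops the six visible characters of the escape sequence only when they are actually present:

```java
class EscapedPrefixDemo {
    // Hypothetical helper: if the string starts with a literal backslash
    // followed by 'u' and is long enough, drop the first six characters.
    static String stripEscapedPrefix(String s) {
        if (s.startsWith("\\u") && s.length() >= 6) {
            return s.substring(6);
        }
        return s; // nothing to strip
    }

    public static void main(String[] args) {
        String str = "\\u0004Word"; // ten characters, starting with a real backslash
        System.out.println(stripEscapedPrefix(str)); // prints Word
    }
}
```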

Marko Topolnik