Firestore document with umlaut, two different "ö"

Question

My problem is that when I create a new document in Firestore with a name containing the umlaut "ö", the "ö" is stored differently than I expect. Can you compare both documents and tell me what the difference between these two "ö" characters is? In the first picture the "ö" is bigger than in the second picture. Because of that, my other functions (for example a search function that looks up the document by name) do not work for document names with an umlaut. I can't figure out the cause. I hope you can show me the right way to handle this. I don't want to replace the umlauts.

Should I decode the variable that I pass as the document name in my setup function?

First image:

First Picture

Second image:

Second Picture

Update:

I will explain a little more about my goal. I have an index.html upload form for uploading multiple images to Firebase Storage and writing the image URL and other information to Firestore. When I upload my image folder, I retrieve the path of each image from my system and split it to get only the folder name. I use this name as the document name in Firestore (this works for folders without an umlaut in the name). But when I type the same name to create a document through the Firebase console, or replace it with a variable text = "my string for foldername", the two do not match. I suspect the retrieved folder name uses a different encoding, for example for the letter "ö".

 var relpath = files[i].webkitRelativePath;
  folder = relpath.split("/");
  var foldername= "";
  //foldername = unescape(encodeURIComponent(folder[0]));
  foldername = folder[0];
  var storage = firebase.storage().ref().child('kitaDE/duesseldorf/'+foldername+'/'+files[i].name);
  //upload file
  var upload = storage.put(files[i]); // webkitRelativePath added
  //update progress bar
  upload.on(
    "state_changed",
    function progress(snapshot) {
      var percentage =
        (snapshot.bytesTransferred / snapshot.totalBytes) * 100;
        document.getElementById("progress").value = percentage;
    },
    function error() {
      alert("error uploading file");
    },
    function complete() {
      document.getElementById(
        "uploading"
      ).innerHTML += `${files[i].name} uploaded <br />`;
      
    },
  );

  db.collection("kitaDE").doc(foldername).set({
      image: [],
      id: "",
      active: true,
      title: "",
      street: "",
      zipcode: "",
      location: "",
      })

Update 2

I copied and pasted both the folder name and the name I entered directly in the Firebase console.

Foldername copied:

Am Köhnen

Name entered in the Firebase console with my keyboard:

Am Köhnen

They look the same to me. I ran my JavaScript code and logged the following values to the console.

  var relpath = files[i].webkitRelativePath;
  folder = relpath.split("/");
  var foldername= "";
  foldername = folder[0];
  var foldername2 = "Am Köhnen";
  var foldername3 = decodeURIComponent(escape(foldername2))

My result is the following screenshot. Console.log Output

You can see that the first name looks right, but the first and third output names do not match. They look the same but they are not; see the two pictures at the beginning of my post. Firestore handles the names differently.

To get a hex dump, I ran this command in the parent directory of the problematic one:

bash$ printf '%s\n' Am\ K*hnen | xxd
00000000: 416d 204b 6fcc 8868 6e65 6e0a Am Ko..hnen.

Without seeing your source data that provided these strings, there's not much we can do. You have two different strings - you're going to have to figure out where they came from and how to reconcile their differences. — Doug Stevenson, Dec 05 '20 at 23:46
If you can get the hex for each character, you can confirm Frank's answer. — Rick James, Dec 06 '20 at 01:57
Like the comments above already say, show the hex bytes of the problematic character. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors — tripleee, Dec 06 '20 at 06:54
I don't really know how to provide the information you need. The two folder names come from: 1. The folder "Am Köhnen" on my Mac, which I created myself. 2. A document "Am Köhnen" created through the Firebase console, or a variable holding the string "Am Köhnen". Names 1 and 2 do not match. — devreklim, Dec 06 '20 at 07:06
Copy and paste the literal folder names into your question. Ideally, also add something like the output of `printf '%s\n' Am\ k*hnen | xxd` so we can see the individual hex bytes. — tripleee, Dec 06 '20 at 07:45
@tripleee thanks for your reply. I add more information to my post. Are they the information which are you need? — devreklim, Dec 06 '20 at 10:26
No, an image of text is never appropriate. See also [don’t post images of code or error messages.](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question/285557#285557) And no, we can't guess the encoding from seeing just the rendering. It looks vaguely like the malformed one is [mojibake](https://en.wikipedia.org/wiki/Mojibake) resulting from taking a string which was already UTF-8 and assuming it's in Latin-1 and converting *that* to UTF-8. But only the actual bytes from the actual file names will properly reveal this. — tripleee, Dec 06 '20 at 10:35
Again, if you can `cd` into the folder and run `printf '%s\n' Am K*hnen | xxd` we can see the actual bytes in the folder name. — tripleee, Dec 06 '20 at 10:57
I put your commandline and get the following result: 00000000: 416d 0a4b 2a68 6e65 6e0a Am.K*hnen. — devreklim, Dec 06 '20 at 13:19
Sorry, that should be `Am\ K*hnen` with a backslash before the space. — tripleee, Dec 06 '20 at 15:21
@tripleee I get this output now: 00000000: 416d 204b 6fcc 8868 6e65 6e0a Am Ko..hnen. — devreklim, Dec 06 '20 at 16:47
So there you have it; `cc 88` is the UTF-8 encoding of U+0308 and so (at least that part of) Frank's answer is correct. But we still don't know why you think that's somehow incorrect. Probably your code should do Unicode normalization, or there is something more which you are not telling us. The mojibake is inconsistent with this explanation. — tripleee, Dec 06 '20 at 17:04
@tripleee I understand that the folder name on my Mac file system is UTF-8 encoded, great. That doesn't surprise me. But when I read this folder name via webkitRelativePath, put it in a variable, and use it to create a new document in Firestore (see my source code above), it uses a different "ö" than when I create the same name through the Firebase console. I can't imagine that creating it in the browser through the Firebase console uses a different charset? — devreklim, Dec 07 '20 at 06:47
If you can find a way to display the actual hex bytes from your Javascript code too, that should settle it. I'm not familiar enough with Javascript (let alone then React etc) to tell you how to do that, and I can't really reconcile the information we have at this point really. *Maybe* something is getting double-encoded, but the symptoms would typically look different then. — tripleee, Dec 07 '20 at 07:44
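One possible way to get the hex bytes from JavaScript, as asked for in the last comment, is the standard `TextEncoder` API. This is only a sketch: `utf8Hex` is a hypothetical helper name, and the two strings are written with explicit escapes so that the decomposed and precomposed forms are unambiguous. In the real code you would pass `foldername` instead.

```javascript
// Hypothetical helper: hex dump of a string's UTF-8 bytes, similar to xxd.
function utf8Hex(text) {
  const bytes = new TextEncoder().encode(text); // Uint8Array of UTF-8 bytes
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
}

// "o" followed by U+0308 (combining diaeresis), as in the xxd output above.
console.log(utf8Hex("Am Ko\u0308hnen")); // 41 6d 20 4b 6f cc 88 68 6e 65 6e

// Single code point U+00F6, the form a keyboard typically produces.
console.log(utf8Hex("Am K\u00F6hnen"));  // 41 6d 20 4b c3 b6 68 6e 65 6e
```

If the folder name read from `webkitRelativePath` prints `6f cc 88` where the typed name prints `c3 b6`, the two strings differ only in Unicode normalization.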

Frank van Puffelen · Answer 1 · 2020-12-06T15:41:43.173

1

There are multiple sequences that result in an ö character being displayed. One of them uses a single Unicode codepoint to represent the character (U+00F6), but the other actually uses a separate codepoint for the o and then another one for the umlaut (U+006F U+0308).
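A small sketch to make the two sequences visible in JavaScript. The helper name `codePoints` is made up for illustration; it only uses standard string methods.

```javascript
// Two ways to spell the same visible character "ö".
const precomposed = "\u00F6"; // one code point: o with diaeresis
const decomposed = "o\u0308"; // "o" followed by the combining diaeresis

// List the code points of a string in U+XXXX notation.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

console.log(codePoints(precomposed));               // [ 'U+00F6' ]
console.log(codePoints(decomposed));                // [ 'U+006F', 'U+0308' ]
console.log(precomposed === decomposed);            // false
console.log(precomposed.length, decomposed.length); // 1 2
```

Both strings render as ö, but a plain string comparison treats them as different.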

Also see:

The wikipedia page on combining characters
The wikipedia list of unicode characters

My first idea is that the two titles in your documents are written with different Unicode sequences.

I thought that Firestore would equate these two ways of writing, but I can't find anything in the documentation about that now. If it doesn't, then that would explain why a query that matches one of the codepoint combinations for ö doesn't match the other combination.
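If that is the cause, one option is to normalize the name yourself before using it as a document ID or search term, as suggested in the comments under the question. The sketch below assumes the folder name arrives in decomposed form (which the hex dump in the question suggests) and uses the standard `String.prototype.normalize`. `toDocumentId` is a hypothetical helper, not part of the Firebase SDK.

```javascript
// Hypothetical helper: bring a name into NFC (precomposed) form before it
// is used as a Firestore document ID or compared in a search.
function toDocumentId(name) {
  return name.normalize("NFC");
}

const fromFolder = "Am Ko\u0308hnen";  // decomposed, bytes 6f cc 88 in the hex dump
const fromKeyboard = "Am K\u00F6hnen"; // precomposed, as typically typed

console.log(fromFolder === fromKeyboard);                             // false
console.log(toDocumentId(fromFolder) === toDocumentId(fromKeyboard)); // true
```

In the code from the question this would mean calling `db.collection("kitaDE").doc(toDocumentId(foldername))` and applying the same helper to whatever string the search function uses, so both sides are compared in the same form.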

edited Dec 06 '20 at 15:41

answered Dec 06 '20 at 00:20

Frank van Puffelen


2

If the two are Unicode equivalent, they would normally also look the same. My suspicion is that something else is wrong (maybe the first one uses the joining diaeresis with a round character which is not a regular **o**, perhaps?) – tripleee Dec 06 '20 at 06:50
The joining diaeresis is not U+00A8 and the Unicode combining characters go *after* the base character. The proper combining diaeresis is [U+0308](http://www.fileformat.info/info/unicode/char/0308/index.htm) – tripleee Dec 06 '20 at 06:53
Hi Frank, you may be right, but I don't want the combining one. I have updated my first post and added some code and an explanation of where the names come from. I hope it sheds more light on the problem. – devreklim Dec 06 '20 at 06:57
The code is unimportant; the actual folder names you use is what matters here. – tripleee Dec 06 '20 at 06:58
Nope. The difference isn't in normalization but in [mojibake](https://en.wikipedia.org/wiki/Mojibake) `"Am Köhnen"` vs `"Am KÃ¶hnen"`. Here `Ã¶` = _UTF-8_ byte sequence `0xC3`, `0xB6` interpreted as _latin1_. – JosefZ Dec 06 '20 at 13:21
@JosefZ What should I do then? I would have expected the name of the folder on my Mac to have the same Unicode representation as the text that I write into the variable as a string. – devreklim Dec 06 '20 at 14:59
@devreklim please cf. notes on `charset` in [this answer](https://stackoverflow.com/a/11142784/3439404) to another question. – JosefZ Dec 06 '20 at 15:30
@tripleee Thanks for the correcting on the diaeresis code. And indeed: the characters in the screenshots look slightly different, which could point to one of them not being a lowercase `o`. – Frank van Puffelen Dec 06 '20 at 15:44
@FrankvanPuffelen Okay Frank, we can see the problem of the different characters that appear in Firestore. But what is the solution to avoid it? Should I change my folder name or my code, or add some code to encode the names correctly? What next steps can I take? – devreklim Dec 07 '20 at 07:16
I agree with tripleee. In addition: if a program handles equivalent codes differently, it is not handling Unicode. [This is a requirement of Unicode]. And fonts should handle them correctly – Giacomo Catenazzi Dec 07 '20 at 09:16
