5

I'm saving the recording of a set of sentences to a corresponding set of audio files.

Sentences include:

Ich weiß es nicht!
¡No lo sé! 
Ég veit ekki!

How would you recommend I convert each sentence to a human-readable filename that will later be served from an online server? I'm not sure yet which languages I might be dealing with in the future.

UPDATE:

Please note that two sentences can't clash with each other. For example:

É bär icke dej.
E bår icke dej.

can't resolve to the same filename as these will overwrite each other. This is the problem with the slugify function mentioned here: Turn a string into a valid filename?

The best I have come up with is to use urllib.parse.quote. However, I think the resulting output is harder to read than I would have hoped. Any suggestions?

Ich%20wei%C3%9F%20es%20nicht%21
%C2%A1No%20lo%20s%C3%A9%21
%C3%89g%20veit%20ekki%21
scrpy
Baz
  • Is it necessary that you are able to reconstruct the exact original name from the "escaped" file name? Otherwise I suppose you could just add suffixes for collisions... In any case, I know that is not your question, but you may want to consider a more bulletproof solution like using some UUID for the file names and having the associated sentences in a file/database/whatever. I find hard to imagine a rock-solid algorithm able to deal with any kind of Unicode input. – jdehesa Nov 28 '17 at 10:35
  • https://unix.stackexchange.com/questions/38055/utf-8-filenames If this answer is correct, why not write the sentences out exactly as is? If you want to use them for a purpose where it non-ascii characters aren't allowed, you can convert them at that time. – GVH Nov 28 '17 at 10:38
  • Not sure about your need, but if this concerns translations from, say, English, would naming the files like `_` (e.g. `I don't know_de_DE`) be ok for you? – Joël Nov 28 '17 at 10:41
  • @GVH: non ASCII filenames are a nightmare as soon as you try to exchange them with a different system, not speaking of zip files... – Serge Ballesta Nov 28 '17 at 10:43
  • Your examples look like valid file names to me. – Stop harming Monica Nov 28 '17 at 10:47

3 Answers

1

What about unidecode?

import unidecode
a = [u'Ich weiß es nicht!', u'¡No lo sé!', u'Ég veit ekki!']
for s in a:
    print(unidecode.unidecode(s).replace(' ', '_'))

This gives pure ASCII strings that can readily be processed if they still contain unwanted characters. Keeping spaces distinct in the form of underscores helps with readability.

Ich_weiss_es_nicht!
!No_lo_se!
Eg_veit_ekki!

If uniqueness is a problem, a hash or something like that might be added to the strings.

Edit:

Some clarification seems to be required with respect to the hashing. Many hash functions are explicitly designed to give very different outputs for similar inputs. For example, Python's built-in hash function gives:

In [1]: hash('¡No lo sé!')
Out[1]: 6428242682022633791

In [2]: hash('¡No lo se!')
Out[2]: 4215591310983444451

With that you can do something like

unidecode.unidecode(s).replace(' ', '_') + '_' + str(hash(s))[:10]

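to keep the strings reasonably short. Even with such shortened hashes, clashes are pretty unlikely. Be aware that Python 3 salts the built-in hash of strings per process (unless PYTHONHASHSEED is fixed), so the suffix changes from run to run, and a negative hash value puts a minus sign in it.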
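For a suffix that stays the same on every run, a digest from hashlib can replace the built-in hash. The following is only a sketch: ascii_fold and readable_filename are hypothetical helper names, and the fold uses the standard library's unicodedata as a rough stand-in for unidecode (it drops characters such as ß instead of transliterating them).

```python
import hashlib
import unicodedata


def ascii_fold(text):
    """Rough stdlib stand-in for unidecode (assumption, not the same output):
    strips combining accents and drops any other non-ASCII character."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def readable_filename(sentence, digest_chars=10):
    """Readable ASCII base plus a short, run-independent digest of the original."""
    base = ascii_fold(sentence).replace(" ", "_")
    # The digest is computed from the ORIGINAL sentence, so two sentences
    # that fold to the same ASCII text still get different suffixes.
    digest = hashlib.sha256(sentence.encode("utf-8")).hexdigest()[:digest_chars]
    return f"{base}_{digest}"


name_a = readable_filename("É bär icke dej.")
name_b = readable_filename("E bår icke dej.")
```

Both names start with the same readable part, E_bar_icke_dej., and differ only in the hex suffix. A 10-character prefix of a SHA-256 digest is not a strict guarantee of uniqueness, only a very small collision probability.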

piripiri
  • How might the hash be used to distinguish between "¡No lo sé!" and "¡No lo se!"? – Baz Nov 28 '17 at 11:58
  • Added clarification to my answer. – piripiri Nov 28 '17 at 12:14
  • I need the filenames to be readable as specified in my question. This is because it makes it easy for me to find a file. For example, if there is a file with a glitch in it which I need to fix, I want to easily find this file in a file system. – Baz Nov 28 '17 at 13:05
  • As an alternative to hashes, you could create a string containing only the characters that were replaced, convert them to base64, and append that. Guaranteed to be collision free I think. Although collision is absurdly unlikely to begin with, so whatever. You could also check for the collision beforehand with `os.path.exists()` and omit the unique identifier unless it's necessary. – GVH Nov 29 '17 at 22:28
0

You should probably try to convert spaces into another symbol, making your string look like É-bär-icke-dej.

If you're using Python, I would do it like this.

  • Replace spaces with another symbol like (-) or (_); avoid (/), which is not allowed in file names

mystring.replace(' ','-')

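  • If your input arrives as bytes, detect its character encoding using chardet, a Python package that detects encodings.

  • Decode the bytes into a string using Python's decode method (a Python 3 str is already Unicode and needs no decoding)

mystring = mystring.decode(detected_encoding)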

  • Check whether the file name is already in your directory using Python's os package, something like

import os

files = os.listdir(path_to_directory)
# count how many existing file names contain this name
redundance = 0
for name in files:
    if mystring in name:
        redundance += 1

  • Append redundance to your string

if redundance != 0:
    # redundance is an int, so convert it before concatenating
    mystring = mystring + str(redundance)

  • Use your string as a file name!
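The collision check in the steps above can also be done with an exact match instead of a substring count. A minimal sketch, assuming a hypothetical helper unique_filename and a .wav extension (neither comes from the steps above):

```python
import os
import tempfile


def unique_filename(directory, stem, extension=".wav"):
    """Return a file name that does not exist yet in `directory`.

    Tries `stem + extension` first, then `stem-1`, `stem-2`, ... until free.
    """
    candidate = stem + extension
    counter = 1
    while os.path.exists(os.path.join(directory, candidate)):
        candidate = f"{stem}-{counter}{extension}"
        counter += 1
    return candidate


# Demo in a throwaway directory: the second request for the same stem
# gets a numeric suffix because the first file now exists.
with tempfile.TemporaryDirectory() as folder:
    first = unique_filename(folder, "E-bar-icke-dej")
    open(os.path.join(folder, first), "wb").close()
    second = unique_filename(folder, "E-bar-icke-dej")
```

The suffix depends on the order in which the files are created, so a separate record of which sentence produced which file is still needed to tell the two apart later.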

Hope this helps!

Vincent Pakson
0

The only disallowed characters in traditional Unix / Linux file names are slash (/ U+002F) and the null character (U+0000). There is no need to convert your example human-readable strings to anything else.

If you need to make the files available to systems which do not use the same file name encoding, such as for downloading over FTP or from a web server, you may want to expose them as explicitly UTF-8. On most modern U*xes, this should be the default out of the box anyway. This corresponds to the results you get from urllib quoting, where percent-encoding is a safe and reasonably standard way of producing a machine-readable and unambiguous representation of the encoded name. If you embed these in a snippet of HTML or something similar, you can keep the display text human-readable and keep only the link machine-readable.

<a href="%C3%89g%20veit%20ekki%21">Ég veit ekki!</a>
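A link like the one above can be generated with the standard library. This is a sketch only: audio_link is a hypothetical helper name, and passing safe="" (so that a slash inside a sentence is also percent-encoded) is my assumption, not something stated above.

```python
from html import escape
from urllib.parse import quote, unquote


def audio_link(sentence):
    """Build an HTML link: percent-encoded href, human-readable link text."""
    # quote() percent-encodes the UTF-8 bytes of the sentence; safe="" also
    # encodes "/" so it cannot be mistaken for a path separator.
    href = quote(sentence, safe="")
    return f'<a href="{escape(href)}">{escape(sentence)}</a>'


link = audio_link("Ég veit ekki!")
# The encoding is reversible, so the original sentence can be recovered.
round_trip = unquote(quote("Ég veit ekki!", safe=""))
```

Because unquote reverses quote exactly, two different sentences can never map to the same encoded name.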
tripleee