In a trie for inserting and searching normal paths, are ascii 1-31 worth considering?

Question

I am working on a trie data structure which inserts and searches for normal paths.

A path can contain any character from unicode, so in order to represent it completely in utf-8, the array in trie needs to contain next nodes for all 256 ascii.

But I am also concerned about the space and insertion time taken by trie. The conditions under which my trie is setup rarely would insert a character of unicode(I mean 128-255 ascii). So I just put an if condition to reject paths which contain above ascii 127. I don’t think the ascii 1-31 are relevant either, although I am unsure about this. As 1-31 chars are like carriage return, esc etc, can I simply continue the loop without inserting them? Like is it possible to encounter paths that are actually differentiable because of ascii 1-31 in a real scenario?

0 is the UTF-8 string terminator. There can't be anything after that. As for 1-31, I don't know. — Nikos C., Jun 09 '19 at 19:35
It's rather unlikely that a path contains any of those characters (especially since you changed your question to only go up to `31`). — Some programmer dude, Jun 09 '19 at 19:36
@Someprogrammerdude, if in case I encounter a path with those, will it be wise to insert that path eliminating that particular character? — Mihir Luthra, Jun 09 '19 at 19:38
For those who also wondered what a trie is: https://en.wikipedia.org/wiki/Trie — klutt, Jun 09 '19 at 19:50
The "worth" of considering those depends on how general your code needs to be. Most people most of the time don't use those characters in file names. When they appear, they're usually the result of an accident. However, they can appear in file names. If your code must work regardless, then you should handle them too. If your code only has to work most of the time, then you don't have to handle them if it makes life harder — but you need to identify how much harder it is. The chances are it is trivial to handle them, for a small cost in space. — Jonathan Leffler, Jun 09 '19 at 20:13
Note that valid UTF-8 cannot contain bytes 0xC0 or 0xC1, nor 0xF5..0xFF. It becomes an interesting question whether file systems can contain names with those bytes in them. And again, whether it is worth ignoring those compared with handling them if they arise is open for discussion. See also [What characters are forbidden in Linux and Windows directory names?](https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names) — Jonathan Leffler, Jun 09 '19 at 20:17
macOS uses character 13 in a file name; when a custom icon is configured for a folder, a hidden file named “Icon^M” is placed in it. — Eric Postpischil, Jun 09 '19 at 20:23
@JonathanLeffler, thanks that link was really helpful. I never noticed macOS has case-insensitive paths. That helps me rule out 26 more characters. Also sadly the cost in space is not small, I am storing the data in mmap(2)’d file which internally implements a Ctrie data struct. The memory consumption isn’t much coz file gets deleted and recreated for each phase. But insertions are time consuming because lots of nodes get inserted and using an array of less size actually improves time taken. — Mihir Luthra, Jun 09 '19 at 20:30
@EricPostpischil, dude! thanks. All this time I thought its carriage return. My code always reported 13 being used, lots of times. Thats the main reason I had to ask this question. — Mihir Luthra, Jun 09 '19 at 20:32
Do you expect to gain any advantage from discriminating some characters for trie insertion/search ? If it is about performance (time or space), I would suggest not doing so, the performance gain, if any, should be marginal. On the other hand, I think a trie that can deal with all cases, and all kind of chars it's defined for, is better than a trie that eventually runs 0.5% faster but will crash or bug on the day an unusual char appears. — m.raynal, Jun 11 '19 at 08:24
@m.raynal, as trie provides fast search, my main goal of using a trie is for speed optimisation. The environment into which I am introducing my data structure makes socket communication to a server which queries database. I am basically caching the received data and also giving a shared memory to all processes that make server calls. So even if the path isn’t found in the data struct, it actually would just mean a cache miss kinda thing. — Mihir Luthra, Jun 12 '19 at 04:33
What do you plan on using the trie for ? Can you share your implementation (there might be some places which provide more room for optimization than the alphabet size) ? — m.raynal, Jun 12 '19 at 07:53

score 0 · Accepted Answer · answered Aug 10 '19 at 15:05

Answering this old question, on macOS ascii 13 is used to represent custom icons which may appear in many paths. Thanks to @EricPostpischil who told that in comments.

All other characters ranging between 1-31 appear pretty less in paths.

Also, macOS users mostly have a case-insensitive path, so generally considering both lowercase and uppercase is also useless.

PS:

Although this question seems to be opinion based, but it actually isn't because it can be answered quite concisely. It attempts to ask for frequency of appearance of characters in paths on macOS. (sorry for the confusing title, I was a noob that time, changing it now will make all comments on it absurd)

In a trie for inserting and searching normal paths, are ascii 1-31 worth considering?

1 Answers1