105

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.

I will delimit them using a character.

Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.

Rahul
  • 67
    Please don't be embarrassed. You should ignore all the people who say "ooh, that's a crap way, do this instead". It's not for responders to question why, it's for them to answer how. I don't care why you're in this position. I've been in a few myself. Good luck! – Iain Holder Jan 29 '09 at 15:43
  • 2
    I had this same issue, and I went with pipe before googling or stack overflowing, because I liked the way it looked: ---|---- like a skinny person. –  Apr 17 '12 at 00:46
  • 1
    It depends on the kind of text. Some kinds of text rarely use tab characters, so I often go with that. But other kinds of text, including source code, often do use them. Can't you do some stats on your source text? Can't you add escape characters into your source text and thereby use anything you like as delimiter? – hippietrail Aug 17 '12 at 14:07
  • 1
    Not asking and not trying is much worse than being embarrassed about asking any kind of question. I am here for the answer to the same question, and I am proud that some other people share the same problem with me :) – Teoman shipahi Aug 09 '13 at 16:42
  • 1
    For those who might have a `|` in their text, I actually had such a case where I needed to keep the character count down as much as possible. Since most fields were strings with interesting text, CSV didn't work due to too much escaping. Our field delimiter is `/|`. The slash is only moderately common, but paired with a pipe you never run into it. I've been using an engine that gets a lot of data passed through it every day. This has never broken, and I've never needed to encapsulate a single string or escape a special char. On average, this mechanism has saved us a few percent of text. – RLH Oct 29 '13 at 15:58

17 Answers

65

I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)

In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.

ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group).  These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.

Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.

If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
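As a rough illustration of the idea (a minimal Python sketch, not anything from a standard library; the helper names `pack` and `unpack` are made up here), fields can be joined with US and records with RS. The sketch assumes you would rather reject input that already contains a separator than escape it:

```python
# ASCII control characters reserved for delimiting data.
UNIT_SEP = "\x1f"    # US, ASCII 31: separates units (fields) within a record
RECORD_SEP = "\x1e"  # RS, ASCII 30: separates records


def pack(records):
    """Join a list of records (each a list of strings) into one string."""
    for record in records:
        for field in record:
            # Assumption: reject, rather than escape, fields that already
            # contain one of the separator characters.
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(UNIT_SEP.join(record) for record in records)


def unpack(packed):
    """Split a packed string back into a list of records."""
    return [record.split(UNIT_SEP) for record in packed.split(RECORD_SEP)]


rows = [["Billy", "Car, red", "3"], ["Ann", "Bike | blue", "1"]]
packed = pack(rows)
assert unpack(packed) == rows
```

Commas and pipes in the text need no special treatment here, because only the two control characters carry meaning.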

sadolit
Edwin Buck
  • 5
    wow, thank you. this was exactly what I was searching for. – Theunis Aug 13 '20 at 10:39
  • I'm doing a console dashboard which has some CSV-like processing in the pipes, and this came like a miracle to my rescue! Thank you! – João Ciocca Oct 24 '22 at 22:05
  • This is great. It basically means that it's possible to write well formatted 4D tables in ASCII, without having to escape anything, and without hoping that user data will not include the separators. User data won't include `0x1C-0x1F`, right? – Eric Duminil Dec 01 '22 at 06:52
  • 1
    @EricDuminil If you sanitize the user data to only include printable characters, then `0x1C-0x1F` will be excluded (along with a lot of other items). Without some form of data sanitization, creative users could slip in these characters, even though I doubt they normally would. It would be the text equivalent of a SQL injection attack. Note that your 4 dimensions of data don't imply that the dimensions would be uniform. One could have a record with more units than the record that came before it, or fewer units. – Edwin Buck Dec 02 '22 at 01:04
39

Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.

The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.

I personally think I'd go for | (pipe) if given a choice but going with real data is safest.

And whatever you do, make sure you've worked out an escaping scheme!
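The character count described above could look something like this in Python. It is only a sketch: the sample strings are invented, and restricting the candidates to printable, non-space ASCII (33 to 126) is my reading of the question rather than part of the method.

```python
from collections import Counter


def char_counts(samples):
    """Count how often each character occurs across all sample strings."""
    counts = Counter()
    for text in samples:
        counts.update(text)
    return counts


def unused_printable_ascii(samples):
    """Return printable, non-space ASCII characters absent from the samples."""
    counts = char_counts(samples)
    return [chr(code) for code in range(33, 127) if counts[chr(code)] == 0]


# Hypothetical sample data; substitute a representative slice of real text.
samples = ["a|b, c", "2^10 = 1024", "~/notes.txt"]
candidates = unused_printable_ascii(samples)
```

Any character left in `candidates` is safe for this sample only, which is why an escaping scheme is still worth having.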

Nick Fortescue
  • I wouldn't go ridiculing here. In a magento 2 product export they merge a number of attributes into a single column of csv called `additional_attributes`. – tread Mar 09 '17 at 07:38
  • 1
    Why don't you just replace all the tab characters in text with four spaces and use a tab character `\t` as the delimiter? – Elie G. Jan 30 '18 at 16:21
  • There are perfectly good reasons to use something other than CSV: CSV files are not easily compatible with unix tools like cut and awk due to quoted fields. A single character which doesn't occur elsewhere and can be typed can easily be preferable. Recognising this is not "embarrassing". – Chris L. Barnes Mar 16 '21 at 10:30
  • Thanks for the reminder, came to a question about separators and found what I was looking for, an ideal escape character for a custom protocol I'm implementing. – Luctins Jun 22 '23 at 14:08
25

When using different languages, this symbol: ¬

proved to be the best. However I'm still testing.

Icarin
  • 1
    I like this idea, but I'm curious whether you're able to take a file containing strings like "Billy"¬"Car"¬"Red"¬"Garage"¬"3" and use cut on it (i.e. $ cut -d"¬" -f1 myfile.delim). – blehman Nov 06 '13 at 19:32
  • I added this question to stack here: http://stackoverflow.com/questions/19821639/using-cut-in-bash-on-a-file-with-a-unique-deliminter – blehman Nov 06 '13 at 20:02
  • the `not sign` with charcode 172 is not ascii, but cp-1252 – milahu May 10 '21 at 06:06
22

Probably | or ^ or ~. You could also combine two characters.

SQLMenace
17

You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.

(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)

If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (@ or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.

Jason S
  • 14
    The standard ASCII code table does include four control codes specifically designed for this purpose, as mentioned by Jason S above. They are: `28 FS` File Separator, `29 GS` Group Separator, `30 RS` Record Separator, `31 US` Unit Separator. Unfortunately, pretty much no one uses them although that is exactly what they were intended for. Personally I detest CSV format files because so many people don't think things through and make a mess that us programmers have to deal with if we want to support their file formats. – deegee Sep 30 '13 at 22:49
  • 3
    @deegee this is probably the best answer here. Unless the data contains binary or non-standard ascii/unicode then this will always work in any language. You should turn this into a regular answer. – dhj Jun 22 '14 at 20:14
  • @rahul do you have the powers to mark this as the accepted answer ? Most useful when dealing with user input data full of rubbish. Note to others: ALT+31 to get US (0x1F) in Windows. – golfalot Oct 12 '19 at 19:50
16

How about using a CSV-style format? Characters can be escaped in a standard CSV format, and there are already a lot of parsers written.

GEOCHET
Alex Fort
  • I like this better than my idea. +1. – Iain Holder Jan 29 '09 at 15:40
  • I think a comma counts as common character in normal text. If it were as simple as using CSV I doubt there'd be a need to ask the question... – Jay Jan 29 '09 at 15:43
  • CSV deals with commas in normal text as well as a few other issues. So it doesn't matter that there is a comma already in the text. IIRC it puts text in quotes and escapes quotes. – Jeremy French Jan 29 '09 at 15:46
  • @Jeremy: exactly right. Here's a wikipedia article mentioning how the escaping scheme works: http://en.wikipedia.org/wiki/Comma-separated_values – rmeador Jan 29 '09 at 16:28
  • 1
    To put it bluntly: CSV will deal with all those issues which you didn't think of and make sure that you won't have to fix your "solution" every two weeks because it breaks due to some unforeseen input. – Aaron Digulla Jan 29 '09 at 16:40
  • I was assuming (perhaps wrongly) that the data is not escaped and for some reason there's inadequate control over the data source to ensure it will be properly escaped. Otherwise it's always preferable to use an existing library of course. – Jay Jan 29 '09 at 17:59
9

Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.

Jay
9

For fast escaping I use stuff like this: say you want to concatenate str1, str2 and str3. What I do is:

delimitedStr = str1.Replace("@", "@a").Replace("|", "@p") + "|"
             + str2.Replace("@", "@a").Replace("|", "@p") + "|"
             + str3.Replace("@", "@a").Replace("|", "@p");

then to retrieve original use:

splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("@p","|").Replace("@a","@");
str2=splitStr[1].Replace("@p","|").Replace("@a","@");
str3=splitStr[2].Replace("@p","|").Replace("@a","@");

Note: the order of the replacements is important. When encoding, escape "@" before "|"; when decoding, restore "|" before "@".

It's unbreakable and easy to implement.
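For comparison, here is the same scheme as a small Python sketch that works for any number of items. The function names `join_escaped` and `split_escaped` are just illustrative, and the `@a` / `@p` codes are carried over from the snippet above.

```python
DELIM = "|"  # delimiter between items
ESC = "@"    # escape character


def join_escaped(items):
    """Escape each item, then join with the delimiter."""
    # Escape the escape character first, then the delimiter.
    return DELIM.join(
        item.replace(ESC, ESC + "a").replace(DELIM, ESC + "p")
        for item in items
    )


def split_escaped(packed):
    """Split on the delimiter, then undo the escaping of each part."""
    # Reverse order: restore the delimiter first, then the escape character.
    return [
        part.replace(ESC + "p", DELIM).replace(ESC + "a", ESC)
        for part in packed.split(DELIM)
    ]


items = ["a|b", "x@p", "", "@@|"]
assert split_escaped(join_escaped(items)) == items
```

One thing to watch: an empty list and a list holding one empty string both join to the empty string, so if that distinction matters you would need to store the item count separately.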

Mohammad Amin
  • 3
    This really is the best answer here, and the only correct one imo. It is the only answer which can't be broken. All other answers only lower the probability of the input breaking the format, but this is a very very poor approach. The selected answer rightly speaks of using an escape scheme like this - but once you do the choice of delimiter is essentially irrelevant. – Alfie Feb 16 '16 at 14:37
  • Delimiter is not quite irrelevant. If you pick a common character - say a space or the letter "e" - your escaped string is going to get quite long indeed, and hard to read. Best to pick an uncommon character, which is why I still prefer the pipe symbol for this kind of thing. – fool4jesus Jul 25 '16 at 19:09
3

Pipe for the win! |

Eppz
3

We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.

Joe
2

Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.

Jackson
  • The less commonly used broken bar ¦ (0xA6, not the 0x7C of the ordinary pipe) was a perfect fit ¦ I had a somewhat broken .csv with ~10k rows and a lot of ordinary "|" pipes in the values, and needed a fast fix with one (1) character to be able to import it into a Google Sheet – K. Kilian Lindberg Mar 09 '22 at 13:46
1

Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.

1

I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
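A sketch of that doubling loop in Python follows; `pick_delimiter` is a made-up name and `&,` is just the seed suggested above. One caveat the sketch guards against: even when no item contains the delimiter, an item that ends with part of it can create an earlier false match once everything is joined, so the result is round-tripped before it is trusted.

```python
def pick_delimiter(items, seed="&,"):
    """Double the seed until no item contains it, then verify a round trip."""
    delimiter = seed
    # The loop ends once the delimiter is absent from every item
    # (at the latest when it is longer than the longest item).
    while any(delimiter in item for item in items):
        delimiter += delimiter
    # Boundary check: an item ending in a piece of the delimiter (for
    # example "a&," followed by "&,&,") can make split() cut too early.
    # An empty list also fails this check, since "" splits to [""].
    if delimiter.join(items).split(delimiter) != items:
        raise ValueError("delimiter is ambiguous for these items")
    return delimiter


assert pick_delimiter(["a", "b"]) == "&,"
assert pick_delimiter(["x&,y", "z"]) == "&,&,"
```

When the check fails, falling back to a different seed or to an escaping scheme is probably simpler than doubling further.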

1

This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64 encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings based on a character which isn't used in your Base64 charset.

I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.

CSV is probably a better idea for most situations, though.
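A minimal Python sketch of the Base64 approach, assuming UTF-8 text and the standard Base64 alphabet (A-Z, a-z, 0-9, `+`, `/`, `=`), which leaves `|` free to act as the delimiter. The helper names are made up for the example.

```python
import base64

DELIM = "|"  # not part of the standard Base64 alphabet


def pack_b64(items):
    """Base64-encode each item and join the results with the delimiter."""
    return DELIM.join(
        base64.b64encode(item.encode("utf-8")).decode("ascii")
        for item in items
    )


def unpack_b64(packed):
    """Split on the delimiter and decode each part back to text."""
    return [
        base64.b64decode(part).decode("utf-8")
        for part in packed.split(DELIM)
    ]


items = ["a|b", "commas, quotes \" and ~tildes~", ""]
assert unpack_b64(pack_b64(items)) == items
```

The cost is that the packed string is roughly a third longer than the raw bytes and no longer human-readable.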

Coxy
  • base64 encode is a simple solution, however the main reason CSV is used is because you don't have to reparse the text, by using base64 you might as well just invent your own format entirely. – rollsch Jan 23 '17 at 03:29
0

I've used double pipe and double caret before. The idea of a non-printable char works if you're not creating or modifying the file by hand. For quick random-access file storage and retrieval, field width is used. You don't even have to read the whole file; you're literally pulling from the file by reference. This is how databases do some storage, but they also manage the spaces between records and such. And it introduced the problem of max data element width. (In the original old days, indexes attached a header which was used to define the width of each element and its data type. Later they introduced compression with remapping chars.) This allows for a text file to get to about 1/8 the size in transmission. Variable-length char encoding for the win.
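To make the field-width idea concrete, here is a small Python sketch. The width of 12 and the function names are hypothetical. It assumes items never exceed the width and that trailing spaces in an item are not significant, since they are lost to the padding.

```python
FIELD_WIDTH = 12  # hypothetical fixed width, chosen for the example


def pack_fixed(items, width=FIELD_WIDTH):
    """Pad every item to the same width and concatenate, no delimiter."""
    for item in items:
        if len(item) > width:
            raise ValueError("item is wider than the field width")
    return "".join(item.ljust(width) for item in items)


def field_at(packed, index, width=FIELD_WIDTH):
    """Fetch one item by position without scanning the rest of the string."""
    start = index * width
    return packed[start:start + width].rstrip()


packed = pack_fixed(["pipe", "caret", "tilde"])
assert field_at(packed, 1) == "caret"
```

The offset arithmetic is what gives random access; the max-width limit mentioned above is the price.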

0

make it dynamic : )

announce your control characters in the file header

for example

delimiter: ~
escape: \
wrapline: $
width: 19

hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text

would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text

i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor
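Here is one way the sample above could be parsed in Python. This is my own sketch of the format as shown, not the code from the linked project. It assumes the header ends at the first blank line, that a body line is "wrapped" only when it is exactly `width` characters long and ends with the wrap character, that the body forms a single logical line, and that an escape sequence is never split across a wrap.

```python
def parse_container(text):
    """Parse a header-announced delimited text block into a list of strings."""
    header_text, _, body = text.partition("\n\n")
    header = {}
    for line in header_text.splitlines():
        key, _, value = line.partition(": ")
        header[key] = value
    delimiter = header["delimiter"]
    escape = header["escape"]
    wrapline = header["wrapline"]
    width = int(header["width"])

    # Undo the line wrapping: drop the trailing wrap marker on full lines.
    pieces = []
    for line in body.splitlines():
        if len(line) == width and line.endswith(wrapline):
            pieces.append(line[:-len(wrapline)])
        else:
            pieces.append(line)
    stream = "".join(pieces)

    # Split on unescaped delimiters; an escaped character is kept literally.
    fields, current = [], []
    chars = iter(stream)
    for ch in chars:
        if ch == escape:
            current.append(next(chars, ""))
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# The sample from above, as a raw string so the backslashes stay literal.
SAMPLE = r"""delimiter: ~
escape: \
wrapline: $
width: 19

hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text"""

fields = parse_container(SAMPLE)
```

With the sample input, `fields` holds the four strings listed above.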

milahu
0

I sometimes need to parse a collection of filenames that act as delimited information. Or I'm typing up a list in Notepad, and want it to be parsable. Commas just aren't a great choice unless you're quoting all values.

I also like it to be typeable from the keyboard if possible. Windows filenames can't contain pipes (|), so pipes are out if filename compatibility is needed. Additionally, it would be ideal if it was "web safe". This rules out @, = and # which had some potential (though they do show up in text as @name and #tag) as well as $, which also had some viability. Semicolons might seem like a good choice, but are far too common (smilies, and people use them in filenames instead of a colon). % had potential, but is used to URL encode characters like %20 etc.

Backtick is probably the best choice. I almost never see it, and when I do, it is used as an apostrophe, and can be replaced beforehand. But it also is an important character in Markdown, so a backtick-separated list will not play nice. I also like that I don't have to hold shift to type it as well (at least on a US keyboard).

Tilde is a respectable second choice. It's also almost never used, but does see use in certain kinds of "internet speak", so if you're delimiting body text from potential user data, you may want to escape it somehow.

Caret is worth considering as well, though it can sometimes be used in 'internet speak', especially in Asian countries, e.g. ^_^.

Exclamation can definitely show up in grammatical text, but is worth a mention.

If two-character delimiters (or three) can be used, more possibilities are opened.

Using brackets becomes viable. For example ][, }{, )(. Or you can duplicate the above ones, or mix and match, such as ~^ or ^~.

With three-character delimiters, I like one character between two spaces. For example, Artist - Song title can be reliably split using " - " (space, hyphen, space). But using other characters like the backtick can also work. The only concern might be typos, such as A ` B `C ` D.

So yeah many viable choices, none of them really 'standard', unless you store the delimiter explicitly in a header.
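For what it's worth, both cases are short in Python. This is only a sketch with made-up function names. The second helper sidesteps the typo concern by splitting on the bare backtick and trimming whitespace, which assumes leading and trailing spaces in an item are never significant.

```python
def split_artist_title(name):
    """Split 'Artist - Song title' on the first ' - ' only."""
    artist, separator, title = name.partition(" - ")
    if not separator:
        raise ValueError("no ' - ' delimiter found")
    return artist, title


def split_backticks(line):
    """Split on backticks and trim spaces, tolerating uneven spacing."""
    return [part.strip() for part in line.split("`")]


assert split_artist_title("Artist - Song title") == ("Artist", "Song title")
assert split_backticks("A ` B `C ` D") == ["A", "B", "C", "D"]
```

Using `partition` rather than `split` keeps any later " - " inside the song title intact.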

bryc