12

I've got a pipe-delimited file that I would like to split (I'm using C#). For example:

This|is|a|test

However, some of the data can contain a pipe in it. If it does, it will be escaped with a backslash:

This|is|a|pip\|ed|test (this is a pip|ed test)

I'm wondering if there is a regexp or some other method to split this apart on just the "pure" pipes (that is, pipes that have no backslash in front of them). My current method is to replace the escaped pipes with a custom bit of text, split on pipes, and then replace my custom text with a pipe. Not very elegant and I can't help but think there's a better way. Thanks for any help.

Ivan Chaer
Frijoles
  • Have you seen [this (monster) thread](http://stackoverflow.com/questions/2148587/regex-quoted-string-with-escaped-quotes-in-c). Not a direct answer, but hopefully a push in the right direction. – dawebber Apr 28 '11 at 04:32
  • What if you want a literal backslash at the end of one of the pieces? – Random832 Apr 28 '11 at 04:42

6 Answers

9

Just use String.IndexOf() to find the next pipe. If the previous character is not a backslash, then use String.Substring() to extract the word. Alternatively, you could use String.IndexOfAny() to find the next occurrence of either the pipe or backslash.

I do a lot of parsing like this, and it is really pretty straightforward. Done correctly, this approach will also tend to run faster.

EDIT

In fact, maybe something like this. It would be interesting to see how this compares performance-wise to a RegEx solution.

public List<string> ParseWords(string s)
{
    List<string> words = new List<string>();

    int pos = 0;
    while (pos < s.Length)
    {
        // Get word start
        int start = pos;

        // Get word end
        pos = s.IndexOf('|', pos);
        while (pos > 0 && s[pos - 1] == '\\')
        {
            pos++;
            pos = s.IndexOf('|', pos);
        }

        // Adjust for pipe not found
        if (pos < 0)
            pos = s.Length;

        // Extract this word
        words.Add(s.Substring(start, pos - start));

        // Skip over pipe
        if (pos < s.Length)
            pos++;
    }
    return words;
}
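
The String.IndexOfAny() alternative mentioned above is not shown in the code. Here is a minimal sketch of it, under some assumptions that go beyond the original answer: a backslash escapes whatever character follows it (so `\\` is a literal backslash), fields are unescaped while they are extracted, and a line ending in a pipe produces a trailing empty field. The names `PipeParser` and `SplitEscaped` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PipeParser
{
    // Hypothetical helper: splits on unescaped pipes and unescapes as it goes.
    public static List<string> SplitEscaped(string s)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        char[] special = { '|', '\\' };

        int pos = 0;
        while (pos < s.Length)
        {
            int next = s.IndexOfAny(special, pos);
            if (next < 0)
            {
                // No more special characters: the rest belongs to this field
                current.Append(s, pos, s.Length - pos);
                break;
            }

            // Copy the plain text that precedes the special character
            current.Append(s, pos, next - pos);

            if (s[next] == '|')
            {
                // Unescaped pipe: the current field ends here
                fields.Add(current.ToString());
                current.Clear();
                pos = next + 1;
            }
            else if (next + 1 < s.Length)
            {
                // Backslash: take the following character literally
                current.Append(s[next + 1]);
                pos = next + 2;
            }
            else
            {
                // Assumption: a lone backslash at the very end is kept as-is
                current.Append('\\');
                pos = next + 1;
            }
        }

        // Always add the last field, so "a|b|" yields a trailing empty field
        fields.Add(current.ToString());
        return fields;
    }
}
```

With these assumptions, `PipeParser.SplitEscaped(@"This|is|a|pip\|ed|test")` gives five fields, the fourth being `pip|ed`. If empty fields should be dropped instead, skip the `Add` calls when `current.Length == 0`.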
Jonathan Wood
  • Yes this is better, parsing `string` on your own way than using `regex`. This runs faster. +1 – KaeL Apr 28 '11 at 04:39
  • If you don't add the words to a `List` and return it, the manual parsing method comes in at about 5 times faster than the regex method. If you add back the overhead of managing a `List`, it's about 3 times faster, on my machine anyway. – Cᴏʀʏ Apr 28 '11 at 14:55
  • See my update... I changed my test and got the Regex implementation down to about 1.6 times slower, but, you still win! – Cᴏʀʏ Apr 28 '11 at 15:20
  • I think this has an issue if the last "word" is blank/empty. I have a file with 37 header column names, but the last element of each row is blank, so the lines end with the pipe "|" but no additional blank space; the words in this case return only 36 – Adam Apr 26 '19 at 00:21
  • I think this may also run into issues when there's an escaped backslash at the end of a field.. such as "data\\|more data|".. dealing with this headache from client data >. – Adam Apr 26 '19 at 00:36
  • @Adam: This is from eight years ago, but one could argue that if a field is empty, it shouldn't be counted. So I'm just saying that how this case is handled depends on the requirements. Shouldn't be hard to modify the code to handle it differently. (And I'd be happy to customize it to different requirements, but that would be as a paid consultant.) – Jonathan Wood Apr 26 '19 at 00:49
  • thanks @JonathanWood :) I was able to modify it slightly to fit my needs - much appreciated! – Adam May 16 '19 at 04:30
5

This oughta do it:

string test = @"This|is|a|pip\|ed|test (this is a pip|ed test)";
string[] parts = Regex.Split(test, @"(?<!(?<!\\)*\\)\|");

The regular expression basically says: split on pipes that aren't preceded by an escape character. I shouldn't take any credit for this though, I just hijacked the regular expression from this post and simplified it.
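
If backslashes can themselves be escaped, the lookbehind has to count them: a pipe is a real delimiter only when it is preceded by an even number of backslashes (including zero). A rough sketch of that idea, relying on .NET's support for variable-length lookbehind; the class and member names are invented for this example, and the fields are returned still escaped:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexPipeSplitter
{
    // (?<!\\)    the run of backslashes must not be preceded by another backslash
    // (?:\\\\)*  zero or more PAIRS of backslashes (non-capturing, so that
    //            Regex.Split does not add captured text to the result)
    // \|         the delimiter itself
    private static readonly Regex UnescapedPipe =
        new Regex(@"(?<=(?<!\\)(?:\\\\)*)\|");

    // Returns the raw fields; escape sequences are left untouched.
    public static string[] Split(string line)
    {
        return UnescapedPipe.Split(line);
    }
}
```

For `This|is|a|pip\|ed|test` this yields five parts, with `pip\|ed` as the fourth. For `A\\|b` it splits into `A\\` and `b`, because the pipe follows an escaped backslash rather than an escaping one.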

EDIT

In terms of performance, compared to the manual parsing method provided in this thread, I found that this Regex implementation is 3 to 5 times slower than Jonathan Wood's implementation using the longer test string provided by the OP.

With that said, if you don't instantiate or add the words to a List<string> and return void instead, Jon's method comes in at about 5 times faster than the Regex.Split() method (0.01ms vs. 0.002ms) for purely splitting up the string. If you add back the overhead of managing and returning a List<string>, it was about 3.6 times faster (0.01ms vs. 0.00275ms), averaged over a few sets of a million iterations. I did not use the static Regex.Split() for this test; instead, I created a new Regex instance with the expression above outside of my test loop and then called its Split method.

UPDATE

Using the static Regex.Split() function is actually a lot faster than reusing an instance of the expression. With this implementation, the use of regex is only about 1.6 times slower than Jon's implementation (0.0043ms vs. 0.00275ms).

The results were the same using the extended regular expression from the post I linked to.

Community
Cᴏʀʏ
  • Assuming that backslashes can also be escaped (e.g. `"This|is|a|pip\\|ed|test (this is a pip|ed test)"`), this doesn't work. You'll need to use the full one from the post mentioned. – porges Apr 28 '11 at 04:56
  • @Porges You're right. That's the first thing I thought when I decided to write some code about it :) – Oscar Mederos Apr 28 '11 at 05:01
2

I came across a similar scenario. In my case the number of pipes was fixed (not counting the escaped "\|" pipes). This is how I handled it:

string sPipeSplit = "This|is|a|pip\\|ed|test (this is a pip|ed test)";
string sTempString = sPipeSplit.Replace("\\|", "¬"); // replace \| with a placeholder that must never occur in the data
string[] sSplitString = sTempString.Split('|');
// With a fixed number of fields, restore the placeholder while copying each field elsewhere:
//string sFirstString = sSplitString[0].Replace("¬", "\\|");
/* Or use a loop to restore everything at once. Replace returns a new string,
   so the result has to be assigned back to the array element:
for (int i = 0; i < sSplitString.Length; i++)
{
    sSplitString[i] = sSplitString[i].Replace("¬", "\\|");
}
*/
Akshay
1

Here is another solution.

One of the most beautiful things about programming is that there are several ways to solve the same problem:

string text = @"This|is|a|pip\|ed|test"; //The original text
string parsed = ""; //Where you will store the parsed string

bool flag = false;
foreach (var x in text.Split('|')) {
    bool endsWithArroba = x.EndsWith(@"\");
    parsed += flag ? "|" + x + " " : endsWithArroba ? x.Substring(0, x.Length-1) : x + " ";
    flag = endsWithArroba;
}
Oscar Mederos
0

Cory's solution is pretty good. But if you prefer not to work with Regex, you can simply search for "\|" and replace it with some other character, do your split, and then replace that character with "\|" again.

Another option is to do the split, then examine all the strings, and if the last character of one is a \, join it with the next string.

Of course, all this ignores what happens if you need an escaped backslash before a pipe.. like "\\|".
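
The split-then-rejoin option could look roughly like the sketch below. It shares the limitation just mentioned (an escaped backslash directly before a pipe is treated as an escaped pipe), and it assumes that the escaping backslash should be dropped from the result. `SplitAndRejoin` is a name made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class SplitAndRejoin
{
    // Splits on every pipe, then glues pieces back together whenever the
    // previous piece ended with a backslash (i.e. the pipe was escaped).
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        string pending = null; // text of a field that is still being assembled

        foreach (string piece in line.Split('|'))
        {
            string current = pending == null ? piece : pending + "|" + piece;

            if (current.EndsWith("\\", StringComparison.Ordinal))
            {
                // Escaped pipe: drop the backslash and keep accumulating
                pending = current.Substring(0, current.Length - 1);
            }
            else
            {
                fields.Add(current);
                pending = null;
            }
        }

        // Assumption: a backslash at the very end of the line is kept literally
        if (pending != null)
            fields.Add(pending + "\\");

        return fields;
    }
}
```

Calling `SplitAndRejoin.Split(@"This|is|a|pip\|ed|test")` returns five fields, with `pip|ed` as the fourth.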

Overall, I lean towards regex though.

Frankly, I prefer to use FileHelpers because, even though this isn't comma delimited, it's basically the same thing. And they have a great story about why you shouldn't write this stuff yourself.

Erik Funkenbusch
0

You can do this with a regex. Once you decide to use a backslash as your escape character, you have two escape cases to account for:

  • Escaping a pipe: \|
  • Escaping a backslash that you want interpreted literally.

Both of these can be done in the same regex. Escaped backslashes will always be two \ characters together. Consecutive, escaped backslashes will always be even numbers of \ characters. If you find an odd-numbered sequence of \ before a pipe, it means you have several escaped backslashes, followed by an escaped pipe. So you want to use something like this:

/^(?:((?:[^|\\]|(?:\\{2})|\\\|)+)(?:\||$))*/

Confusing, perhaps, but it should work. Explanation:

^              #The start of a line
(?:...
    [^|\\]     #A character other than | or \ OR
    (?:\\{2})  #An escaped backslash (two \ characters) OR
    \\\|       #A literal \ followed by a literal |
...)+          #Repeat the preceding at least once
(?:$|\|)       #Either a literal | or the end of a line
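
In C#, one way to get the individual fields out of this pattern is through the Captures collection of the repeated group. This is only a sketch with invented names (`CaptureSplitter`, `Fields`); note that, because of the `+`, an empty field stops the match, and the returned fields are still escaped.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureSplitter
{
    // The pattern from above, without the surrounding / delimiters.
    // Group 1 is repeated, so .NET records one capture per field.
    private static readonly Regex Line =
        new Regex(@"^(?:((?:[^|\\]|(?:\\{2})|\\\|)+)(?:\||$))*");

    // Returns the raw (still escaped) fields matched from the start of the line.
    public static List<string> Fields(string line)
    {
        var fields = new List<string>();
        Match m = Line.Match(line);
        foreach (Capture c in m.Groups[1].Captures)
        {
            fields.Add(c.Value);
        }
        return fields;
    }
}
```

For `This|is|a|pip\|ed|test` this returns five captures, the fourth being `pip\|ed`; for `A\\|b` it returns `A\\` and `b`.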
Justin Morgan - On strike
  • @Justin for some reason it isn't working on my computer. Also, a `)` is missing. – Oscar Mederos Apr 28 '11 at 05:03
  • @Oscar - There were so many nested parentheses it was tough to keep track. Try it now. – Justin Morgan - On strike Apr 28 '11 at 05:07
  • @Justin now it works, although it is happening the same with @Cory solution: **A\\|b** should become **A\|b** instead of A\\ and **b**. The first \\ is a character like any other, and the second one is escaping the **|**, so the second one will be removed and the sentence will remain as it is. – Oscar Mederos Apr 28 '11 at 05:10
  • @Oscar - If you input `A\\|b`, you have escaped the backslash character itself, so it should be interpreted as `A\` plus `b`. To get `A\|b`, you would input `A\\\|b`. That's how I would expect it to work, myself, and it's consistent with most escape schemes I've seen. In C#, for example, the string `\\\n` would be a literal `\` and a newline. – Justin Morgan - On strike Apr 28 '11 at 05:24
  • @Justin it depends on how you take it. When someone tells you: `I want to parse the string ABC\DE`, you should assume that \ is already being escaped. Otherwise the original example doesn't make sense, because C# itself will give an error if you write "\|" because you are escaping nothing here. In order to resume, what I think is that the string to parse is literal (already escaped). – Oscar Mederos Apr 28 '11 at 06:31
  • @Oscar - I see what you're getting at. On the other hand, if you don't do it this way, there'd be no way to have an input ending in a literal backslash. If you wanted "A\" and "b", neither **A\|b** nor **A\\|b** would work. Declaring \ as an escape character forces the user to escape it throughout the text, but it does allow all possible inputs. That might not even be valid for the questioner's situation, but I decided to go with the least restrictive option. BTW, looks like we've both run afoul of Stack Overflow's own escaping rules. – Justin Morgan - On strike Apr 28 '11 at 14:07