0

I have a text file, and I need to remove some trailing delimiters. The text file looks like this:

string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39
                2|'Nguyen Van B'|'Nguyen| Van B'|39";
string result = @"1|'Nguyen Van A'|'Nguyen Van A'|39
                  2|'Nguyen Van B'|'Nguyen Van B'|39";

I want to remove the char "|" In the string "Nguyen Van | A" and "Nguyen | Van B"

So I think the best way is to do a Regex replace? Can anyone help me with this regex?

Thanks

pstrjds
  • 16,840
  • 6
  • 52
  • 61
  • 1
    So you have a text file that is pipe delimited and you allow pipes within an entry when the entry is quoted? Am I understanding that correctly? Also are you allowed to use a third party library? – pstrjds Mar 11 '17 at 11:49
  • I would write it by hand instead of fighting against regexes... In the end it is a simple state machine. – xanatos Mar 11 '17 at 11:53
  • `BULK INSERT Product FROM 'D:\product.data' WITH ( FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' );` I use BULK INSERT SQL Server to import file but have errors in these fields. Bulk load data conversion error (type mismatch or invalid character for the specified codepage). – Nguyễn Quang Mar 11 '17 at 11:59
  • Dear xanatos, Currently, I am in the data interface of the two systems so everything is import/export automatically. Thanks – Nguyễn Quang Mar 11 '17 at 12:04
  • @NguyễnQuang - I am also sort of confused now. Are you doing this in code to prepare your bulk insert file? The default FIELDQUOTE character is `"` so you will potentially have other issues, not just the extraneous pipe characters? Maybe it would be better if you wrote the question stating the actual problem rather than your proposed workaround, as there are potentially better ways to solve it. – pstrjds Mar 11 '17 at 12:08
  • Dear pstrjds, I have read this article it looks like SQL Server does not support this case. http://stackoverflow.com/questions/12902110/bulk-insert-correctly-quoted-csv-file-in-sql-server – Nguyễn Quang Mar 11 '17 at 12:10
  • @NguyễnQuang - Wow, that was a surprising read. I didn't realize that was broken (it has been at least 3 years since I have done anything with SQL server, I just remembered that setting being there and suggested it). So basically, you want to read in your whole CSV, clean the data and then write it back out so you can do a bulk insert into SQL. Is that correct? – pstrjds Mar 11 '17 at 12:15
  • Yes, that's right. Because the BULK INSERT command does not accept these record. I need to clean it before using the BULK INSERT. – Nguyễn Quang Mar 11 '17 at 12:24
  • @NguyễnQuang - I have added an answer that may be helpful to you. It shows how to do this without the Regex. – pstrjds Mar 11 '17 at 14:53

2 Answers2

1

The regex should be:

(?<=^[^']*'([^']*'[^']*')*[^']*)\|

to be used Multiline... so

var rx = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|", RegexOptions.Multiline);

string text = @"1|'Nguyen Van| A'|'Nguyen Van A'|39

2|'Nguyen Van B'|'Nguyen| Van B'|39";

string replaced = rx.Replace(text, string.Empty);

Example: http://ideone.com/PTdsg5

I strongly suggest against using it... To explain why... Try to comprehend the regular expression. If you can comprehend it, then you can use it :-)

I would write a simple state machine that counts ' and replaces the | when the counted ' is odd.

xanatos
  • 109,618
  • 12
  • 197
  • 280
  • Because the BULK INSERT command does not accept these record. I need to clean it before using the BULK INSERT. http://stackoverflow.com/questions/12902110/bulk-insert-correctly-quoted-csv-file-in-sql-server The solution solved the problem I encountered. However, for large file data, the process runs for a long time. Do you have a better way? Such as remove char | of content begins with string |' and the end is string '| Instead of counts ' and replaces the | when the counted ' is odd. – Nguyễn Quang Mar 11 '17 at 13:24
  • @NguyễnQuang - You should post this as a new question and be sure to post the relevant code (i.e. how you are opening and parsing the file, where you are calling the regex.Replace and how you are writing out the new file) to that question. – pstrjds Mar 11 '17 at 13:37
  • Thanks for looking for a solution here: http://stackoverflow.com/questions/13305139/c-sharp-replace-expression-matching-start-end – Nguyễn Quang Mar 11 '17 at 13:56
  • @NguyễnQuang - Please post a new question with the code you are using. I am assuming that you are loading the whole file and using this Regex with the multiline mode. Post the new question and I can post an answer to help you. – pstrjds Mar 11 '17 at 14:00
0

You mentioned using the multiline regex is taking too long and asked about the state machine approach. So here is some code using a function to perform the operation (note, the function could probably use a little cleaning, but it shows the idea and works faster than the regex). In my testing, using the regex without multiline, I could process 1,000,000 lines (in memory, not writing to a file) in about 34 seconds. Using the state-machine approach it was about 4 seconds.

string RemoveInternalPipe(string line)
{
    int count = 0;
    var temp = new List<char>(line.Length);
    foreach (var c in line)
    {
        if (c == '\'')
        {
            ++count;
        }
        if (c == '|' && count % 2 != 0) continue;
        temp.Add(c);
    }
    return new string(temp.ToArray());
};

File.WriteAllLines(@"yourOutputFile",
    File.ReadLines(@"yourInputFile").Select(x => RemoveInternalPipe(x)));

To compare the performance against the Regex version (without the multiline option), you could run this code:

var regex = new Regex(@"(?<=^[^']*'([^']*'[^']*')*[^']*)\|");
File.WriteAllLines(@"yourOutputFile",
    File.ReadLines(@"yourInputFile").Select(x => regex.Replace(x, string.Empty));
pstrjds
  • 16,840
  • 6
  • 52
  • 61