0

There are lots of search results on the web (and in SO) for something similar to what I need to do, but I haven't run into a solution yet for my particular situation.

I have a comma-delimited file in which only columns that have commas within them have double quotes around them. Other fields that have no comma in them are simply separated by the comma.

Take this example:

123,"box,toy",phone,"red,car,cat,dog","bike,pencil",man,africa,yellow,"jump,rope"

The output for that line needs to be:

123|box,toy|phone|red,car,cat,dog|bike,pencil|man|africa|yellow|jump,rope

I have this code currently:

Using sr As New StreamReader(csvFilePath)
    Dim line As String = ""
    Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
    Dim strReplacerQuoteComma As String = Chr(34) & ","
    Dim strReplacerCommaQuote As String = "," & Chr(34)

    Do While sr.Peek <> -1
        line = sr.ReadLine
        line = Replace(line, strReplacerQuoteCommaQuote, "|")
        line = Replace(line, strReplacerQuoteComma, "|")
        line = Replace(line, strReplacerCommaQuote, "|")
        line = Replace(line, Chr(34), "")

        Console.WriteLine("line: " & line)
    Loop
End Using

The problem with that process is when I get to the fourth Replace() line the string looks like this:

123|box,toy|phone|red,car,cat,dog|bike,pencil|man,africa,yellow|jump,rope

So man and africa need pipes after them, but I can't just do a Replace on all commas, obviously.

How can I do this? Is there a RegEx statement that could handle this?

UPDATE With Working Code

The link in Avinash's comment had the answer that I went with. I Imported System.Text.RegularExpressions and used the following:

Using sr As New StreamReader(csvFilePath)
    Dim line As String = ""
    Dim strReplacerQuoteCommaQuote As String = Chr(34) & "," & Chr(34)
    Dim strReplacerQuoteComma As String = Chr(34) & ","
    Dim strReplacerCommaQuote As String = "," & Chr(34)

    Do While sr.Peek <> -1
        line = sr.ReadLine
        Dim pattern As String = "(,)(?=(?:[^""]|""[^""]*"")*$)"
        Dim replacement As String = "|"
        Dim regEx As New Regex(pattern)

        Dim newLine As String = regEx.Replace(line, replacement)
        newLine = newLine.Replace(Chr(34), "")

        Console.WriteLine("newLine: " & newLine)
    Loop
End Using
Community
  • 1
  • 1
marky
  • 4,878
  • 17
  • 59
  • 103
  • 2
    i think it's better to use csv file parsers. If you wanna regex then this [answer](http://stackoverflow.com/a/25544437/3297613) will help you to match all the commas which are present outside the double quotes. Finally replace all the matched commas with `|` – Avinash Raj Dec 12 '14 at 18:53
  • 1
    In the future, I would stay away from any regex like this `"(,)(?=(?:[^""]|""[^""]*"")*$)"` as it has to look ahead to the end of the string each time it matches a comma, like n factorial. –  Dec 12 '14 at 20:00
  • Sure enough: I'm currently testing a full "production"-version file that end users will be converting. It's got around 90k rows and it's taking a LOOOOOOOOONG time to convert! What's another angle of attack on this? – marky Dec 12 '14 at 20:40
  • Sounds like you have some other issue. I just processed a text file with 90k rows of `123,"box,toy",phone,"red,car,cat,dog","bike,pencil",man,africa,yellow,"jump,rope"` in well under 2 seconds. What are you actually doing with the extracted data? – The Blue Dog Dec 12 '14 at 21:20
  • @Blue Dog Interesting that yours processed so fast. The actual files I'm using have 14 fields with, at the most, about 200 chars in one or two fields per row, but I wouldn't think that would make that much of a difference. I'll have to set up a test app with just the conversion code and see if I can track down whatever may be slowing it down. Thanks for checking on that, though. – marky Dec 12 '14 at 21:36
  • Dunno mate, I had another bash this morning (see edit 2) with a massive input file and wrote it back out to another file - still way quicker than you. – The Blue Dog Dec 13 '14 at 09:05
  • I posted some benchmark if that helps at all. –  Dec 13 '14 at 13:53

3 Answers3

3

This seems to work for your example:

Dim result = Regex.Replace(input, ",(?=([^""]*""[^""]*"")*[^""]*$)", Function(m) m.Value.Replace(",", "|"))
result = result.Replace(Chr(34), "")

enter image description here

See the accepted answer here for the regex explanation and be sure to upvote @mathematical.coffee's answer there while you're at it as I basically just stole his regex.

Edit: Regarding your performance issue, I created a file with 90k lines of:

abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,yellow,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"

which roughly equals a file size of 35MB, my laptop (nothing special) will parse that in around 6.5 seconds.

Yes, the regex is slow and the TextFieldParser class is also widely reported as not being the quickest but you obviously have some other bottleneck in your code if you are still processing for over 5 minutes. Please note that I am not actually doing anything with the parsed result though.

Edit 2: Ok, I thought I'd have one last go at this (I was bored this morning) but I still can't replicate your extended conversion time.

Time to get brutal, I created an input file with 150k lines of:

abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"

Each line has 1140 characters, total file size ~167MB.

Reading, converting and writing back out to a new file using the following code took 29 seconds.

Dim line, result As String
Dim replace As String = ",(?=([^""]*""[^""]*"")*[^""]*$)"
Using sw As New StreamWriter("d:\output.txt")
    Using sr As New StreamReader("d:\input.txt")
        While Not sr.EndOfStream
            line = sr.ReadLine
            result = Regex.Replace(line, replace, Function(m) m.Value.Replace(",", "|"))
            sw.WriteLine(result.Replace(Chr(34), ""))
        End While
    End Using
End Using

Edit 3: Using @sln's regex and this code cuts the processing time for the same file to 4 seconds.

Dim line, result As String
Dim pattern As String = ",([^,""]*(?:""[^""]*"")?[^,""]*)(?=,|$)"
Dim replacement As String = "|$1"
Dim rgx As New Regex(pattern)
Using sw As New StreamWriter("d:\output.txt")
    Using sr As New StreamReader("d:\input.txt")
        While Not sr.EndOfStream
            line = sr.ReadLine
            result = rgx.Replace(line, replacement)
            sw.WriteLine(result.Replace(Chr(34), ""))
        End While
    End Using
End Using

So there you go, I think you have a winner. As sln states, this is a relative test so machine speed is irrelevant.

,(?=([^"]*"[^"]*")*[^"]*$)          took 29 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)  took 4 seconds

Finally (and just for completeness) the solution proposed by @jawood2005 is very workable:

Dim line As String
Dim fields As String()
Using sw As New StreamWriter("d:\output.txt")
    Using tfp As New FileIO.TextFieldParser("d:\input.txt")
        tfp.TextFieldType = FileIO.FieldType.Delimited
        tfp.Delimiters = New String() {","}
        tfp.HasFieldsEnclosedInQuotes = True
        While Not tfp.EndOfData
            fields = tfp.ReadFields
            line = String.Join("|", fields)
            sw.WriteLine(line.Replace(Chr(34), ""))
        End While
    End Using
End Using

Using the same 150k row input file as the regex solutions this completes in 18 seconds, so better than mine but sln wins the prize for the fastest solution to your problem.

Stevoisiak
  • 23,794
  • 27
  • 122
  • 225
The Blue Dog
  • 2,475
  • 3
  • 19
  • 25
  • Seems that's all over SO. Avinash got the same regex from another link to an answer by MarcusQ. You get the Answered check though. :) – marky Dec 12 '14 at 19:34
  • I'm no regex expert by any stretch of the imagination, fortunately SO has many of them. Ultimately all code is recycled to a point but I'll always credit my source. – The Blue Dog Dec 12 '14 at 19:46
  • This method is actually pretty bad, the overhead in doing constant voluminous lookahead to the end of string is terrible. –  Dec 12 '14 at 19:47
  • @sln: I was just thinking that I remembered you from a question I asked before. My regex hasn't improved any since then! – The Blue Dog Dec 12 '14 at 19:48
  • Nah, its okay. But the latency is an issue. –  Dec 12 '14 at 19:54
  • @sln: Having seen your answer (and actually understood what you're doing there), I totally agree with you. I've flagged this answer for removal, OP really should accept yours instead. – The Blue Dog Dec 12 '14 at 20:00
  • I would not remove your answer. And I don't come here for the points. –  Dec 12 '14 at 20:05
  • @sln: Neither do I and I also don't want to suggest inefficient ways of doing things either. I'm fine with having it deleted, honestly ... – The Blue Dog Dec 12 '14 at 20:10
  • Okay, guys, enough bickering ;) Truth be told, it was the link in Avinash's comment that I followed and used. I merely selected Blue Dog's answer because it was an Answer that contained the code that I used. – marky Dec 12 '14 at 21:31
  • Blue Dog, Wow! Thanks for the time you put into this. I separated out the method to convert the CSV (I found I had it nested in a couple of other functions) and copy/pasted your code (the Edit 3 version) and it ran on a 97K row file in about 3 seconds! I really appreciate the help! – marky Dec 15 '14 at 15:44
  • @marky: No problem at all fella, glad you got it sorted. – The Blue Dog Dec 15 '14 at 17:37
3

Bullet proof way.

 # Validate even quotes (one time match):  ^[^"]*(?:"[^"]*"[^"]*)*$   
 # Then ->
 # ----------------------------------------------
 # Find:  /,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)/
 # Replace:  '|$1'

 ,
 (                             # (1 start)
      [^,"]*  
      (?: " [^"]* " )?
      [^,"]*  
 )                             # (1 end)
 (?= , | $ )

Benchmarks

Since @TheBlueDog posted a Benchmark ('Edit 2'), I thought I would post a
benchmark as well.

Its based on his input, and the intent is to demonstrate the evils of using
'to-the-end-of-string' lookahead as a validation technique
(ie. this -> ^[^"]*(?:"[^"]*"[^"]*)*$)

Blue Dog's regex replace method is hampered a bit with an unnecessary callback, so I
imagine that accounts for some of his bad numbers.

Don't know Vb.net so this is done in Perl. The machine speed and language are factored out
because its a relative test.

Summary:

,(?=([^"]*"[^"]*")*[^"]*$)          took 10 seconds
,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)  took 2 seconds  

This represents a 5 fold difference.

Benchmark in Perl, 150K line (167MB file):

use strict;
use warnings;

use Benchmark ':hireswallclock';
my ($t0,$t1);
my ($infile, $outfile);

my $tstr = 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz","abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz",abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz,"abcdefghijklmnopqrstuvwxyz,abcdefghijklmnopqrstuvwxyz"
';

# =================================================
print "\nMaking 150K line (167MB file), csv_data_in.txt ...";

open( $infile, ">", 'csv_data_in.txt' ) or die "can't open 'csv_data_in.txt' for writing $!";
for (1 .. 150_000)
{
   print $infile $tstr;
}
close( $infile );

print "\nDone !\n\n";

# =================================================
print "Converting delimiters, writing to csv_data_out.txt ...";

open( $infile, "<", 'csv_data_in.txt' ) or die "can't open 'csv_data_in.txt' for readimg $!";
open( $outfile, ">", 'csv_data_out.txt' ) or die "can't open 'csv_data_out.txt' for writing $!";

my $line = '';

$t0 = new Benchmark;
while( $line = <$infile> )
{
    # Validation - Uncomment to check line for even quotes, otherwise don't
    # if ( $line =~ /^[^"]*(?:"[^"]*"[^"]*)*$/ )
    # {
        $line =~ s/,([^,"]*(?:"[^"]*")?[^,"]*)(?=,|$)/|$1/g;
    # }
    print $outfile $line;
}
$t1 = new Benchmark;

close( $infile );
close( $outfile );

print "\nDone !\n";
print "Conversion took: ", timestr(timediff($t1, $t0)), "\n\n";

Output:

Making 150K line (167MB file), csv_data_in.txt ...
Done !

Converting delimiters, writing to csv_data_out.txt ...
Done !
Conversion took: 2.1216 wallclock secs ( 1.87 usr +  0.17 sys =  2.04 CPU)
  • That's mighty impressive, you've certainly proved a point! I took the liberty of using your regex in the VB code and got 4.3 seconds, see edit. This has been fun! – The Blue Dog Dec 13 '14 at 14:09
1

This may not be the best solution, but it should work...

I'm 99% sure you're using StreamReader ("sr") to read the file. Try reading it in with a FileIO.TextFieldParser, which will allow you to split the line into string array.

Dim aFile As FileIO.TextFieldParser = New FileIO.TextFieldParser(filePath)
Dim temp() As String ' this array will hold each line of data
Dim order As doOrder = Nothing
Dim orderID As Integer
Dim myDate As DateTime = Now.ToString

aFile.TextFieldType = FileIO.FieldType.Delimited
aFile.Delimiters = New String() {","}
aFile.HasFieldsEnclosedInQuotes = True

temp = aFile.ReadFields

' parse the actual file
Do While Not aFile.EndOfData...

In the loop, keep using "aFile.ReadFields" to read the next line. Once you have your String array, you can concatenate each field with pipes in between them. Kind of messy, and not Regex (don't know if that's an actual condition or just an idea), but will get the job done.

Also, please note the "aFile.HasFieldsEnclosedInQuotes = True" since that is one of the conditions you listed.

EDIT: I see that The Blue Dog gave a Regex answer while I was trying to type... You may still want to look into using the TextFieldParser regardless since you are reading a delimited file. I'll just go away now.

jawood2005
  • 45
  • 4
  • I just posted a comment to Blue Dog's answer that the RegEx approach won't work. I have to work with 90k-plus lines in a csv. I'm running one now that's been going for 5 minutes! I've been looking at the TextFieldParser class, but I'm not sure it will work, as it doesn't seem like that can work with my situation where some fields have quotes and some don't and some fields will have commas in them. And yes, I"m using StreamReader/Writer – marky Dec 12 '14 at 20:49
  • @marky The TextFieldParser can handle lines with or without quotes if the "HasFieldsEnclosedInQuotes" value is true. Also, if there are quotes in a field, that field is lumped together, even with commas in them. I had to use it for a project about 6 months ago (this code was more or less ripped from that). I won't guarantee that it will serve your purposes, but it will meet the criteria you've laid out. – jawood2005 Dec 12 '14 at 21:09
  • Okay, I tested the TextFieldParser and it runs about as slowly as the regEx approach. I.e. it took about 5 minutes to get through about 20k lines in a 90k-line file :( Any other ideas? – marky Dec 12 '14 at 21:22
  • Are you just changing the delimiters and writing to a file? If so, if you're using streamwriter, make sure you flush it. That may increase the speed quite a bit. – jawood2005 Dec 13 '14 at 01:35
  • This solution is workable and actually faster than the regex I originally posted, see my edited post. @sln still wins by a country mile for speed though. :) – The Blue Dog Dec 13 '14 at 16:47