I need to split lines into parts in a big file (200 MB to 5 GB) where each line looks like this:

value1;value2;"value3;extra";value4;"value5;extra"

Each line needs to be split on semicolons. A plain String.Split does not work, because semicolons can also appear inside quoted values.

I think regular expression would work best here especially if file has millions of lines. I appreciate any guidance or code that would help me to split.

Update:

The result I want to see for the above sample line is

value1
value2
"value3;extra"
value4
"value5;extra"

Thank you

SWeko
Alex S
  • 1
    string.split will work, I guess that it will just not follow all your expectations. Please provide an example of how do you want the output line to look like. Also, you may want to use the term "semicolon" instead of "semi column". – BartoszKP Aug 12 '13 at 13:23
  • Are you able to process this file line-by-line, or will it be presented as chunks of data? – Adrian Wragg Aug 12 '13 at 13:25
  • can you remove the "'s before the split on ;? – Yoztastic Aug 12 '13 at 13:25
  • @AdrianWragg I have the whole file available – Alex S Aug 12 '13 at 13:25
  • @Yoztastic quotes are part of the value that I need to retain – Alex S Aug 12 '13 at 13:26
  • @BartoszKP I added sample result that I want to get after splitting one line – Alex S Aug 12 '13 at 13:26
  • @m.buettner This is not the same as CSV, please reconsider your downvote – Alex S Aug 12 '13 at 13:28
  • CSV files, contrary to the name, are not always comma-separated. Basically CSV is an umbrella term for all files where there are records in separate lines, with fields separated by some delimiter. Try using something like [CsvHelper](https://github.com/JoshClose/CsvHelper) to parse the lines. – SWeko Aug 12 '13 at 13:31
  • I would recommend the KBCSV data reader, which can do the job easily and can handle the various problems with quoted data. – Christian Sauer Aug 12 '13 at 13:34
  • Parse it as a CSV file. The `TextFieldParser` class recommended in the answer is the way to go. – Jim Mischel Aug 12 '13 at 13:35
  • "I think regular expression would work best here especially if file has millions of lines." That's the slowest approach. – Tim Schmelter Aug 12 '13 at 13:41
  • @TimSchmelter Tim thanks for the comment, would regex improve performance when matching against the whole file and not just one line? – Alex S Aug 12 '13 at 13:43
  • @AlexS: That makes no difference, also, i don't see the need for regex here. – Tim Schmelter Aug 12 '13 at 13:44
  • @AlexS as the others pointed out, this is very well CSV. and I never downvoted. – Martin Ender Aug 12 '13 at 15:01

1 Answer

Add a reference to Microsoft.VisualBasic and use the TextFieldParser class:

using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

class Program
{
    static void Main(string[] args)
    {
        using (var input = File.OpenRead("input.txt"))
        using (var tfp = new TextFieldParser(input))
        {
            tfp.SetDelimiters(new string[] { ";" });
            // Treat "..." as one field so that embedded semicolons do not split it.
            tfp.HasFieldsEnclosedInQuotes = true;

            // Read one record at a time so the whole file is never held in memory.
            while (!tfp.EndOfData)
            {
                string[] fields = tfp.ReadFields();
                foreach (var field in fields)
                {
                    Console.WriteLine(field);
                }
            }
        }
    }
}
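One caveat worth noting: with HasFieldsEnclosedInQuotes = true, TextFieldParser returns each field without its enclosing quotes, whereas the expected output in the question keeps them. If the quotes have to be retained, a small hand-written splitter that tracks whether it is currently inside quotes is one option. The following is only a minimal sketch and not part of the original answer: QuoteAwareSplitter and Split are hypothetical names, and it assumes that a quoted value never spans more than one line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class QuoteAwareSplitter
{
    // Hypothetical helper: splits on the delimiter only when outside
    // double quotes, and keeps the quote characters in the returned fields.
    public static List<string> Split(string line, char delimiter = ';')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in line)
        {
            if (c == '"')
            {
                // Toggle the quote state; the quote itself stays in the field.
                inQuotes = !inQuotes;
                current.Append(c);
            }
            else if (c == delimiter && !inQuotes)
            {
                // Delimiter outside quotes: the current field is complete.
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        // The last field has no trailing delimiter.
        fields.Add(current.ToString());
        return fields;
    }
}

class Program
{
    static void Main()
    {
        // Sample line from the question.
        string line = "value1;value2;\"value3;extra\";value4;\"value5;extra\"";
        foreach (string field in QuoteAwareSplitter.Split(line))
        {
            Console.WriteLine(field);
        }
    }
}
```

For a large file, the lines could be streamed with File.ReadLines("input.txt") and passed to Split one at a time, which avoids loading the whole file at once.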
Tim Schmelter
Alex Filipovici
  • 2
    +1. [TextFieldParser class](http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx) – Jim Mischel Aug 12 '13 at 13:33
  • 1
    Thank you, I appreciate your answer – Alex S Aug 12 '13 at 13:37
  • Note to admins: I did not expect to get this answer with so many downvotes :) – Alex S Aug 12 '13 at 13:39
  • 1
    @AlexS: Note that your sample data suggests that you are using a quoting character `"`. Therefore set `HasFieldsEnclosedInQuotes = true`. Also, use the `using` statement (I edited Alex's answer accordingly). Btw, Stack Overflow is moderated by its users (with enough reputation). – Tim Schmelter Aug 12 '13 at 13:51