
I've been working with some big (~1GB) delimited text files these days. They look something like this:

COlumn1 #COlumn2#COlumn3#COlumn4
COlumn1#COlumn2#COlumn3 #COlumn4

where # is the delimiter.

If a column is invalid, I might have to remove it from the whole text file. When Column 3 is invalid, the output file should look like this:

COlumn1 #COlumn2#COlumn4
COlumn1#COlumn2#COlumn4

string line = "COlumn1# COlumn2 #COlumn3# COlumn4";
int junk = 3;
int columncount = line.Split(new char[] { '#' }, StringSplitOptions.None).Length;
// remove the [junk-1]th '#' and the value up to the [junk]th '#'
// result: "COlumn1# COlumn2 # COlumn4"

I wasn't able to find a C# version of this on SO. Is there a way I can do that? Please help.

EDIT: The solution I found myself is below, and it does the job. Is there a way I could improve it to reduce the performance impact on large text files?

int junk = 3;
string line = "COlumn1#COlumn2#COlumn3#COlumn4";
int counter = 0;
string[] linearray = line.Split(new char[] { '#' }, StringSplitOptions.None);
int colcount = linearray.Length - 1;        // number of columns left after removal
List<string> linelist = linearray.ToList();
linelist.RemoveAt(junk - 1);                // drop the invalid column
string finalline = string.Empty;
foreach (string s in linelist)
{
    counter++;
    finalline += s;
    if (counter < colcount)
        finalline += "#";                   // re-insert the delimiter between remaining columns
}

Console.WriteLine(finalline);
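
For comparison, here is a minimal sketch of the same removal done with string.Join, which assembles the result in a single pass instead of concatenating strings in a loop (junk and line are the same variables as above):

using System;
using System.Collections.Generic;
using System.Linq;

class RemoveColumnDemo
{
    static void Main()
    {
        int junk = 3;                                   // 1-based index of the column to remove
        string line = "COlumn1#COlumn2#COlumn3#COlumn4";

        List<string> cells = line.Split('#').ToList();
        cells.RemoveAt(junk - 1);                       // drop the invalid column
        string finalline = string.Join("#", cells);     // single delimiter between remaining columns

        Console.WriteLine(finalline);                   // COlumn1#COlumn2#COlumn4
    }
}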
  • Get the array from Split, then remove the element before writing it back. To remove an element in an array, have a look at http://stackoverflow.com/questions/457453/remove-element-of-a-regular-array – cup May 08 '14 at 05:03

2 Answers


EDITED

This method can be very memory-expensive; as you can read in this post, the suggestion is:

If you need to run complex queries against the data in the file, the right thing to do is to load the data into a database and let the DBMS take care of data retrieval and memory management.

To avoid memory consumption you should use a StreamReader to read the file line by line. This could be a start for your task; the invalid-match logic (IsInvalid) is left for you to fill in:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

namespace ConsoleApplication1
{
  class Program
  {
    static void Main(string[] args)
    {

      const string fileName = "temp.txt";

      var results = FindInvalidColumns(fileName);
      using (var reader = File.OpenText(fileName))
      using (var writer = new StreamWriter("new.txt"))  // open the output once, not once per line
      {
        while (!reader.EndOfStream)
        {
          var line = reader.ReadLine();
          if (line == null) continue;
          var split = line.Split(new[] { "#" }, StringSplitOptions.None);

          var builder = new StringBuilder();
          var first = true;
          for (var i = 0; i < split.Length; i++)
          {
            if (results.Contains(i)) continue;    // skip invalid columns
            if (!first) builder.Append('#');      // re-insert the delimiter between kept columns
            builder.Append(split[i]);
            first = false;
          }

          writer.WriteLine(builder.ToString());
        }
      }
    }

    private static List<int> FindInvalidColumns(string fileName)
    {
      var invalidColumnIndexes = new List<int>();
      using (var reader = File.OpenText(fileName))
      {
        while (!reader.EndOfStream)
        {
          var line = reader.ReadLine();
          if (line == null) continue;

          var split = line.Split(new[] { "#" }, StringSplitOptions.None);
          for (var i = 0; i < split.Length; i++)
          {
            if (IsInvalid(split[i]) && !invalidColumnIndexes.Contains(i))
              invalidColumnIndexes.Add(i);
          }
        }
      }
      return invalidColumnIndexes;
    }

    private static bool IsInvalid(string s)
    {
      return false;
    }
  }
}
  • -1 Will almost definitely cause an OOM exception, given the OP said they have a 1GB file to process. – Aron May 08 '14 at 07:10
  • @Aron is there an alternative to buffering the file? – ale May 09 '14 at 07:07
  • You are putting the output into a StringBuilder. That SB should end up approximately as big as the original file. Add to that the inefficiencies of GC and growing that "List", and this should easily munch through your memory. Since you aren't going backwards in your SB at any point, you could just as easily replace your StringBuilder with a StreamWriter. – Aron May 09 '14 at 07:11

First, what you will do is rewrite the line to a text file using a zero-length string for COlumn3. Therefore the line, after being written correctly, would look like this:

COlumn1#COlumn2##COlumn4

As you can see, there are two delimiters between COlumn2 and COlumn4. This is a cell with no data in it. (By "cell" I mean one column of a certain, single row.) Later, when some other process reads this using the Split function, it will still create a new value for Column 3, but in the array generated by Split, the 3rd position would be an empty string:

string[] columns = streamReader.ReadLine().Split('#');
int lengthOfThirdItem = columns[2].Length;  // for proof
// lengthOfThirdItem == 0

This reduces invalid values to empty strings and persists them back in the text file.
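
As a hedged sketch of the blank-out step described above (the junk index and sample line are assumptions carried over from the question, not part of this answer):

using System;

class BlankOutDemo
{
    static void Main()
    {
        int junk = 3;                                   // 1-based index of the invalid column (assumed)
        string line = "COlumn1#COlumn2#COlumn3#COlumn4";

        string[] cells = line.Split('#');
        if (junk - 1 < cells.Length)
            cells[junk - 1] = string.Empty;             // zero-length value; its delimiter is preserved
        string rewritten = string.Join("#", cells);

        Console.WriteLine(rewritten);                   // COlumn1#COlumn2##COlumn4
    }
}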

For more on String.Split see C# StreamReader save to Array with separator.

It is not possible to write to lines in the interior of a text file while it is also open for reading. This question discusses it (simultaneous read-write a file in C#), but it looks like that asker just wants to be able to append lines to the end. You want to be able to write lines at any point in the interior. I think this is not possible without buffering the data in some way.

The simplest way to buffer the data is to rename the file to a temp file first (using File.Move(), http://msdn.microsoft.com/en-us/library/system.io.file.move(v=vs.110).aspx). Then use the temp file as the data source: open the temp file to read in the data, which may have corrupt entries, and write the data afresh to the original file name using the approach I describe above to represent empty columns. After this is complete, you can delete the temp file.
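
A minimal sketch of that rename-then-rewrite flow could look like the following; the file names and the blank-out step are illustrative assumptions:

using System;
using System.IO;

class TempFileRewriteDemo
{
    static void Main()
    {
        const string fileName = "data.txt";             // hypothetical input file
        string tempName = fileName + ".bak";            // hypothetical temp-file name
        int junk = 3;                                   // 1-based index of the invalid column (assumed)

        File.Move(fileName, tempName);                  // rename the original to the temp file

        using (var reader = File.OpenText(tempName))
        using (var writer = new StreamWriter(fileName)) // recreate the original file name
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                string[] cells = line.Split('#');
                if (junk - 1 < cells.Length)
                    cells[junk - 1] = string.Empty;     // blank out the invalid cell
                writer.WriteLine(string.Join("#", cells));
            }
        }

        // Keep tempName as an emergency backup, or delete it once the rewrite succeeds:
        // File.Delete(tempName);
    }
}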

Important

Deleting the temp file may leave you vulnerable to power and data transients (or software 'transients'). (I.e., a power drop that interrupts part of the process could leave the data in an unusable state.) So you may also want to leave the temp file on the drive as an emergency backup in case of some problem.

  • This would break if the row already contains a field with an empty entry, right? I mean, in my rows an empty column need not be an invalid column. (Sorry, I didn't mention that my valid rows can have an empty value in a column, like COlumn1 # #COlumn3#COlumn4.) – snippetkid May 08 '14 at 11:28
  • I would not say that it would actually break. It is hard for me to know for sure without more details on the class or classes involved. For example, I did not know that you already have empty cells. Given the clarification you just made, you have two choices now: switch to another format such as JSON, which labels the fields again for every row, in which case the invalid cell would never appear, or invent a keyword which means "invalid" but would probably never appear in real data; something like the word "invalid" or maybe "i_n_v_a_l_i_d". – philologon May 08 '14 at 15:38
  • In tagged values (JSON or XML), order does not necessarily matter (within a row or record) and presence is not required for empty cells. In delimiter-separated values, order matters and presence is required, because field identification is done by counting delimiters. – philologon May 08 '14 at 16:29