I am trying to figure out how to split a file by the number of lines in each file. The files are CSV, so I can't do it by bytes; I need to do it by lines. 20k seems to be a good number per file. What is the best way to read a stream at a given position? Stream.BaseStream.Position? So if I read the first 20k lines, would I start the position at 39,999? How do I know when I am almost at the end of a file? Thanks all
- Have you tried 20k calls to ReadLine? – strager Jul 30 '10 at 17:41
- You shouldn't need to seek at all. You should read it line by line, switching to a new file once you hit 20k. – Fosco Jul 30 '10 at 17:42
- Yeah, after I wrote this and went to get my hair cut, it dawned on me that I can just read it until the end and do a ReadLine. Thanks! – DDiVita Jul 30 '10 at 18:55
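The pattern the comments settle on looks roughly like this (path hypothetical): no position tracking at all, because `StreamReader.ReadLine` returns `null` once the file is exhausted, which answers the end-of-file question.

```csharp
// requires: using System.IO;
using (var reader = new StreamReader(@"C:\data\input.csv"))
{
    string line;

    // ReadLine() returns null at end of file, so no seeking or
    // BaseStream.Position arithmetic is needed to know when you're done
    while ((line = reader.ReadLine()) != null)
    {
        // write the line out, switching files every 20,000 lines
        // (see the answers below for complete versions)
    }
}
```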
3 Answers
```csharp
using (System.IO.StreamReader sr = new System.IO.StreamReader("path"))
{
    int fileNumber = 0;

    while (!sr.EndOfStream)
    {
        int count = 0;

        // open a new output file for each block of lines
        using (System.IO.StreamWriter sw = new System.IO.StreamWriter("other path" + ++fileNumber))
        {
            sw.AutoFlush = true;

            // copy up to 20,000 lines, then fall out to start the next file
            while (!sr.EndOfStream && count++ < 20000)
            {
                sw.WriteLine(sr.ReadLine());
            }
        }
    }
}
```

Jon B
- This seems the most straightforward to me, though for memory's sake I would flush the write buffer with each write. If each line is 100 bytes, that makes 1,000 lines 100 KB, and 20,000 lines 2 MB; not a ton of memory, but an unnecessary footprint. – Jimmy Hoffa Jul 30 '10 at 18:06
- @Jimmy - I added `AutoFlush = true`, which automatically flushes after each write. – Jon B Jul 30 '10 at 18:16
- AutoFlush is a bad idea on a StreamWriter, as it will flush after every single character (I looked at the code). If you don't specify a buffer size when creating a StreamWriter, it defaults to only 128 characters, but that's still better than no buffer at all. – Tergiver Jul 30 '10 at 19:37
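A rough sketch of the alternative Tergiver's comment points toward: drop `AutoFlush` and hand the `StreamWriter` a larger explicit buffer instead, so data is flushed in big chunks (the encoding and the 64 KB size here are illustrative choices, not from the answer):

```csharp
// requires: using System.IO; using System.Text;
// StreamWriter(path, append, encoding, bufferSize): the writer flushes its
// buffer when it fills and again on Dispose, so AutoFlush isn't needed.
using (var sw = new StreamWriter("other path" + fileNumber, false, Encoding.UTF8, 65536))
{
    // ... WriteLine calls exactly as in the answer above ...
}
```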
```csharp
// requires: using System.IO; using System.Linq;
int index = 0;

// File.ReadLines streams lazily; the query buckets every 20,000 lines into one group
var groups = from line in File.ReadLines("myfile.csv")
             group line by index++ / 20000 into g
             select g.AsEnumerable();

int file = 0;
foreach (var group in groups)
    File.WriteAllLines((file++).ToString(), group.ToArray());
```

Muhammad Hasan Khan
- You need to use `File.ReadLines` instead of `ReadAllLines` -- `ReadAllLines` reads it all into memory at once. Also, using `index` in the grouping function like that freaks me out. – mqp Jul 30 '10 at 17:48
- While this is indeed interesting, there are enough cases where you don't want to read an entire file into memory that I would at least add the stipulation that you need to know the files won't be too large if you're going to use this method. – Jimmy Hoffa Jul 30 '10 at 18:03
- Won't the grouping method collect everything regardless of whether you use ReadLines or ReadAllLines? – Lasse V. Karlsen Jul 30 '10 at 18:17
- I assume so, but with `ReadAllLines`, you'd have the whole thing in memory twice instead of once. – mqp Jul 30 '10 at 18:47
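For what it's worth, the side-effecting `index++` that mqp objects to can be avoided with LINQ's indexed `Select` overload; a sketch of the same grouping without the external counter (same file names as the answer, and the same memory caveat applies, since `GroupBy` still buffers):

```csharp
// requires: using System.IO; using System.Linq;
var groups = File.ReadLines("myfile.csv")
    .Select((line, i) => new { line, i })      // pair each line with its index
    .GroupBy(x => x.i / 20000, x => x.line);   // bucket 20,000 lines per group

int file = 0;
foreach (var g in groups)
    File.WriteAllLines((file++).ToString(), g);
```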
I'd do it like this:
```csharp
// requires: using System.Collections.Generic; using System.IO; using System.Linq;
// (as an extension method, SplitEnumerable has to live in a static class)

// helper method to break a sequence up into blocks lazily
public static IEnumerable<ICollection<T>> SplitEnumerable<T>(
    this IEnumerable<T> sequence, int nbrPerBlock)
{
    List<T> block = new List<T>(nbrPerBlock);

    foreach (T value in sequence)
    {
        block.Add(value);

        if (block.Count == nbrPerBlock)
        {
            yield return block;
            block = new List<T>(nbrPerBlock); // start a fresh block
        }
    }

    if (block.Any()) yield return block; // flush out any remaining lines
}

// now it's trivial; if you want to make smaller files, just foreach
// over this and write out the lines in each block to a new file
public static IEnumerable<ICollection<string>> SplitFile(string filePath)
{
    return File.ReadLines(filePath).SplitEnumerable(20000);
}
```
Is that not sufficient for you? You mention moving from position to position, but I don't see why that's necessary.
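To round it out, a small usage sketch that consumes `SplitFile` and writes each block to its own file (the `partN.csv` naming and input path are just an illustration):

```csharp
// requires: using System.IO;
int fileNumber = 0;
foreach (var block in SplitFile(@"C:\data\input.csv"))
{
    // each block holds at most 20,000 lines
    File.WriteAllLines("part" + (fileNumber++) + ".csv", block);
}
```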

mqp