0

I'm new to C++ and probably have a silly question. I have an ifstream which I'd like to split approximately in half.

The file in question is a sorted csv and I wish to search on the first value of each line of the file.

Eventually the file will be very large so I am trying to avoid having to read every line of the file.

e.g.

If the file contains 7 lines I'd like to split the ifstream to give 1 stream containing the first 3 lines and 1 stream containing the last 4 lines.

Dunc
  • 7,688
  • 10
  • 31
  • 38
  • Since your description is practically the whole algorithm, I assume you need something more. Perhaps you can tell us how you plan to use this? – Codie CodeMonkey Dec 22 '11 at 11:05
  • ¤ Sometimes a really baffling question just means we fail to envision the prison of thought that the one asking the question has been placed in. Perhaps you're thinking about how to count the lines so as to first output 3 of them and then 4. Well listen to the old dog: just alternately output 1 line to one of the streams, and 1 line to the other stream, it's as simple as that. Cheers & hth., – Cheers and hth. - Alf Dec 22 '11 at 11:12
  • Rather than the expense of splitting streams, can't you just seek to an appropriate position in the same stream? – GazTheDestroyer Dec 22 '11 at 11:26
  • Thaks Gaz, that makes much more sense. – Dunc Dec 22 '11 at 11:34

2 Answers2

1

First, use the answer to this question to determine the size of your file. Then divide that number by two. Read the input line by line, and write it to the first output stream; check file.tellg() after each call. Once you're past the half-way point, switch the output to the second file.

This wouldn't split the strings evenly between the files, but the total number of characters in these strings should be close enough, and it wouldn't split your file in the middle of a string.

Community
  • 1
  • 1
Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523
  • @Dunc You could repeatedly read data into a buffer to avoid reading the file line-by-line. Then you could find the `'\n'` closest to the middle of the buffer once your `tellg` goes past the midpoint of the file. – Sergey Kalinichenko Dec 22 '11 at 11:32
  • @Dunc: in your question you talked about lines, having about the same number of lines in the two result streams. Now you say you don't want to read line by line. That is a trivial self-contradiction. – Cheers and hth. - Alf Dec 22 '11 at 13:21
0

Think of it as a relational database with one huge table. In order to find a certain piece of data, you can either do a sequential scan over the entire table, or use an index (which must be usable for the type of query you want to perform).

A typical index for a text file would be a list of offsets inside the file, sorted by the index expression. If the csv file is sorted by a specific column already, then the offsets in the index would be ascending, which is useful to know when building the index.

So basically you have to read the file once anyway, to find out where lines end; this is the index for the sort column. To find a particular element, use a binary search, using the index to find individual elements in the data set.

Depending on the data type, you can extend your index to allow for quick comparison without reading the actual data table. For example, in a word list you could keep the first four letters of the word next to the offset, which allows you to get into the right area quickly and only requires data reads for the last accesses (which you can then optimize to a sequential scan, as filesystems handle that a lot better).

The same technique can be applied to the other columns as well; the offsets stored in the index would no longer be ascending in file order, of course.

Since it is CSV data, a special case also applies: If the only index you have is in the same order as the file data itself and the end of record can be determined easily (that is, either you have a fixed record length, or there is a clear record separator, such as an EOL character), then building the actual index can be omitted and the values guessed (for fixed length records, offset is always equal to record length times offset in the index; for separated records you can just jump into the middle of a record and seek for the next terminator; be aware that there are nasty corner cases with binary search here). This does however mean that you will always be reading data pages here, which is less efficient than just reading the index.

Simon Richter
  • 28,572
  • 1
  • 42
  • 64