
I posted a question earlier about reading a file faster by skipping specific lines, but that does not seem to be possible with the standard C++ APIs.

I researched further and learned that memory-mapped files could come in handy for this kind of case. Details about memory-mapped files are here.

Suppose the file (file.txt) looks like this:

A quick brown fox 
// Blah blah
// Blah blah
jumps over the little lazy dog

Then, in code, I open the file, map it into memory, and iterate over the contents through the char* pointer, without going through the file pointer itself. I wanted to give it a try before reaching a conclusion. The skeleton of my code looks like this:

struct stat filestat;
FILE *file = fopen("file.txt", "r");
if (file == NULL) {
  std::cout << "FAILED with fopen" << std::endl;
  return FALSE;
}
if (-1 == fstat(fileno(file), &filestat)) {
  std::cout << "FAILED with fstat" << std::endl;
  fclose(file);
  return FALSE;
} else {
  // mmap reports failure with MAP_FAILED, not with a null pointer
  char* data = (char*)mmap(0, filestat.st_size, PROT_READ, MAP_PRIVATE, fileno(file), 0);
  if (data == MAP_FAILED) {
    std::cout << "FAILED with mmap" << std::endl;
    fclose(file);
    return FALSE;
  }
  // Filter out 'data'
  // for (off_t i = 0; i < filestat.st_size; ++i) {
  //   Do something here..
  // }

  munmap(data, filestat.st_size);
  fclose(file);
  return TRUE;
}

In this case, I want to capture the lines which do not start with //. Since this file (file.txt) is already memory mapped, I could walk over the data pointer and filter out the lines. Am I correct in doing so?

If so, what is an efficient way to parse the lines?

Hemant Bhargava
  • Everything you have posted in either question shows you still read everything. You are not skipping lines, because lines don't exist in the file metadata. getline still reads the full line. Loading the file into memory still reads the whole file (although it might be faster, because it can read in optimal chunks). The correct and efficient way to parse the lines still involves looking at every single character. – Kenny Ostrom Dec 03 '18 at 13:09
  • @KennyOstrom, Yes. Agree. But in this case, I would skip dealing with file pointers and deal with the whole file. You are right that it might be faster due to chunk read resulting in a faster speed. Right? Question was : what should I write to parse the lines efficiently to get desired output. – Hemant Bhargava Dec 03 '18 at 17:17
  • @HemantBhargava: The buffering used in normal text I/O is precisely to get that “chunk read” speed anyway. – Davis Herring Dec 03 '18 at 22:46
  • I'm convinced you already had the most efficient way to read the file, before all this, and your refusal to just run a profiler and find the real problem is the problem. The idea that getline is causing you so much trouble that you have to go learn memory mapping techniques is ... not plausible. In big-O terms, that's irrelevant. (also this is what you were told in the other question). You can't do better than O(n) unless of course you are re-parsing a file you have already read -- if that's true, you can store the file-position of each non-comment line in an index file. – Kenny Ostrom Dec 04 '18 at 00:18
  • @KennyOstrom, I ran my own profiling and know that reading the file itself is the culprit. I have tried the various methods described at: http://insanecoding.blogspot.com/2011/11/how-to-read-in-file-in-c.html but none of them gives me a good runtime. That is the reason I chose mmap. I am reading the file for the first time and understand that I cannot do better than O(n). My problem is down to writing a piece of code which can iterate over the memory and get the lines which I want. – Hemant Bhargava Dec 04 '18 at 04:25
  • In which case, do you know if it is the sheer size of the file, or the memory usage to store the lines while reading? I apologize for being a little blunt earlier. – Kenny Ostrom Dec 04 '18 at 13:34
  • Is this question just how do you parse text in memory? char* and ++ – Kenny Ostrom Dec 04 '18 at 20:16
  • @KennyOstrom, Yes, now the problem is down to "How to parse text from memory". Text is a stream of characters. I could use getline() but do not want to do that for runtime reasons. Something like this: https://stackoverflow.com/questions/13535672/read-a-file-line-by-line-with-mmap?rq=1 – Hemant Bhargava Dec 05 '18 at 05:03
  • That's a whole different question. It's been answered. Use getline like it says. If that is too slow, then too bad, you can't read the file without reading the file. Try timing it with just getline and don't do anything with it (in case you are slowing it down a lot doing other stuff but not realizing it). Don't store it. Don't print it. Nothing. – Kenny Ostrom Dec 05 '18 at 14:12

1 Answer


Reading selected lines from any source and copying them to any destination can be done with the C++ standard algorithms.

You can use std::copy_if. It copies elements from a source range to a destination, but only those for which the predicate returns true.

Here is a simple example that reads lines from a file and skips all lines starting with "//". The result is stored in a vector.

The copying is done in a single statement that calls a single function, a classic one-liner.

For debugging purposes, I print the result to the console.

#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>
#include <string>
#include <fstream>

using LineBasedTextFile = std::vector<std::string>;

class CompleteLine {    // Proxy for the input Iterator
public:
    // Overload extractor. Read a complete line
    friend std::istream& operator>>(std::istream& is, CompleteLine& cl) { std::getline(is, cl.completeLine); return is; }
    // Cast the type 'CompleteLine' to std::string
    operator std::string() const { return completeLine; }
protected:
    // Temporary to hold the read string
    std::string completeLine{};
};

int main()
{
    // Open the input file
    std::ifstream inputFile("r:\\input.txt");
    if (inputFile)
    {
        // This vector will hold all lines of the file
        LineBasedTextFile lineBasedTextFile{};
        // Read the file and copy all lines that fulfill the required condition into the vector of lines.
        // compare(0, 2, "//") checks only the first two characters instead of searching the whole line.
        std::copy_if(std::istream_iterator<CompleteLine>(inputFile), std::istream_iterator<CompleteLine>(), std::back_inserter(lineBasedTextFile),
            [](const std::string& s) { return s.compare(0, 2, "//") != 0; });
        // Print vector of lines
        std::copy(lineBasedTextFile.begin(), lineBasedTextFile.end(), std::ostream_iterator<std::string>(std::cout, "\n"));
    }
    return 0;
}
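
If the data is already in memory (for example through mmap, as in the question), the same filtering can also be done directly on the buffer. The following is a minimal sketch, assuming C++17 for std::string_view. filterLines is a hypothetical helper name chosen for illustration, and the buffer is assumed to use '\n' as the line separator.

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical helper (not from the question): collects views of all lines
// in [data, data + size) that do not start with "//".
// The views point into the buffer and copy nothing, so they are only valid
// while the buffer (for example, the mmap'ed region) stays alive.
std::vector<std::string_view> filterLines(const char* data, std::size_t size)
{
    std::vector<std::string_view> lines;
    const char* cur = data;
    const char* const end = data + size;
    while (cur < end) {
        // Find the end of the current line; memchr returns nullptr if the
        // last line has no trailing newline.
        const char* nl = static_cast<const char*>(
            std::memchr(cur, '\n', static_cast<std::size_t>(end - cur)));
        const char* lineEnd = nl ? nl : end;
        std::string_view line(cur, static_cast<std::size_t>(lineEnd - cur));
        // Keep the line unless its first two characters are "//"
        if (line.substr(0, 2) != "//") {
            lines.push_back(line);
        }
        cur = nl ? nl + 1 : end;
    }
    return lines;
}
```

With the mapping from the question, this would be called as filterLines(data, filestat.st_size) before munmap. Every character is still visited once, so this remains O(n). The only saving compared to the stream version is that no line is copied into a std::string.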

I hope this helps

A M