
I have a large file containing strings. I need to read this file and store its contents in a buffer using C or C++. I tried the following:

FILE* file = fopen(fileName.c_str(), "r");
assert(file != NULL);
size_t BUF_SIZE = 10 * 1024 * 1024;
char* buf = new char[BUF_SIZE];
string contents;
while (!feof(file))
{
    int ret = fread(buf, BUF_SIZE, 1, file);
    assert(ret != -1);
    contents.append(buf);
}

The file contains strings, and I have to find the character with the maximum frequency. Can this code be optimized further? Would using BinaryReader improve performance? Please share any other approaches you know of.

Rahul Nori
xav xav
  • You cannot use `new` in C; change it to `malloc`, and take a look at [Why is “while ( !feof (file) )” always wrong?](http://stackoverflow.com/questions/5431941/why-is-while-feof-file-always-wrong) – David Ranieri Oct 08 '15 at 07:03
  • I think OP has already asked a similar question here - http://stackoverflow.com/questions/33007156/the-best-optimal-way-to-find-the-frequency-in-a-very-very-long-string – Nishant Oct 08 '15 at 07:07
  • @Nishant thanks for the comment, but please read the question: it asks how to read a file in an optimised way and store it in a buffer, not how to find the frequency of a string. – xav xav Oct 08 '15 at 07:10
  • *"i have to read this file and store in buffer in C or C++."* - why exactly? That's highly undesirable if your ultimate goal is simply *" to find the character with maximum frequency"*... you're better off reading chunks to a decent-but-not-excessively-sized buffer (e.g. 16k), processing the data therein to update your frequency counters, then reading the next chunk to the same buffer. The *"updating frequency counters"* logic is much the same as your question Nishant linked. The C++ Standard Library doesn't have a "BinaryReader" - please name the library when mentioning other functions. – Tony Delroy Oct 08 '15 at 07:32
  • Let's assume your goal is to read the whole contents of the file into memory (which is not always wise, files can be much larger than RAM). First you'd want to know the size of the file (use fseek), allocate your string to the correct length, then call fread accordingly, using raw pointers to read directly into the string, to minimise copies. You may also want to consider whether the string class is the appropriate data type to store this in, because the contents of the file may include invalid characters, or NULL characters, which would confuse string operations. – gigaplex Oct 08 '15 at 07:36
  • Well you could try reading the file using more threads, if your hardware has more than one core that would definitely be faster. – Marco Oct 08 '15 at 07:39
  • The loop should be `while((count=fread(...)) > 0)`. You should **not** call `contents.append(buf)`, that's just a huge waste of time. Just process the bytes in `buf` directly. Finally, using chunk sizes bigger than 128K doesn't do any good on any system that I'm aware of, since cache lines, virtual memory pages, disk clusters, and flash sectors are typically less than or equal to 128K in size. – user3386109 Oct 08 '15 at 07:40
  • @gigaplex: `std::string` is fully capable of storing and handling any binary content including NULs... it doesn't get confused. Some people do recommend `std::vector` instead to highlight the nature of the content, but IMHO that's less convenient and not a net win. – Tony Delroy Oct 08 '15 at 08:11
  • @Marco: *"reading the file using more threads, if your hardware has more than one core that would definitely be faster"* - *definitely* is wrong; it *can* be sometimes - e.g. when the file content's on a striped RAID array and many physical disks can supply the data, but often it's slower because a single disk driver's pulled from one thread's read location to another and back repeatedly, spending less time streaming actual data. – Tony Delroy Oct 08 '15 at 08:14
  • @TonyD While `std::string` itself might be able to store the data just fine, some functions that expect "normal" string data might misbehave when they make assumptions on the data. Personally I'd take the `std::vector` approach for clarity. It was just a suggestion to reconsider, not a requirement. – gigaplex Oct 12 '15 at 01:20
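
The suggestions in the comments above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper names `count_bytes`, `most_frequent` and `read_whole_file` are hypothetical, and the 16 KB chunk size is an assumption taken from the "e.g. 16k" remark. The first function loops on the return value of `fread` instead of `feof` and updates the counters straight from the buffer. The second sizes a `std::string` with `fseek`/`ftell` and reads into it with one call, assuming the file is a regular file that fits in memory. Both call `fread` with an element size of 1, so the return value is the number of bytes read. The question's `fread(buf, BUF_SIZE, 1, file)` instead returns a count of whole 10 MB items (0 or 1), and `contents.append(buf)` treats `buf` as a NUL-terminated C string.

```cpp
#include <array>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: tallies how often each byte value occurs in the file,
// reading it in fixed-size chunks rather than loading it all into memory.
// Returns false if the file cannot be opened or a read error occurs.
bool count_bytes(const std::string& path, std::array<std::size_t, 256>& freq)
{
    freq.fill(0);
    std::FILE* file = std::fopen(path.c_str(), "rb");
    if (file == nullptr)
        return false;

    constexpr std::size_t kChunk = 16 * 1024; // assumed chunk size
    unsigned char buf[kChunk];
    std::size_t n;
    // With an element size of 1, fread returns the number of bytes read;
    // 0 means end of file or an error.
    while ((n = std::fread(buf, 1, sizeof buf, file)) > 0)
        for (std::size_t i = 0; i < n; ++i)
            ++freq[buf[i]];

    const bool ok = std::ferror(file) == 0;
    std::fclose(file);
    return ok;
}

// Hypothetical helper: returns the most frequent byte value
// (the lowest value wins ties).
unsigned char most_frequent(const std::array<std::size_t, 256>& freq)
{
    std::size_t best = 0;
    for (std::size_t c = 1; c < freq.size(); ++c)
        if (freq[c] > freq[best])
            best = c;
    return static_cast<unsigned char>(best);
}

// Hypothetical helper: reads the whole file into `out` with a single fread,
// after sizing the string from fseek/ftell. Assumes a regular file that
// fits in memory.
bool read_whole_file(const std::string& path, std::string& out)
{
    std::FILE* file = std::fopen(path.c_str(), "rb");
    if (file == nullptr)
        return false;

    bool ok = false;
    if (std::fseek(file, 0, SEEK_END) == 0) {
        const long size = std::ftell(file);
        if (size >= 0 && std::fseek(file, 0, SEEK_SET) == 0) {
            out.resize(static_cast<std::size_t>(size));
            ok = size == 0 ||
                 std::fread(&out[0], 1, out.size(), file) == out.size();
        }
    }
    std::fclose(file);
    return ok;
}
```

If the only goal is the most frequent character, `count_bytes` followed by `most_frequent` should be enough, and `read_whole_file` is needed only when the full contents must stay in memory.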

0 Answers