I would like to find the first occurrence of an ANSI string in a binary file, using C++.

I know the string class has a handy find function, but I don't know how I can use it if the file is big, say 5-10 MB.

Do I need to copy the whole file into a string in memory first? If yes, how can I be sure that none of the binary characters get corrupted while copying?

Or is there a more efficient way to do it, without the need for copying it into a string?

hyperknot
  • possible duplicate of [How to delete parts from a binary file in C++](http://stackoverflow.com/questions/6447688/how-to-delete-parts-from-a-binary-file-in-c) – GWW Jun 22 '11 at 23:43
  • You already asked essentially the same question – GWW Jun 22 '11 at 23:44
  • You can use a memory mapped file. You can use Boost: http://www.boost.org/doc/libs/1_46_1/libs/iostreams/doc/classes/mapped_file.html#mapped_file, or StlSoft: http://www.stlsoft.org/doc-1.9/classplatformstl_1_1memory__mapped__file.html – Pablo Jun 22 '11 at 23:45
  • @GWW: I have been told that I should split it into multiple questions. – hyperknot Jun 22 '11 at 23:46
  • @GWW: others told him at the other question that he should break the question into multiple questions, **and** at the other question he even linked to this one. – Loduwijk Jun 22 '11 at 23:52

3 Answers

5

> Do I need to copy the whole file into a string in memory first?

No.

> Or is there a more efficient way to do it, without the need for copying it into a string?

Of course: open the file with a `std::ifstream` (be sure to open it in binary mode rather than text mode), create a pair of `multi_pass` iterators (from Boost.Spirit) around the stream, then search for the string with `std::search`.

ildjarn
  • Good answer! I can't imagine a more effective way. I was going to suggest using the `fstream::read()` function and then searching the read buffer manually, but my way is more difficult. – George Gaál Jun 22 '11 at 23:51
  • Nice solution. The only potential problem I see is that this will read the file in chunks of size equal to the stream's buffer size. Can this buffer size be adjusted? It might be more efficient to read larger chunks at a time. – HighCommander4 Jun 22 '11 at 23:52
  • @HighCommander4 : Good question. It may be possible by supplying a separately constructed `std::filebuf` instance, but to be frank, iostreams are not my area of expertise. – ildjarn Jun 22 '11 at 23:56
  • This invokes undefined behavior because `std::search` expects forward iterators or better, and `std::istream_iterator` is an input iterator. See my question about it [here](http://stackoverflow.com/questions/6449266/can-input-iterators-be-used-where-forward-iterators-are-expected). – Benjamin Lindley Jun 23 '11 at 04:33
  • @Benjamin : Totally correct, not sure how I forgot that when writing this answer. >_> Answer edited. – ildjarn Jun 23 '11 at 04:42
  • Cool, I did not know about the multi pass iterator +1. Spirit seems like the wrong place for that, even though obviously that can prove useful in parsing files. – Benjamin Lindley Jun 23 '11 at 05:05
  • @Benjamin : It was originally developed for Spirit 1.x, but I definitely agree, it's a shame it hasn't been put into a more general library like Boost.Iterator or Boost.Utility. – ildjarn Jun 23 '11 at 05:16
2

First of all, don't worry about corrupted characters. (But don't forget to open the file in binary mode either!) Now, suppose your search string is n characters long. Then you can search the whole file a block at a time, as long as you make sure to keep the last n-1 characters of each block to prepend to the next block. That way you won't lose matches that occur across block boundaries. So you can use that handy find function without having to read the whole file into memory at once.
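The block-at-a-time idea above can be sketched roughly as follows. This is only an illustration, not a definitive implementation: the function name `find_in_file`, the `block_size` parameter and the `-1` "not found" convention are assumptions made for the example. It uses `std::search` on a raw byte buffer instead of `std::string::find`, and because the buffer always holds the carried-over n-1 bytes plus a fresh block, the search string may be longer than one block.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the byte offset of the first occurrence of
// `needle` in the file at `path`, or -1 if it is absent or the file cannot
// be opened. `block_size` is an illustrative tuning knob.
long long find_in_file(const std::string& path, const std::string& needle,
                       std::size_t block_size = 1 << 16)
{
    assert(block_size > 0);
    if (needle.empty()) return 0;

    std::ifstream in(path, std::ios::binary);  // binary mode: bytes arrive unmodified
    if (!in) return -1;

    const std::size_t overlap = needle.size() - 1;  // the "last n-1 characters"
    std::vector<char> buf(overlap + block_size);
    std::size_t carried = 0;  // bytes kept from the previous block
    long long base = 0;       // file offset corresponding to buf[0]

    while (true) {
        in.read(buf.data() + carried, static_cast<std::streamsize>(block_size));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) break;  // end of file (or read error)

        const std::size_t filled = carried + got;
        const char* first = buf.data();
        const char* last = first + filled;
        const char* hit = std::search(first, last, needle.begin(), needle.end());
        if (hit != last) return base + (hit - first);

        // Move the tail to the front so a match that straddles the block
        // boundary is still seen on the next iteration.
        const std::size_t keep = std::min(overlap, filled);
        std::memmove(buf.data(), last - keep, keep);
        base += static_cast<long long>(filled - keep);
        carried = keep;
    }
    return -1;
}
```

A call such as `find_in_file("data.bin", "needle")` (file name hypothetical) would then return the offset of the first match without ever holding more than one block plus the overlap in memory.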

TonyK
  • I think zsero should specify whether there is a limit on the length of the target. If he uses the contents of a text file as the target and searches for that content in another file, the target itself could easily be longer than a couple of blocks, in which case this will always fail. If he can guarantee the target is small, then this is a good optimization. – Loduwijk Jun 22 '11 at 23:56
0

If you can `mmap` the file into memory, you can avoid the copy.
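As a rough sketch of what that might look like, assuming a POSIX system (`open`, `fstat`, `mmap`, `munmap`): the function name `find_in_mapped_file` and the `-1` "not found" convention are made up for the example, and Windows would need a different API or a wrapper such as the Boost one mentioned in the comments on the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (POSIX only): maps the file read-only and searches the
// mapped bytes directly. Returns the offset of the first match, or -1 if the
// needle is absent, the file is empty, or any system call fails.
long long find_in_mapped_file(const char* path, const std::string& needle)
{
    assert(path != nullptr);

    const int fd = ::open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (::fstat(fd, &st) != 0 || st.st_size == 0) {  // mmap rejects length 0
        ::close(fd);
        return -1;
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);

    void* addr = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping remains valid after the descriptor is closed
    if (addr == MAP_FAILED) return -1;

    const char* first = static_cast<const char*>(addr);
    const char* last = first + size;
    const char* hit = std::search(first, last, needle.begin(), needle.end());
    const long long result = (hit == last) ? -1 : (hit - first);

    ::munmap(addr, size);
    return result;
}
```

The operating system pages the file in on demand, so no explicit copy into a `std::string` is made and the bytes are searched exactly as they are stored on disk.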

lhf