Earlier today I asked a question here about whether my method for scanning files on a computer was correct. I received a few tips, and the one that made me think "this needs to be solved urgently!" concerned running out of memory, since I was reading each file entirely into memory. So I started looking for a way to read the files piece by piece, and I came up with something (wrong/bogus) that I need help getting right. For now the method is as simple as this:

procedure ScanFile(FileName: string);
const
  MAX_SIZE = 100*1024*1024;
var
  i, aux, ReadLimit: integer;
  MyFile: TFileStream;
  Target: AnsiString;
  PlainText: String;
  Buff: array of byte;
  TotalSize: Int64;
begin
  if (POS('.exe', FileName) = 0) and (POS('.dll', FileName) = 0) and
      (POS('.sys', FileName) = 0) then //yeah I know it's not the best way...
    begin
      try
        MyFile:= TFileStream.Create(FileName, fmOpenRead);
      except on E: EFOpenError do
        MyFile:= NIL;
      end;
      if MyFile <> NIL then
      try
        TotalSize:= MyFile.Size;
        while TotalSize > 0 do begin
          ReadLimit:= Min(TotalSize, MAX_SIZE);
          SetLength(Buff, ReadLimit);
          MyFile.ReadBuffer(Buff[0], ReadLimit);
          PlainText:= RemoveNulls(Buff); //this is to transform the array of bytes in string, I posted the code below too...
          for i:= 1 to Length(PlainText) do
            begin //Begin the search..
            end;
          dec(TotalSize, ReadLimit);
        end;
      finally
        MyFile.Free;
      end;
    end;
end;

Code for RemoveNulls is:

function RemoveNulls(const Buff: array of byte): String;
var
  i: integer;
begin
  for i:= 0 to Length(Buff) do
    begin
      if Buff[i] <> 0 then
        Result:= Result + Chr(Ord(Buff[i]));
    end;
end;

OK, the problems I have had with this code so far are:

1- Each time the while loop repeats, more memory is consumed, whereas I expected to use at most 100MB, as set by the MAX_SIZE constant, right?

2- I created a file with 2 occurrences of what should be filtered, and for some unknown reason I got about 10 repeated occurrences. It looks like I'm scanning the file repeatedly.

I appreciate your help, guys. If someone already has this kind of code, please post it here; I don't intend to reinvent the wheel...

user1526124
  • See [`Buffered files (for faster disk access)`](http://stackoverflow.com/a/5639712/576719) by @DavidHeffernan. – LU RD Nov 21 '13 at 19:48
  • That's a big amount of code that I don't understand. I know it may solve my problem, but I'd prefer something simple if possible. Thank you for your attention. – user1526124 Nov 21 '13 at 19:55
  • Re: 2) and 4) that's because you are requesting exactly 100MiB from the file, while you have to request `Min(Count, MAX_SIZE)`. I suggest rewriting it (to keep the exercise simple, do not handle exceptions for now). – Free Consulting Nov 21 '13 at 19:56
  • @FreeConsulting yes, I did now: `while Count > 0 do begin N:= Min(Count, MAX_SIZE); SetLength(Buff, N); MyFile.ReadBuffer(Buff[0], N);` and it solved the 2 and 4... Thank you! – user1526124 Nov 21 '13 at 20:13
  • There's not much simplification to be made for my buffered stream. – David Heffernan Nov 21 '13 at 22:23
  • David, the problem is that I'm a noob at coding, even with these Delphi classes that are 'easy to use'. Your buffered stream is complicated and I don't understand a single line of what's happening. I can't understand why this code keeps consuming more and more memory if I'm using the 100MB limit... I replaced the code in the question with what I have right now. Can you take a look? It looks like every iteration of the while loop consumes more memory; if the file is 1GB+ it gives an out of memory error. Do you know why? Solving this problem is enough for me... Thanks anyway. – user1526124 Nov 21 '13 at 22:33
  • Please don't modify the question as you progress along, "2) and 4)" in the comments do not make much sense now. – Sertac Akyuz Nov 21 '13 at 22:49
  • If you read it all into memory, then for sure you'll run out. But just read a piece at a time. – David Heffernan Nov 21 '13 at 23:01
  • @DavidHeffernan That's what I'm trying to do with `MyFile.ReadBuffer(Buff[0], ReadLimit);` but for some unknown reason it doesn't work. @SertacAkyuz yes, you're right... I'll try to re-edit and bring back 2 and 4... Thanks for the tip. – user1526124 Nov 21 '13 at 23:07
  • MyFile: TFileStream; ... It's in the code above, my friend... – user1526124 Nov 22 '13 at 00:03
  • Anyway, no need for a fancy stream here – David Heffernan Nov 22 '13 at 04:14

1 Answer

I'd say that RemoveNulls is your problem. Suppose that you just read 100MB into a buffer that you passed to RemoveNulls. You would then allocate a string of length 1. Then reallocate to length 2. Then to length 3. Then to length 4. And so on, all the way to length 100*1024*1024.

That process will fragment your memory, as well as being appallingly slow. Heap allocation is to be avoided when performance matters. You've no need for it at all. Read a chunk of the file, and search directly in the buffer that you read.
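
As a rough illustration of "read a chunk and search directly in the buffer", here is a minimal sketch. `CountPattern`, its `ChunkSize` parameter and the program name are hypothetical and not from the question. It assumes the search is for an exact byte sequence, counts overlapping matches, and carries the last `Length(Pattern) - 1` bytes over to the next chunk so that a match straddling a chunk boundary is not missed. Unit names are written unscoped; newer Delphi versions may want `System.Classes` and so on.

```pascal
program ChunkSearchSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Classes, SysUtils, Math;

// Hypothetical helper (not from the question): counts occurrences of Pattern
// in Stream, reading at most ChunkSize bytes per pass into one reused buffer.
function CountPattern(Stream: TStream; const Pattern: TBytes;
  ChunkSize: Integer): Integer;
var
  Buff: TBytes;
  PatLen, Carry, BytesRead, Filled, i: Integer;
begin
  Result := 0;
  PatLen := Length(Pattern);
  if (PatLen = 0) or (ChunkSize <= 0) then
    Exit;
  // Allocated once: room for the carried-over tail plus one chunk.
  SetLength(Buff, PatLen - 1 + ChunkSize);
  Carry := 0;
  repeat
    BytesRead := Stream.Read(Buff[Carry], ChunkSize);
    Filled := Carry + BytesRead;
    // Search in place; no intermediate string is built.
    for i := 0 to Filled - PatLen do
      if CompareMem(@Buff[i], @Pattern[0], PatLen) then
        Inc(Result);
    // Keep the tail that could still be the start of a match.
    Carry := Min(PatLen - 1, Filled);
    if Carry > 0 then
      Move(Buff[Filled - Carry], Buff[0], Carry);
  until BytesRead = 0;
end;

var
  FS: TFileStream;
begin
  if ParamCount < 2 then
  begin
    Writeln('usage: ChunkSearchSketch <file> <ascii-text>');
    Exit;
  end;
  FS := TFileStream.Create(ParamStr(1), fmOpenRead or fmShareDenyWrite);
  try
    Writeln(CountPattern(FS, TEncoding.ASCII.GetBytes(ParamStr(2)), 1024 * 1024));
  finally
    FS.Free;
  end;
end.
```

Memory use stays at roughly one chunk regardless of file size, because the buffer is sized once and reused on every pass.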

There are various problems with your code that I can see:

  1. Your file extension check is broken, as I described in your previous question.
  2. You are not handling exceptions correctly, as I described in your previous question.
  3. Your for loop in RemoveNulls has a buffer overrun. Loop from low() to high().
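
Items 1 and 3 might be addressed along these lines. This is only a sketch: the advice given in the previous question is not reproduced here, so the extension check is one possible reading of it, and `IsSkippedExtension` and `CountNonZero` are hypothetical names. The first function compares the real extension case-insensitively instead of looking for `.exe` anywhere in the path; the second shows a loop bounded by `Low()` and `High()`.

```pascal
program ExtAndBoundsSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: True for files the scan should skip.
// Pos('.exe', FileName) would also match 'my.exe.bak' and miss 'SETUP.EXE'.
function IsSkippedExtension(const FileName: string): Boolean;
var
  Ext: string;
begin
  Ext := ExtractFileExt(FileName);
  Result := SameText(Ext, '.exe') or SameText(Ext, '.dll') or
    SameText(Ext, '.sys');
end;

// Hypothetical helper: visits every element exactly once.
// 'for i := 0 to Length(Buff)' reads one element past the end.
function CountNonZero(const Buff: array of Byte): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := Low(Buff) to High(Buff) do
    if Buff[i] <> 0 then
      Inc(Result);
end;

begin
  Writeln(IsSkippedExtension('Setup.EXE'));
  Writeln(IsSkippedExtension('my.exe.bak'));
  Writeln(CountNonZero([65, 0, 66]));
end.
```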

It's not possible to comment on the search code since that's not present in the question.

David Heffernan
  • David, sorry for the late answer. Yes, I checked, and when I don't call RemoveNulls the program works without the memory leak. Can this be fixed by looping from low to high instead of using length? I'll try to fix the other 2 problems you saw too, but I started with the most critical one. Thanks for the help. – user1526124 Nov 22 '13 at 14:31
  • Fix RemoveNulls problems by removing it and not doing that. You've already got the content in `Buff`. Make your search operate on `Buff`. – David Heffernan Nov 22 '13 at 14:34
  • Yes, I know what you mean, and I already did something like that. But the problem is that I'm dealing with many file types, and some of them have, for example, ABC... Others have A{NUL}B{NUL}C{NUL}... Others **can** have anything like A{NUL}{NUL}B{NUL}C... That's why I'm trying to remove the nulls and keep only the Chr values to make the search. That's why I'm using that bogus function. Any tips on this problem? – user1526124 Nov 22 '13 at 14:37
  • Those files are UTF-16. So process them as UTF-16. Removing the zero bytes is just wrong. But even if you want to ignore nul, even if that was the right thing to do, you should still process `Buff` in place. What are you really trying to do? What will this program do? – David Heffernan Nov 22 '13 at 14:39
  • I'm developing an app to run a security check on many servers that use SQL backups. These backups can't contain sensitive information (at least not in plain text). So I search for this information and show whether it is found. And yes, the files are UTF-16, but how do I process UTF-8, UTF-16, ANSI, Unicode, everything at the same time? – user1526124 Nov 22 '13 at 14:42
  • You have to know how the text is encoded. It cannot always be deduced. Anyway, I think I answered the question that you asked. – David Heffernan Nov 22 '13 at 14:51
  • Yes, you always answer... I just hadn't accepted it because I'm still here discussing. But I'll accept it as the answer... I just need something to remove these nulls. Anyway, this question is finished. But can you help me with removing those nulls? – user1526124 Nov 22 '13 at 14:55
  • I don't think that you should remove the nulls! If you must, here's what you do. Allocate an ansistring the same length as the buffer. Initialise a variable, `strIndx`, to 0. Walk over the buffer. When you find a character that is not null, write it to the string: `inc(strIndx); str[strIndx] := buff[buffIndx];`. When you are done, set the length of `str` to `strIndx`. That way you avoid all the repeated heap allocations. – David Heffernan Nov 22 '13 at 14:59
  • I'll try that. Thank you again... Just one personal question, if you don't mind answering... Did you learn programming in school or on your own? You answer 90% of my questions here. It looks like you know everything. That's great. – user1526124 Nov 22 '13 at 15:01
  • I certainly don't know everything. One reason I spend lots of time here is to learn more. I'm self-taught. Started aged 10. Studied pure maths at university. Fell into a programming job for numerical code. Learnt the rest on the job. – David Heffernan Nov 22 '13 at 15:05
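
The single-allocation approach described in the comments above (allocate once, write non-null bytes, then trim) could look roughly like this. It is a sketch only: `StripNulls` is a hypothetical name, and the cast to `AnsiChar` is an assumption about how the bytes should be interpreted. The commenter's own caveat still applies, since stripping zero bytes is not a correct way to decode UTF-16.

```pascal
program StripNullsSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical replacement for RemoveNulls: one allocation up front and one
// trim at the end, instead of growing the string one character at a time.
function StripNulls(const Buff: array of Byte): AnsiString;
var
  BuffIndx, StrIndx: Integer;
begin
  // Worst case: no byte is null, so the result is as long as the buffer.
  SetLength(Result, Length(Buff));
  StrIndx := 0;
  for BuffIndx := Low(Buff) to High(Buff) do
    if Buff[BuffIndx] <> 0 then
    begin
      Inc(StrIndx);
      Result[StrIndx] := AnsiChar(Buff[BuffIndx]);
    end;
  // Trim to the number of bytes actually written.
  SetLength(Result, StrIndx);
end;

begin
  Writeln(StripNulls([65, 0, 66, 0, 67, 0]));
end.
```

An alternative that avoids stripping altogether is to search the raw bytes for the pattern in each encoding of interest, for example `TEncoding.Unicode.GetBytes('ABC')` for UTF-16LE text.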