
I have a system that loads text files that are zipped inside a ".log" file and parses them into informational classes, using multiple threads that each handle a different file and add the parsed objects to a list. Each file is loaded using TStringList, since that was the fastest method I tested.

The number of text files is variable, but normally I have to deal with something between 5 and 8 files, ranging from 50 MB to 120 MB, in a single run.

My problem: the user can load the .log files as many times as they like, and after some of these runs I get an EOutOfMemory exception when calling TStringList.LoadFromFile. Of course, the first thing that comes to mind to anyone who has ever used a string list is that you should not use it for big text files, but this exception happens randomly, and only after the process has already completed successfully at least once (the objects are destroyed before a new parse starts, so the memory is reclaimed correctly, apart from some minor leaks).

I tried using TextFile and TStreamReader, but neither is as fast as TStringList, and the duration of the process is the greatest concern with this feature.
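
For reference, the TStreamReader variant I tried looked roughly like this (a simplified sketch, not the exact code; TLogCommand is the class used in the thread code below):

procedure LoadWithStreamReader(const APath: string; AList: TList<TLogCommand>);
var
  tmpReader: TStreamReader;
  tmpLine: string;
begin
  // Reads the file line by line without keeping it all in memory,
  // but in my tests this was noticeably slower than TStringList.LoadFromFile.
  tmpReader := TStreamReader.Create(APath);
  try
    while not tmpReader.EndOfStream do
    begin
      tmpLine := tmpReader.ReadLine;
      if tmpLine <> '' then
        AList.Add(TLogCommand.Create(tmpLine));
    end;
  finally
    tmpReader.Free;
  end;
end;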

I'm using Delphi 10.1 Berlin. The parse process is a simple iteration through the list of variable-length lines, constructing objects based on each line's information.

Essentially, my question is: what is causing this and how can I fix it? I am open to other ways of loading the file and reading its contents, but it must be as fast as (or faster than) the TStringList method.

Loading thread execute code:

TThreadFactory = class(TThread)
  protected
    // Class that holds the list of commands already parsed; it is owned outside of the thread
    _logFile: TLogFile;
    _criticalSection: TCriticalSection;
    // Path of the .txt file this thread parses (assigned elsewhere in the real code)
    _path: string;
    _error: string;

    procedure Execute; override;
    destructor Destroy; override;

  public
    constructor Create(AFile: TLogFile; ASection: TCriticalSection); overload;

    property Error: string read _error;
  end;

implementation

{ TThreadFactory }

constructor TThreadFactory.Create(AFile: TLogFile; ASection: TCriticalSection);
begin
  inherited Create(True);
  _logFile := AFile;
  _criticalSection := ASection;
end;

procedure TThreadFactory.Execute;
var
  tmpLogFile: TStringList;
  tmpConvertedList: TList<TLogCommand>;
  tmpCommand: TLogCommand;
  tmpLine: string;
  i: Integer;
begin
  try
    try
      tmpConvertedList := TList<TLogCommand>.Create;

      if (_path <> '') and not Terminated then
      begin
        try
          tmpLogFile := TStringList.Create;
          tmpLogFile.LoadFromFile(_path);

          for tmpLine in tmpLogFile do
          begin
            if Terminated then
              Break;

            if tmpLine <> '' then
            begin
              // the parsing logic was simplified here, it is just this
              tmpConvertedList.Add(TLogCommand.Create(tmpLine));
            end;
          end;
        finally
          tmpLogFile.Free;
        end;
      end;

      _criticalSection.Acquire;

      _logFile.AddCommands(tmpConvertedList);
    finally
      _criticalSection.Release;

      FreeAndNil(tmpConvertedList);
    end;
  except
    on e: Exception do
      _error := e.Message;
  end;
end;

end.

Added: thank you for all your feedback. I will address some issues that were discussed but that I failed to mention in my initial question.

  • The .log file contains multiple .txt files, but it can also contain other .log files. Each file represents a day's worth of logging or a period selected by the user. Since decompression takes a lot of time, a thread is started every time a .txt is found so parsing can begin immediately; this has noticeably shortened the waiting time for the user.

  • The "minor leaks" are not shown by ReportMemoryLeaksOnShutdown and other methods like TStreamReader avoid this issue

  • The list of commands is held by TLogFile. There is only one instance of this class at any time, and it is destroyed whenever the user wants to load a new .log file. All threads add commands to the same object; that is the reason for the critical section (a simplified sketch of this follows after the list).

  • I can't detail the parse process since it would disclose some sensitive information, but it is simply a matter of gathering information from the string into the command object.

  • Since the beginning I was aware of fragmentation, but I never found concrete proof that TStringList causes fragmentation merely by loading files multiple times; if this can be confirmed, I would be glad to know.
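
To make the shared-list point concrete, the relevant part of TLogFile looks roughly like this (a simplified, hypothetical sketch; the real class has more members and I have renamed nothing else):

TLogFile = class
private
  _commands: TList<TLogCommand>;  // the single shared list of parsed commands
public
  constructor Create;
  destructor Destroy; override;

  // Called by every loading thread while it holds the shared critical section
  // (see the Execute method above), so the method itself just appends.
  procedure AddCommands(ACommands: TList<TLogCommand>);
end;

constructor TLogFile.Create;
begin
  inherited Create;
  _commands := TList<TLogCommand>.Create;
end;

destructor TLogFile.Destroy;
var
  tmpCommand: TLogCommand;
begin
  // The TLogCommand instances are owned here, so free them before the list.
  for tmpCommand in _commands do
    tmpCommand.Free;
  _commands.Free;
  inherited;
end;

procedure TLogFile.AddCommands(ACommands: TList<TLogCommand>);
begin
  _commands.AddRange(ACommands);
end;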

Thank you for your attention. I ended up using an external library that can read lines and load files at the same speed as TStringList without needing to load the whole file into memory:

https://github.com/d-mozulyov/CachedTexts/tree/master/lib

TioGuedes
  • The solution is obvious. Don't read the entire file into memory. – David Heffernan Oct 24 '17 at 13:36
  • I suspect the file itself is not the problem, but rather what is done with that file... specifically "Insert tmpConvertedList into a commands list". But there is not enough information to tell for sure. The above code does not provide a proper [mcve] – Dalija Prasnikar Oct 24 '17 at 13:39
  • Also, while there is a claim that objects are destroyed correctly, we cannot confirm it... and "apart from some minor leaks" can cause memory fragmentation and prevent further processing of large amounts of data. – Dalija Prasnikar Oct 24 '17 at 13:42
  • @DavidHeffernan Sorry if I was not clear enough: during the process, having the whole file in memory (multiple files at the same time, actually) is not the problem. After the parse is over, the user can just do it all again using the same or any other log file, and during those new attempts I get the exception randomly, at the first created thread or at the 6th, at a time when I do have enough memory. – TioGuedes Oct 24 '17 at 13:45
  • @DalijaPrasnikar I will change my code so it can be read more easily, thanks for the tip – TioGuedes Oct 24 '17 at 13:48
  • It seems to me that fragmentation due to loading entire files into memory is a likely cause of the problem. But yeah, we are all guessing. Your requirement that any solution must be at least as fast as your current code that does not work is odd. If you are content with code that doesn't work, why are you asking at all? Anyway, if you want something more than our speculation, create a [mcve]. – David Heffernan Oct 24 '17 at 13:56
  • @DavidHeffernan I am fully aware that the code does not work correctly, and I was against using TStringList; my solution used TStreamReader and I had no problems or leaks whatsoever, but my superiors want me to figure out how to handle this with TStringList since it is almost 10 seconds faster in our common test case scenario. The "leaks" are not shown by "ReportMemoryLeaksOnShutdown", but it is visible that each iteration of the process uses some more memory (around 8 MB). This is not seen using TMemoryStream. – TioGuedes Oct 24 '17 at 14:14
  • `TStreamReader` is not very performant, IIRC. Use a better line reader implementation. It's disappointing that you didn't mention any of this in your post. Smells very much like fragmentation. However, if your superiors are demanding that you continue to use `TStringList`, I think you may want to spend some time with your CV. – David Heffernan Oct 24 '17 at 14:21
  • Your problem aside, I'd suggest you review the proper use of try-finally... – Ken Bourassa Oct 24 '17 at 14:27
  • 1) `TStringList` is a slow class per se. Try an array - `System.IOUtils.TFile.ReadAllLines` 2) Read about Heap Memory Fragmentation, for example https://en.wikipedia.org/Heap_fragmentation 3) Find some library that reads huge text files without loading them entirely. Maybe https://github.com/d-mozulyov/CachedTexts – Arioch 'The Oct 24 '17 at 14:31
  • Also - try to remove multithreading and run your program in single-threaded mode. Maybe you have a leak triggered by multithreading; test for it. Also, since HDD speed is usually much slower than CPU speed, there is rarely any speed benefit in multithreaded HDD read/write programs, and the complexity (chance of error) increases hugely. – Arioch 'The Oct 24 '17 at 14:33
  • `parse them into informational classes using multiple threads` also - those classes may have SMP-related errors too. For example, why would you use an external critical section (which by mistake may be different in different threads; it is not bound to the receiving list) instead of just using the good old `TThreadList` and avoiding explicit critical sections at all? Also, it is totally unclear to me whether anyone bothers to free all those `TLogCommand` objects you create – Arioch 'The Oct 24 '17 at 14:40
  • @DalijaPrasnikar indeed, this "apart from some minor leaks" sounds like a sick joke. I wonder if there are (or even can be, for the SMP case) heap managers with a Mark/Release model instead of Alloc/Free, like that old exotic mode in Turbo Pascal – Arioch 'The Oct 24 '17 at 14:43
  • @KenBourassa Could you expand on your comment? What is wrong with my try-finally? Not disagreeing, just want to improve – TioGuedes Oct 24 '17 at 16:06
  • @DavidHeffernan Indeed `TStreamReader` has very slow reading performance, and reading the lines is the bulk of my processing time. I have tried using `TextFile` and `TStreamReader` but they yield slower results. Which line reader implementation would you (any of you, actually) use in this case so that I don't lose much performance? – TioGuedes Oct 24 '17 at 16:13
  • You'd probably need to roll your own. I personally use `TStringReader`, but load the string myself from a file, can't remember why. When you used `TStreamReader` how did you read the text? – David Heffernan Oct 24 '17 at 16:22
  • @DavidHeffernan loaded the file with `TStreamReader.Create('FilePath')` and read using `TStreamReader.ReadLine` inside a loop – TioGuedes Oct 24 '17 at 16:42
  • Try using `TStringReader` instead. You'll first have to read the entire file into a string. – David Heffernan Oct 24 '17 at 16:44
  • @DavidHeffernan I will try, thanks. Will try also CachedTexts that was recommended by Arioch – TioGuedes Oct 24 '17 at 16:51
  • @Arioch'The I ended up using d-mozulyov's CachedTexts; it was exactly what I wanted, so if you post it as a proper answer to my question I will accept it – TioGuedes Oct 25 '17 at 09:27
  • It is funny how many other custom fast file-streaming classes there are. AFAIR @DavidHeffernan made his own library too a while ago :-D – Arioch 'The Oct 25 '17 at 09:32
  • @Arioch'The Indeed, I was very impressed by the one you recommended. There are surely other good ones; for now CachedTexts is doing some nice work, but I am open to trying other options – TioGuedes Oct 25 '17 at 10:26
  • "TLogFile. There is only one instance of this class at any time and is destroyed whenever the user wants to load a .log file. All threads add commands to the same object, that's the reason for the critical section." - and that is fragile. It makes your program dependent upon "good behavior" by ALL readers and ALL writers in ALL the places they read and write. Both today and tomorrow. Check the Delphi sources, classes `TList` and `TThreadList`. Check OTL's `TOmniBlockingCollection` later (it is much more complex). All synchronization (CS or otherwise) is encapsulated inside those classes – Arioch 'The Oct 25 '17 at 11:00
  • You had better make a `TThreadAwareLogFile` class that exposes the same methods as `TLogFile` but internally ensures all reads and all writes are synchronized. Then all your readers and writers would be free to mind only their own business without thinking about other threads. Actually, that is one of OTL's goals: building a multithreaded application out of single-threaded, isolated procedures. Really, look into its blog and tutorials – Arioch 'The Oct 25 '17 at 11:02

1 Answer

  1. TStringList is a slow class per se. It has a lot of bells and whistles - extra features and functions that bog it down. Much faster containers would be TList<string> or a plain old dynamic array of string. See the System.IOUtils.TFile.ReadAllLines function (a small sketch follows after this list).

  2. Read about Heap Memory Fragmentation, for example http://en.wikipedia.org/Heap_fragmentation

It can happen and break your application even without memory leaks. But since you say there are many small leaks - that is most probably what is happening. You can delay the crash to some extent by avoiding reading whole files into memory and working with smaller chunks, but the degradation would still go on, only more slowly, and in the end your program would crash again.

  3. There are a lot of ad hoc classes and libraries that read large files piece by piece with buffering, pre-fetching and whatnot. One such library, targeted at text, is http://github.com/d-mozulyov/CachedTexts, and there are others too.
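
As a minimal sketch of the first point (just an illustration using the RTL, not your code), loading into a plain dynamic array and iterating looks like this:

uses
  System.IOUtils;

procedure ParseWithReadAllLines(const APath: string);
var
  tmpLines: TArray<string>;
  tmpLine: string;
begin
  // Still reads the whole file into memory, but into a plain array of string,
  // without the extra machinery (and overhead) that TStringList carries around.
  tmpLines := TFile.ReadAllLines(APath);
  for tmpLine in tmpLines do
  begin
    if tmpLine = '' then
      Continue;
    // parse tmpLine here, e.g. build a command object from it
  end;
end;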

PS. General notes.

I think your team should reconsider how much you actually need multithreading. Frankly, I see no need. You are loading files from an HDD and probably writing the processed, transformed data to the same (or at best another) HDD. That means your program's speed is limited by disk speed, and that speed is MUCH lower than CPU and RAM speeds. By introducing multithreading you only seem to make your program more complex and fragile: errors are much harder to detect, well-known libraries may suddenly misbehave in MT mode, etc. And you probably get no performance increase, because the bottleneck is disk I/O.

If you still want multithreading for its own sake, then perhaps look into the OmniThreadLibrary. It was designed to simplify developing "data stream" types of MT applications. Read its tutorials and examples.

I definitely suggest you squash all those "minor leaks" and, as part of that, fix all compilation warnings. I know it is hard when you are not the only programmer on the project and others do not care. Still, "minor leaks" means no one on your team knows how the program actually behaves or behaved. And non-deterministic, random behaviour in a multithreaded environment can easily generate tonnes of random Schrödinbugs which you will never be able to reproduce and fix.

Your try-finally pattern really is broken. The variable you clean up in the finally block should be assigned right before the try block, not within it!

o := TObject.Create;
try
  ....
finally
  o.Destroy;
end;

This is the correct way:

  • either the object fails to be created - then the try block is not entered and neither is the finally block;
  • or the object is successfully created - then the try block is entered and so is the finally block.

So, sometimes,

o := nil;
try
  o := TObject.Create;
  ....
finally
  o.Free;
end;

This is also correct. The variable is set to nil immediately before the try block is entered. If object creation fails, then by the time the finally block calls Free the variable has already been assigned, and TObject.Free (but NOT TObject.Destroy) was designed to work on nil object references. By itself this is just a noisier, overly verbose variant of the first pattern, but it serves as a foundation for a few more derivatives.
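
For reference, this is essentially how TObject.Free behaves (a simplified paraphrase of the RTL idea, not the literal System.pas source), which is why calling it on a nil reference is harmless while calling Destroy is not:

procedure TObject.Free;
begin
  // Free is an ordinary static method, so it may be called on a nil
  // reference; it only invokes the virtual destructor when Self is assigned.
  if Self <> nil then
    Destroy;
end;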

That pattern may be used when you do not know in advance whether you will create an object or not.

o := nil;
try
  ...
  if SomeConditionCheck() 
     then o := TObject.Create;  // but maybe not
  ....
finally
  o.Free;
end;

Or when object creation is delayed because you need to calculate some data for it first, or because the object is very heavy (for example, it globally blocks access to some file) and you strive to keep its lifetime as short as possible.

o := nil;
try
  ...some code that may raise errors
  o := TObject.Create; 
  ....
finally
  o.Free;
end;

This code, though, raises the question of why that "...some code that may raise errors" was not moved outside, before the try block. Usually it can and should be, so this is a rather rare pattern.

One more derivative of that pattern is used when creating several objects:

o1 := nil;
o2 := nil;
o3 := nil;
try
  o2 := TObject.Create;
  o3 := TObject.Create;
  o1 := TObject.Create;
  ....
finally
  o3.Free;
  o2.Free;
  o1.Free;
end;

The goal is that if, for example, the creation of o3 fails, then o2 (already created) still gets freed, while o1 was never created and remains nil, and the Free calls in the finally block handle both cases correctly.

That is only semi-correct: it assumes that destroying an object never raises an exception of its own. Usually that assumption holds, but not always. Anyway, this pattern lets you fuse several try-finally blocks into one, which makes the source code shorter (easier to read and reason about) and execution a little faster. Usually it is also reasonably safe, but not always.

Now two typical misuses of the pattern:

o := TObject.Create;
..... some extra code here
try
  ....
finally
  o.Destroy;
end;

If the code BETWEEN object creation and the try block raises an error - then there is nothing left to free that object. You just got a memory leak.

When you read the Delphi sources you may see a similar pattern:

with TObject.Create do
try
  ....some very short code
finally
  Destroy;
end;

Whatever the widespread zeal against any use of the with construct, this pattern makes it impossible to add extra code between object creation and the try guard. The typical with drawbacks - possible namespace collisions and the inability to pass this anonymous object to other functions as an argument - are still included, though.

One more unlucky modification:

o := nil;
..... some extra code here
..... that never changes the value of o
..... and our fortune-teller guarantees it never will
..... we know it for sure
try
  ....
  o := TObject.Create;
  ....
finally
  o.Free;
end;

This pattern is technically correct, but rather fragile. You do not immediately see the link between the o := nil line and the try block. When you develop the program further, you may easily forget it and introduce errors: copy-pasting or moving the try block into another function and forgetting the nil initialization, or extending the in-between code so that it uses (and thus changes) the value of o. There is one case where I sometimes use it, but it is very rare and comes with risks.

Now,

...some random code here that does not
...initialize the o variable, so o contains
...random memory garbage here
try
  o := TObject.Create;
  ....
finally
  o.Destroy; // or o.Free
end;

This is what gets written a lot by people who do not think about how try-finally works and why it was invented. The problem is simple: when you enter the try block, your o variable contains random garbage. Now, when you try to create the object, some error may be raised. What then? You go into the finally block and call (random garbage).Free - and what should that do? It does something random.

So, to repeat all the above.

  1. try-finally is used to guarantee that an object is freed or that any other cleanup happens (closing files, closing windows, etc.), and hence:
  2. the variable used to track that resource (such as an object reference) should have a well-known value on entry into the try block, thus it should be assigned (initialized) before the try keyword. If you guard a file - open it immediately before try. If you guard against a memory leak - create the object before try. And so on. Do not do the first initialization AFTER the try keyword - WITHIN the try block - it is too late there.
  3. you had better design the code to be as simple (self-evident) as you can, eliminating the potential for future errors when you forget the implicit, hidden assumptions you keep in the corner of your mind today - and then violate them. See Who wrote this programming saying? "Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live." Here it means: initialize (assign) the variable guarded by the try block IMMEDIATELY before the block starts, right above the try keyword. Better yet, insert an empty line before that assignment. Make it jump into your (or any other reader's) eyes that this variable and this try are mutually dependent and should never be separated.
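
Applied to the Execute method from the question, the structure could look roughly like this (a sketch under the same field and type assumptions as the question's code, not a drop-in replacement):

procedure TThreadFactory.Execute;
var
  tmpLogFile: TStringList;
  tmpConvertedList: TList<TLogCommand>;
  tmpLine: string;
begin
  try
    // Create the list BEFORE the try block that frees it.
    tmpConvertedList := TList<TLogCommand>.Create;
    try
      if (_path <> '') and not Terminated then
      begin
        // Same rule for the string list: create it right before its try block.
        tmpLogFile := TStringList.Create;
        try
          tmpLogFile.LoadFromFile(_path);
          for tmpLine in tmpLogFile do
          begin
            if Terminated then
              Break;
            if tmpLine <> '' then
              tmpConvertedList.Add(TLogCommand.Create(tmpLine));
          end;
        finally
          tmpLogFile.Free;
        end;
      end;

      // Acquire the lock right before the try block that releases it,
      // so Release is never called on a lock that was not acquired.
      _criticalSection.Acquire;
      try
        _logFile.AddCommands(tmpConvertedList);
      finally
        _criticalSection.Release;
      end;
    finally
      tmpConvertedList.Free;
    end;
  except
    on e: Exception do
      _error := e.Message;
  end;
end;
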
Arioch 'The
  • Very good answer, thank you for your time and concern with our application. The leaks are gone; as I said, since the beginning they were due to some misuse of `TStringList`. Also, thanks for the class on `try-finally` – TioGuedes Oct 25 '17 at 10:57
  • @TioGuedes We had similar issues in some places with `try-finally`, and it bit us when we moved to XE2 / XE7 and 64-bit, as local object references weren't magically initialized with `nil` anymore. – nil Oct 25 '17 at 11:30
  • `local object references weren't magically initialized with nil anymore` - did they ever? Only global references were zeroed. Also automatically refcounted types (interfaces, dynarrays, strings). But plain object references? Not until LLVM I think, @nil – Arioch 'The Oct 25 '17 at 12:47
  • I did not work on that issue back then, but we talked about it in the team. But, curious after reading your comment, Arioch 'The, I found tickets in our bug-tracking system and could see it happen when debugging yesterday. I can't make an MCVE; it is not even happening 'everywhere' in the same project. I believe it to be some freak accident, depending on what content was in the registers/stack. But the thing is, the code was written depending on this freakish behavior - pretty much a bad version of your conditional-creation example, missing the explicit assignment before the `try` block. – nil Oct 26 '17 at 09:56
  • @nil I just recently found and squashed a Schrödinbug where an uninitialized local Boolean variable was used to cancel the procedure. When some other BPLs were compiled for Release, the variable was zero (a 1-in-256 chance) and the bug did not manifest. When I internally deployed test versions, Debug-compiled, the variable was some other garbage, non-zero. The variable was register-mapped, so the bug depended on which garbage other procedures (object methods used) were leaving in the CPU. But from outside the asm black box it really looked like the Boolean local var was zeroed in Release – Arioch 'The Oct 26 '17 at 11:29