
Updates

2016-02-18: Added process information


I have a Delphi program compiled using XE4. It is used by a few hundred customers. A couple of weeks ago one of these customers reported that some areas of the executable were being erased (images below) at random times during the day. This client has 35 sites using this exe, and the problem occurs on no more than 10 of them.

Investigation

1 - My first suspicion was an infinite loop. However, the exe keeps responding while the components are erased, nothing in the code has changed radically since the time when this problem didn't happen, and the logs don't show any loop (this exe has logs everywhere).

2 - Misbehaving threads. I have a separate thread that syncs data between this exe and our server in the cloud. Again, the logs don't show that the thread is running when the problem occurs, and again, nothing was changed here.

3 - Some other program (antivirus?) is affecting my exe. I couldn't investigate this hypothesis properly yet, but so far I couldn't find any installed program that raised my attention.

My question is: What could be causing this issue? How can I investigate further? I know this may be a broad question, but this is all the information I could gather and I can't think of many more places to look.

Images

1 - In the image below the red-stroked area should be a TToolBar

erased TToolBar

2 - In this second image there are three areas. From top to bottom, the first one should be a TToolBar, the second one should be the title of the child form and the third one should be a TwwDBGrid

Menu, title and TwwDBGrid erased

3 - The third example shows at the top the erased area where a TEdit should be; just below it there's what should be a line on a TwwDBGrid, and on the side we can see an erased scrollbar from the TwwDBGrid

TEdit, line and scrollbar erased

4 - This last example shows 5 erased areas: The title of the application, the main TToolBar, The title of the Form, a TButton and two TwwDBGrid


5 - This is an interesting example because, beyond the erased components, there are 4 TSpeedButtons that are not erased but are missing the images they originally have (the first red-stroked areas). The other 3 red-stroked areas are, in order, 2 TEdits, a TwwDBGrid and a TButton


Process Information

I got a screenshot taken at the moment the problem occurred. scgolr is my software.


Ricardo Acras
  • Problems with graphics cards and drivers would be my guess. Toolbars are notoriously sensitive to such things. – David Heffernan Feb 02 '16 at 11:31
  • @DavidHeffernan, thanks for your willingness to help, as always. As you can see, there are cases like image 3 where the Toolbar is not erased and a TEdit and parts of a TwwDBEdit are. Do you think that, even with this case, it could be graphics/drivers? – Ricardo Acras Feb 02 '16 at 11:44
  • That's less likely I think. Perhaps the defect is in your program. – David Heffernan Feb 02 '16 at 11:45
  • @DavidHeffernan Yes, I wish it was in my program, it would be easier to solve. I just don't know where to look anymore.. :-/ – Ricardo Acras Feb 02 '16 at 11:50
  • Does this happen to any other applications on those machines while your application is running or without it running? Are those machines the same build of hardware? Do they use the same installation of Windows with the same service packs/updates/hotfixes? Do they share a common image/installation? – Blurry Sterk Feb 02 '16 at 12:41
  • Great points @BlurrySterk, will ask them and add this info in my question. – Ricardo Acras Feb 02 '16 at 13:36
  • Are these all design-time components? I assume yes. – Jan Doggen Feb 02 '16 at 14:12
  • @JanDoggen yes, all components inserted at design-time – Ricardo Acras Feb 02 '16 at 16:09
  • Considering that you have placed a bounty I would assume that you have already eliminated the possibility of the machines being the cause and as such you know that it is a code issue? – Blurry Sterk Feb 04 '16 at 11:44
  • @BlurrySterk I placed a bounty to draw attention. This solution is very important to me. Unfortunately I don't have remote access to these computers and my client's IT guys haven't sent me any information about the computers. – Ricardo Acras Feb 04 '16 at 12:19
  • What is the ancestor hierarchy of that TwwDBGrd? Please give the complete hierarchy. – Blurry Sterk Feb 04 '16 at 12:45
  • Just a few guesses: the app saves some component status in the system registry / the registry has become invalid / the app cannot read the saved status? – fantaghirocco Feb 04 '16 at 13:31
  • The components are missing when the form shows? Does the program work well until suddenly all new forms begin to fail? Maybe you have some memory or resource leak and are exhausting some Windows/GDI resource. Verify that you are properly freeing your forms when closed. – JRL Feb 04 '16 at 20:16
  • You already have the Processes tab open in the Task Manager. Please select the columns `GDI Objects` and `User Objects` using `View` (Exibir) -> `Select Columns`. Sort by these columns and see if there is any excessive use of those Objects. – Sebastian Z Feb 04 '16 at 20:49
  • Your program is used by a *few hundred* customers. **One** customer has seen the problem on less than a third of their sites. Since the problem is highly visible and workflow disruptive, I'm sure that you would have been contacted by other customers if it were a general problem. Thus the key to the reason is to be found at those sites where it has been seen. In addition, the problem occurred at a fairly specific time frame ( *a couple of weeks ago* ) in the beginning of this year. IMO the first step would be to clarify what changed at that time? .... – Tom Brunberg Feb 04 '16 at 22:11
  • (continues) Hardware, network, server, OS, other software. Not to forget, how your software is used. Really anything that was changed at that time. The list to check is long, and the aim is to find the common change for those few sites that see the problem. The change itself may turn out to be the actual reason (e.g. incompatible hardware) or it may just have triggered a weakness in your software. Once you know what brought the problem to the surface it reduces the search significantly. – Tom Brunberg Feb 04 '16 at 22:12
  • It is very unlikely that something in your executable was deleted. You can verify that by comparing the md5 sum of the "broken" exe with the md5 sum of a working copy of the same version of your exe. My guess is that a Windows update on the customer's machine or some user-specific configuration leads to these errors. – Alexander Baltasar Feb 05 '16 at 06:50
  • @JRL The program works well for some time and suddenly some components are "erased". – Ricardo Acras Feb 05 '16 at 13:09
  • @SebastianZ Thank you. Will ask the client to do that and show the results here. – Ricardo Acras Feb 05 '16 at 13:10
  • @TomBrunberg Thank you for the complete insight. I have been asking the client for this information for a few days. They have very strict security rules so I don't have remote access to the machines. It will take a while until I get all this info, but as soon as I get it I will update my question. Thank you again. – Ricardo Acras Feb 05 '16 at 13:18
  • Just some hints: check the available GDI resources as described by @SebastianZ. Also check process handles; maybe you have a handle leak. You can check [ATOM table usage](http://thundaxsoftware.blogspot.com.by/2012/02/monitoring-global-atom-table-part-i.html), but you need a friendly client to do this. Good luck. – Aleksey Kharlanov Feb 05 '16 at 19:18
  • *They have very strict security rules so I don't have remote access to the machines*. But as Tom Brunberg says: it is only on *their* machines. Do not hesitate to push through your superiors to get more cooperation from them. They surely can let you have remote viewing (not even control) for a limited time while you are on the phone with one of their techs doing the investigation. – Jan Doggen Feb 10 '16 at 15:07
  • Seems like a GDI leak. Check this: http://stackoverflow.com/questions/10231556/gdi-handle-leak-using-tgifimage-in-a-second-thread – andrucz Feb 10 '16 at 21:07
  • It seems there are some modal forms in your application and, as you said, there is a thread that loads information from the cloud (may be related to http://stackoverflow.com/questions/10231556/gdi-handle-leak-using-tgifimage-in-a-second-thread ). Also check if all graphical elements are being released when closing these modal forms (to avoid a GDI leak). – andrucz Feb 10 '16 at 21:17
  • Added information about GDI. – Ricardo Acras Feb 18 '16 at 17:55
  • You added no *information about GDI*. You added a screen capture of the process tab in Task Manager, with Google Chrome highlighted. – Ken White Feb 18 '16 at 18:07
  • @KenWhite The processes are ordered by the GDI column. The chrome process was selected by the user. She sent me that image. – Ricardo Acras Feb 19 '16 at 12:34

3 Answers


There is really not enough detailed information to give you a definite answer. However, I can answer with some directions on your question:

How can I investigate further?

Because of what you have stated:

  • The program is in use by a few hundred customers
  • One (only) customer experiences the problem
  • First occurrence of the problem was some weeks ago

the first thing to do, is get in contact with the customer, and get the information you say that you have asked for but not got. The questions that need to be answered are:

  • What changed in the customer's environment at the time the problem started with respect to hardware, network, server, OS, or other software running on the PCs?
  • Has anything changed in the way your customer is using your software?
  • What does the customer have to do to get rid of the problem once it occurs? Close the program? Restart the PC? Or maybe just minimize and restore the erroneous window?

With the above I do not suggest that the fault is with this one customer and their equipment or their way of using the software. It may just be that the combination at the site which is different from all your other customers, triggers the problem to show up.

Some specific things to check in your software and at the site when the problem occurs, and if the problem goes away with a minimize and restore of the application (which would suggest an interrupted-painting problem):

  • Do you call Application.ProcessMessages at any time?
  • Does the background thread access the same data as the GUI? If yes, is the data protection properly in place (locking, synchronisation)?
  • Does the background thread access any GUI components without Synchronize?
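For reference, this is roughly what "accessing GUI components with Synchronize" looks like. It is only a minimal sketch, not the asker's code: `TSyncThread`, the label and the `Sleep` call are hypothetical stand-ins for the real sync thread and its work. The form is built in code so that the example fits in a single .dpr file.

```delphi
program SyncDemo;

uses
  System.Classes, System.SysUtils, Vcl.Forms, Vcl.StdCtrls;

type
  // Hypothetical worker standing in for the cloud-sync thread.
  TSyncThread = class(TThread)
  private
    FStatus: TLabel;     // VCL control: only touched inside ShowProgress
    FProgress: Integer;
    procedure ShowProgress;  // executed in the main thread via Synchronize
  protected
    procedure Execute; override;
  public
    constructor Create(AStatus: TLabel);
  end;

constructor TSyncThread.Create(AStatus: TLabel);
begin
  // The thread only starts running after construction has completed,
  // so assigning the field after the inherited call is safe.
  inherited Create(False);
  FStatus := AStatus;
end;

procedure TSyncThread.Execute;
var
  I: Integer;
begin
  for I := 1 to 10 do
  begin
    if Terminated then
      Exit;
    Sleep(200);                 // placeholder for a slice of real work
    FProgress := I * 10;
    Synchronize(ShowProgress);  // never set FStatus.Caption directly here
  end;
end;

procedure TSyncThread.ShowProgress;
begin
  FStatus.Caption := Format('Sync %d%%', [FProgress]);
end;

var
  Form: TForm;
  Status: TLabel;
  Worker: TSyncThread;
begin
  Application.Initialize;
  Form := TForm.CreateNew(nil);
  try
    Status := TLabel.Create(Form);
    Status.Parent := Form;
    Status.Caption := 'Idle';
    Worker := TSyncThread.Create(Status);
    try
      Form.ShowModal;
    finally
      Worker.Terminate;
      Worker.WaitFor;  // the label still exists while the thread finishes
      Worker.Free;
    end;
  finally
    Form.Free;
  end;
end.
```

Any place where the thread reads or writes a control outside such a synchronized method is a candidate for the kind of painting corruption described in the question.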

Finally I suggest that you visit the customer onsite. You get much better and faster answers in a direct discussion.


Edit after process information received.

There is nothing alarming concerning GDI or User objects. But it is alarming when you say in the comments that you call Application.ProcessMessages in many places, obviously to 'fix' a non-responsive UI. For example, what happens if the user double clicks a button, but does it slowly enough that Windows detects it as two separate clicks? The first click may start your long-lasting procedure, within which you call A.P. The second click is then read from the message queue, which starts the same procedure again. Now the second call to the procedure runs (with its own calls to A.P.) and eventually ends, and execution returns to the first call. Depending on what you do in this procedure, you may well be messing up handles, device contexts etc. A strong recommendation said with a friendly intent: Get rid of those calls to A.P.
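The re-entrancy scenario above can be reproduced in a few lines. The sketch below is hypothetical (the form, the button and the `Sleep` placeholder are not from the asker's program). It shows a handler that calls Application.ProcessMessages and a simple busy flag that rejects the nested second call. The flag is only a stopgap for code that has not been refactored yet, not a replacement for removing the A.P. calls.

```delphi
program ReentrancyDemo;

uses
  System.Classes, System.SysUtils, Vcl.Forms, Vcl.StdCtrls;

type
  TDemoForm = class(TForm)
  private
    FBusy: Boolean;        // True while RunClick is executing
    FRunButton: TButton;
    procedure RunClick(Sender: TObject);
  public
    procedure BuildUI;
  end;

procedure TDemoForm.BuildUI;
begin
  FRunButton := TButton.Create(Self);
  FRunButton.Parent := Self;
  FRunButton.Caption := 'Run';
  FRunButton.OnClick := RunClick;
end;

procedure TDemoForm.RunClick(Sender: TObject);
var
  I: Integer;
begin
  // Without this check, a second click dispatched by ProcessMessages below
  // would start a nested run of this handler before the first one finishes.
  if FBusy then
    Exit;
  FBusy := True;
  try
    for I := 1 to 30 do
    begin
      Sleep(100);                   // placeholder for a slice of long-running work
      Application.ProcessMessages;  // pumps queued messages, including clicks
    end;
  finally
    FBusy := False;
  end;
end;

var
  Form: TDemoForm;
begin
  Application.Initialize;
  Form := TDemoForm.CreateNew(nil);  // CreateNew: no .dfm resource needed
  try
    Form.BuildUI;
    Form.ShowModal;
  finally
    Form.Free;
  end;
end.
```

Removing the `if FBusy` check and clicking the button twice makes the second run start while the first is still inside its loop, which is exactly the nesting described above.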

Tom Brunberg
  • I finally got access to the process information with the GDI information... added to the question. – Ricardo Acras Feb 18 '16 at 17:55
  • I call Application.ProcessMessages in many places where I have long running code. – Ricardo Acras Feb 18 '16 at 17:56
  • @Ricardo That is not good! You need to remove those A.P. calls. – Tom Brunberg Feb 18 '16 at 18:02
  • @Ricardo Here's [a link](http://stackoverflow.com/questions/25181713/i-do-not-understand-what-application-processmessages-in-delphi-is-doing) worth reading, and [here](http://delphi.about.com/od/objectpascalide/a/delphi-processmessages-dark-side.htm) is another one. – Tom Brunberg Feb 18 '16 at 18:07
  • Thank you Tom Brunberg! I looked deeper and I actually use ProcessMessages in fewer than 10 places in my software. But I really didn't know about this issue and it makes perfect sense now. @Passella's answer solved the problem, but based on your comments I have already started a refactoring to replace the places where I use ProcessMessages with threads. – Ricardo Acras Feb 19 '16 at 15:48

As @SebastianZ and @AlekseyK pointed out, you may be experiencing exhaustion of some GDI resource (handles?). If the system could be accessed, tools like Process Explorer or Process Hacker could give you some hints. The GDIView utility may help too.

I don't know if this applies to your case, but sometimes database data corruption can lead to strange effects in running programs (I remember 'Data Bombs' causing out of memory exceptions ...).

So if something causes a GDI allocation loop, the graphics of your app could be affected in 'strange' ways.
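Since remote access to the machines is not available, one option is to have the program report its own object counts. The following is a minimal sketch using the Windows API function `GetGuiResources`; the helper name `GuiObjectCount` and the output format are made up for illustration. By default Windows allows about 10,000 GDI objects per process, so a count that keeps climbing towards that figure would point to a leak.

```delphi
program GuiResourceCount;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

// Returns the number of GDI or USER objects currently held by this process.
// Kind is GR_GDIOBJECTS or GR_USEROBJECTS (declared in Winapi.Windows).
function GuiObjectCount(Kind: DWORD): Integer;
begin
  Result := Integer(GetGuiResources(GetCurrentProcess, Kind));
end;

begin
  // In a real application this line would go to the existing log file,
  // for example from a timer, instead of to the console.
  Writeln(Format('%s GDI objects: %d, USER objects: %d',
    [FormatDateTime('yyyy-mm-dd hh:nn:ss', Now),
     GuiObjectCount(GR_GDIOBJECTS),
     GuiObjectCount(GR_USEROBJECTS)]));
end.
```

A console program holds very few such objects, so the numbers only become meaningful when the same call is made from inside the affected GUI application.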

hute37

The problem is with the security plugin (Warsaw - Gas Tecnologia) of the bank website that your client is accessing. Update it and the problem will be solved. This problem happens in Brazil.

Passella
  • I know that any software from Gas Tecnologia is problematic, but are you certain that it causes this erased components problem? – Ricardo Acras Feb 19 '16 at 12:36
  • This problem has occurred here with our clients. http://www.reclameaqui.com.br/14363141/gas-tecnologia-a-diebold-company/gas-tecnologia-interferindo-em-sistemas-desenvolvidos-em-del/ – Passella Feb 19 '16 at 12:40
  • It was, indeed, the problem. For others to see what kind of trouble Gas Tecnologia's software can cause, take a look at http://www.linhadefensiva.com/2013/04/brazilian-users-unable-to-boot-windows-after-botched-update/ – Ricardo Acras Feb 19 '16 at 15:55