Informatica BDE ingestion job runs for 10+ hours, but completes in 3 hours when killed and rerun

Question

About my profile - I provide L3 support for some of the Informatica BDE ingestion jobs that run on our cluster. Our goal is to help application teams meet their SLAs. We support job streams that run on top of the Hadoop layer (Hive).

Problem Statement - We have observed that on some days the BDE Informatica ingestion jobs run painfully slowly (10+ hours), while on other days they complete their cycle in 3 hours. If a job is taking too long, we usually kill and rerun it, which gets the load done, but that does not fix the root cause.
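As a starting point, the slow days can be picked out objectively from run history instead of by eye. A minimal Python sketch, assuming you can export each day's run duration in hours; the 2x-median threshold and the sample numbers are hypothetical:

```python
from statistics import median

def flag_slow_runs(durations_hours, factor=2.0):
    """Return the indices of runs that took more than `factor` times the median duration."""
    if not durations_hours:
        return []
    baseline = median(durations_hours)  # the median is not skewed by the few very slow days
    return [i for i, d in enumerate(durations_hours) if d > factor * baseline]

# Hypothetical history of daily run durations, in hours.
history = [3.1, 2.9, 3.0, 10.5, 3.2, 11.0, 3.0]
print(flag_slow_runs(history))
```

The flagged days are the ones worth comparing against normal days in the logs and cluster metrics.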

Limitations of our profile - Unfortunately, I don't have access to the application code or the Informatica tool, so I have to reach out to the development team and ask relevant questions to narrow down the root cause.

Next Steps -

What sort of scenarios can cause this delay?
What tools can I use to check what may be causing the delay?
A few possible questions I may ask the development team are:
1. Are the tables analysed properly before the job stream runs?
2. Is there any significant change in the volume of data? (This seems a bit unlikely, since the job runs quickly on rerun.)
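For question 2, comparing the input row count of the slow run with that of the quick rerun on the same day would rule volume in or out. A minimal Python sketch, assuming both row counts can be obtained; the 20% tolerance and the counts are hypothetical:

```python
def volume_changed(previous_rows, current_rows, tolerance=0.2):
    """Return True if the row count moved by more than `tolerance` (as a fraction) between two runs."""
    if previous_rows == 0:
        # No baseline to compare against: any new data counts as a change.
        return current_rows != 0
    relative_change = abs(current_rows - previous_rows) / previous_rows
    return relative_change > tolerance

# Hypothetical row counts: slow run vs quick rerun on the same day.
print(volume_changed(1_000_000, 1_000_000))
print(volume_changed(1_000_000, 1_500_000))
```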

I am aware this is a very broad question and that I am asking for help with the approach rather than with a specific problem, but this is a start toward fixing this issue for good, or at least approaching it in a rational manner.

This is an interesting problem, but it is certainly too broad for the Stack Overflow platform. — halfer, Nov 10 '19 at 11:01

score 1 · Answer 1 · answered Dec 13 '18 at 17:20

You need to check the Informatica logs to see if it's hanging at the same step each time.
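Once per-step timings are pulled out of the logs for a quick run and a slow run, ranking the steps by slowdown shows whether one step accounts for the extra hours. A minimal Python sketch; the log format is unknown here, so it assumes the durations have already been extracted into dictionaries, and the step names and timings are hypothetical:

```python
def compare_steps(fast, slow):
    """Rank steps by how much slower they were in the slow run.

    `fast` and `slow` map a step name to its duration in minutes.
    Returns (step, ratio) pairs, largest slowdown first.
    """
    ratios = []
    for step, fast_minutes in fast.items():
        slow_minutes = slow.get(step)
        if slow_minutes is None or fast_minutes <= 0:
            # Skipped: the slow run never finished this step (a candidate for
            # where it hung), or there is no usable baseline.
            continue
        ratios.append((step, slow_minutes / fast_minutes))
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)

# Hypothetical step timings from a quick run and a slow run.
fast_run = {"extract": 40, "load_stage": 60, "hive_insert": 80}
slow_run = {"extract": 42, "load_stage": 65, "hive_insert": 500}
print(compare_steps(fast_run, slow_run))
```

If the same step tops the list on every slow day, that is where to focus; if it varies, the cause is more likely environmental.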

Assuming it's not, are you triggering the jobs at the same time each day? Say it starts at midnight and usually completes by 3am, but sometimes runs until 10am, at which point you kill and restart it?

If so, I suggest you baseline the storage medium activity in three windows: under minimal load, during a quick 3-hour run, and during a 10-hour run. Is there a difference in demand?
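Once a storage metric has been sampled in each window, a simple comparison against the idle baseline makes the difference in demand explicit. A minimal Python sketch, assuming a metric such as disk-busy percentage is sampled the same way in every window; the window names and sample values are hypothetical:

```python
from statistics import mean

def compare_windows(samples):
    """Return each window's mean metric value as a multiple of the idle baseline."""
    baseline = mean(samples["idle"])
    if baseline == 0:
        raise ValueError("idle baseline is zero; cannot compute ratios")
    return {name: mean(values) / baseline for name, values in samples.items()}

# Hypothetical disk-busy percentages sampled during each window.
samples = {
    "idle": [5, 6, 4, 5],
    "quick_run": [40, 45, 42, 43],
    "slow_run": [90, 95, 92, 91],
}
print(compare_windows(samples))
```

A slow run whose demand is far above a quick run's, on the same data volume, points at something else competing for the storage at that time.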

It sounds like resource contention that is causing a conflict: a process may wait forever instead of resuming when the resource it needs becomes available. Speak to the DBAs.
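To illustrate the failure mode, here is a generic Python sketch (not Informatica or Hive behaviour) of a worker contending for a shared resource. A plain `acquire()` with no timeout would block indefinitely while another holder keeps the resource; a bounded wait lets the worker give up and report it. The lock stands in for any shared resource, such as a table lock, and `try_to_work` is a hypothetical helper:

```python
import threading

# Stands in for any shared resource another job might hold (hypothetical).
resource_lock = threading.Lock()

def try_to_work(timeout_seconds):
    """Try to take the shared resource; give up after `timeout_seconds` instead of waiting forever."""
    acquired = resource_lock.acquire(timeout=timeout_seconds)
    if not acquired:
        return "timed out waiting for the resource"
    try:
        return "did the work"
    finally:
        resource_lock.release()

# Simulate another job holding the resource, then releasing it.
resource_lock.acquire()
print(try_to_work(0.1))  # the resource is still held, so the bounded wait expires
resource_lock.release()
print(try_to_work(0.1))  # the resource is free, so the work proceeds
```

In the ingestion jobs, the equivalent question for the DBAs is which resource the slow run is waiting on, and whether the wait has any timeout at all.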

