About my profile - I provide L3 support for some of the BDE Informatica ingestion jobs that run on our cluster. Our goal is to help application teams meet their SLAs. We support job streams that run on top of the Hadoop layer (Hive).
Problem Statement - We have observed that on some days the BDE Informatica ingestion jobs run painfully slowly, while on other days they complete their cycle in 3 hours. When a job is taking too long, we usually kill it and rerun it, which helps, but that does not fix the root cause.
Limitations of our profile - Unfortunately, I don't have access to the application code or the Informatica tool, so I have to contact the development team and ask relevant questions so that we can narrow down the root cause.
Next Steps -
- What sort of scenarios can cause this delay?
- What tools can I use to check what may be the cause of the delay?
- A few possible questions which I may ask the development team are -
- Are the tables analysed properly before running the job stream?
- Is there any significant change in the volume of data (this is a bit unlikely, as the job runs quickly on rerun)?
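To put numbers behind the "slow days" before going to the development team, I am thinking of something like the minimal sketch below. It assumes I can collect the wall-clock duration of each daily run (for example from the scheduler logs). The function name `flag_slow_runs`, the 2x-median threshold and the sample dates and durations are all hypothetical, not taken from our actual jobs.

```python
from statistics import median


def flag_slow_runs(durations_hours, factor=2.0):
    """Return (run_date, hours) pairs whose duration exceeds factor * median.

    durations_hours: dict mapping a run date string to wall-clock hours.
    factor: how many times the median a run must exceed to count as slow
            (an arbitrary starting point, to be tuned).
    """
    baseline = median(durations_hours.values())
    return [
        (run_date, hours)
        for run_date, hours in sorted(durations_hours.items())
        if hours > factor * baseline
    ]


# Hypothetical sample data: run date -> wall-clock hours
sample_runs = {
    "2024-01-01": 3.0,
    "2024-01-02": 3.2,
    "2024-01-03": 9.5,
    "2024-01-04": 2.9,
    "2024-01-05": 3.1,
}

for run_date, hours in flag_slow_runs(sample_runs):
    print(f"{run_date}: {hours} h (slow run)")
```

The dates this flags could then be compared with what else was running on the cluster on those days, which might help separate a resource contention problem from a problem in the job itself.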
I am aware that this is a very broad question, and that I am asking for help with the approach rather than with a specific problem, but this is just a starting point for fixing this issue for good, or at least for approaching it in a rational manner.