How Did We Increase Incident Analysis Efficiency by Over 40% Using Splunk?

Splunk is recognized as an end-to-end data platform that turns data into insights and actionable measures. Log collection, aggregation, visualization, and insights are table stakes for enterprise applications. Modern platforms are expected to go further: real-time log analysis, actionable insights generated by machine learning algorithms, failure trend analysis, and real-time visualization during incidents.

As we looked for a platform with these capabilities, we chose Splunk Cloud as our enterprise solution, and this article shares our key learnings. The approach improved production incident resolution efficiency by over 40%.

To start with, we leveraged the following key capabilities of Splunk:

Collecting log data from applications, including batch applications
Monitoring and analyzing data generated in real time
Creating alerts based on insights or events
Visualizing information using various dashboards
Analyzing daily, weekly, and monthly trends, spikes, and patterns

What was the key challenge?

Our application portfolio consisted of many batch applications (using ETL), and the key challenge was finding the root cause of an issue during a production incident.
When we gathered feedback from the incident resolution team, they wanted answers to the following questions during an incident so that they could resolve issues faster:

What was the sequence of events when processing failed for a batch job?
Why did the processing fail for a particular batch job?
What events were happening in the system when the failure occurred?
Where is the failure happening so that the respective team can be contacted?

What was our approach?

There is no silver bullet for any problem, so we set out to make life better for the production incident resolution team one step at a time. As with any development process, we applied the Agile philosophy to these challenges and used Splunk as the platform, relying on its out-of-the-box capabilities. Honestly speaking, no platform can address all your challenges, but it can provide a path or a framework towards solving them.

Following the above approach, we incrementally did the following:

Used Splunk for log collection for each batch job
Formulated a logging strategy to ensure we log meaningful information to provide insights useful for the incident resolution team
Trained the team to use Splunk to create dashboards using real-time visualization
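
To make the logging strategy concrete, here is a minimal sketch of what a consistent, searchable log line for a batch job could look like. It is an illustration, not our production code: the function names (format_event, run_step) and the field names (job, run_id, event, step, error, owner_team) are assumptions chosen for the example. The idea is that every line carries the same key=value fields, so questions such as "what was the sequence of events", "why did it fail", and "which team owns it" can be answered by searching on those fields.

```python
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO
)
log = logging.getLogger("batch")


def format_event(job, run_id, event, **fields):
    """Build one key="value" log line describing a batch job event."""
    parts = {"job": job, "run_id": run_id, "event": event, **fields}
    # Escape embedded double quotes so each value stays a single field.
    return " ".join(
        '{}="{}"'.format(key, str(value).replace('"', '\\"'))
        for key, value in parts.items()
    )


def run_step(job, run_id, step, action, owner_team="unknown"):
    """Run one step of a batch job and log its start, failure, or completion."""
    log.info(format_event(job, run_id, "step_started", step=step))
    try:
        action()
    except Exception as exc:
        # Record what failed, why, and who to contact, then re-raise.
        log.error(
            format_event(
                job, run_id, "step_failed",
                step=step,
                error=type(exc).__name__,
                detail=str(exc),
                owner_team=owner_team,
            )
        )
        raise
    log.info(format_event(job, run_id, "step_completed", step=step))
```

Calling run_step("load_orders", "2024-01-31", "transform", some_function, owner_team="etl") would emit a step_started line, followed by either step_completed or step_failed with the exception type and message. The job and step names here are hypothetical.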

How did we reduce resolution time?

Here are some use cases where the incident resolution team started leveraging Splunk for faster resolution. At the end of the month, when we compared the mean-time-to-resolution metric, it had improved by over 40%.

Use case	Our approach
Analyzing batch failure patterns	Extract patterns or useful information using Splunk and automate the following: getting the list of batch jobs in context; running failure analysis that provides insight into the cause of failure
Visualization to find anomalies	Use dashboards to observe anomalies and identify the candidates causing the failures
Notifying the team based on events	Build notifications such as: notifying users through Splunk Alerts whether a process is running or has stopped; generating an email notification when a service is restarted; notifying users when an unwanted exception that needs immediate attention has occurred in a process, such as Job B running longer than its average execution time or a file being in a locked state for the last hour
Proactive monitoring	Use daily dashboards to observe trends and spend less time on monitoring by answering questions such as: What percentage of the batch has completed? What is the expected completion time of the batch? How many jobs have completed so far? What has the trend of batch completion been over the last 6 months? How has the batch done this month compared to the previous month? How many jobs have failed each day in the last week? How many times was the SLA breached in the last 15 days?
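
The logic behind two of these checks can be sketched in a few lines. In our case the checks ran as Splunk Alerts and dashboards; the Python below only illustrates the reasoning and is not the actual alert definition. The function names, the status values ("completed", "failed", "running"), and the tolerance factor of 1.5 are assumptions made for the example.

```python
from statistics import mean


def is_running_long(elapsed_minutes, past_durations, tolerance=1.5):
    """Return True when a job has run longer than tolerance x its average.

    past_durations holds the durations (in minutes) of earlier runs.
    With no history there is no baseline, so nothing is flagged.
    """
    if not past_durations:
        return False
    return elapsed_minutes > tolerance * mean(past_durations)


def batch_progress(job_statuses):
    """Summarise a batch given a mapping of job name to status."""
    total = len(job_statuses)
    completed = sum(1 for s in job_statuses.values() if s == "completed")
    failed = sum(1 for s in job_statuses.values() if s == "failed")
    percent = 100.0 * completed / total if total else 0.0
    return {
        "completed": completed,
        "failed": failed,
        "percent_complete": percent,
    }
```

For example, a job whose earlier runs took 20, 30, and 25 minutes has an average of 25 minutes, so with the assumed tolerance it would be flagged once it has been running for more than 37.5 minutes.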

To conclude, an out-of-the-box platform like Splunk or a similar tool can help, but the onus lies on putting the right framework and approach in place to address the challenge at hand. Doing things incrementally and tracking progress on a continuous basis is another important aspect. As the saying goes:

If you can’t measure it, you can’t improve it.
