Job Description
Key Responsibilities
Observability, SRE, and DevOps roles with proven expertise across infrastructure and application-level reliability. Dynatrace, ELK, Splunk, and PagerDuty; SLI/SLO frameworks. Azure Kubernetes Service, Terraform, and Azure managed services.
What will you do
Design and implement observability-as-code solutions using Terraform to deploy monitoring pipelines, dashboards, and alerting strategies across distributed systems.
Drive observability improvements leveraging industry-leading tools (Dynatrace, ELK, Splunk, PagerDuty) to achieve real-time performance insights and comprehensive system visibility.
Instrument applications for end-to-end observability, implementing distributed tracing, metrics collection, and log aggregation across Node.js and .NET microservices and event-driven architectures.
Troubleshoot complex incidents in production environments, diagnosing root causes across multiple service layers, databases, caches, and APIs under load using SLI/SLO frameworks.
Investigate and resolve Azure Kubernetes Service (AKS) infrastructure issues, ensuring reliability and scalability of containerized workloads with deep proficiency in Terraform and Azure managed services (SQL MI, Redis, Functions, Event Grid).
Translate business requirements into observable, resilient systems that meet defined SLIs/SLOs and drive reliability improvements.
Automate operational tasks to reduce toil and improve system resilience through infrastructure-as-code and CI/CD best practices.
Lead incident response and remediation for mission-critical systems, conducting blameless postmortems and building resilience through chaos engineering and tabletop exercises.
Collaborate cross-functionally with development, platform, and business teams to improve service availability, scalability, and operational excellence.
What do you need to succeed
Must-have
8 years of hands-on experience in observability, SRE, or DevOps roles with proven expertise across infrastructure and application-level reliability.
Deep expertise in observability tooling (Dynatrace, ELK, Splunk, and PagerDuty); demonstrated understanding of observability principles (instrumentation, correlation IDs, SLI/SLO frameworks).
Advanced proficiency with Azure Kubernetes Service (AKS), Terraform, and Azure managed services (SQL MI, Redis, Functions, Event Grid); proven ability to design and implement infrastructure-as-code solutions.
Strong hands-on experience instrumenting applications for comprehensive observability: distributed tracing, metrics collection, and log aggregation across Node.js and .NET applications in microservices and event-driven architectures.
Proven troubleshooting expertise in distributed systems, diagnosing root causes across multiple service layers, databases, caches, and APIs in production environments.
Excellent incident management skills; hands-on experience with PagerDuty and ServiceNow; ability to resolve high-severity incidents rapidly and conduct effective root cause analysis.
Knowledge of incident, problem, and change management processes, including SRE principles, blameless postmortems, and chaos engineering practices.
Exceptional communication and leadership skills to coordinate across business and IT teams; ability to lead remo