iTRiders

Job Details

Site Reliability Engineer (SRE)

J&M Group Inc (Urgent Hiring)

Job Description

Key Responsibilities

Observability, SRE, and DevOps roles with proven expertise across infrastructure and application-level reliability. Dynatrace, ELK, Splunk, and PagerDuty; SLI/SLO frameworks. Azure Kubernetes Service, Terraform, and Azure managed services.

What will you do

Design and implement observability-as-code solutions using Terraform to deploy monitoring pipelines, dashboards, and alerting strategies across distributed systems.

Drive observability improvements leveraging industry-leading tools (Dynatrace, ELK, Splunk, PagerDuty) to achieve real-time performance insights and comprehensive system visibility.

Instrument applications for end-to-end observability, implementing distributed tracing, metrics collection, and log aggregation across Node.js and .NET microservices and event-driven architectures.

Troubleshoot complex incidents in production environments, diagnosing root causes across multiple service layers, databases, caches, and APIs under load using SLI/SLO frameworks.

Investigate and resolve Azure Kubernetes Service (AKS) infrastructure issues, ensuring reliability and scalability of containerized workloads with deep proficiency in Terraform and Azure managed services (SQL MI, Redis, Functions, Event Grid).

Translate business requirements into observable, resilient systems that meet defined SLIs/SLOs and drive reliability improvements.

Automate operational tasks to reduce toil and improve system resilience through infrastructure-as-code and CI/CD best practices.

Lead incident response and remediation for mission-critical systems, conducting blameless postmortems and building resilience through chaos engineering and tabletop exercises.

Collaborate cross-functionally with development, platform, and business teams to improve service availability, scalability, and operational excellence.

What do you need to succeed

Must-have

8 years of hands-on experience in observability, SRE, or DevOps roles with proven expertise across infrastructure and application-level reliability.

Deep expertise in observability tooling (Dynatrace, ELK, Splunk, and PagerDuty); demonstrated understanding of observability principles (instrumentation, correlation IDs, SLI/SLO frameworks).

Advanced proficiency with Azure Kubernetes Service (AKS), Terraform, and Azure managed services (SQL MI, Redis, Functions, Event Grid); proven ability to design and implement infrastructure-as-code solutions.

Strong hands-on experience instrumenting applications for comprehensive observability: distributed tracing, metrics collection, and log aggregation across Node.js and .NET applications in microservices and event-driven architectures.

Proven troubleshooting expertise in distributed systems, diagnosing root causes across multiple service layers, databases, caches, and APIs in production environments.

Excellent incident management skills; hands-on experience with PagerDuty and ServiceNow; ability to resolve high-severity incidents rapidly and conduct effective root cause analysis.

Knowledge of incident, problem, and change management processes, including SRE principles, blameless postmortems, and chaos engineering practices.

Exceptional communication and leadership skills to coordinate across business and IT teams; ability to lead remo

Job Overview

Job Type: Contract
Work Mode: Hybrid
Deadline: Apply by Jun 04, 2026
Job Location: Toronto
Category: Engineering & Infrastructure