Build your career at Paradigm!

Senior Director, Site Reliability Engineering & Cloud Operations

Location: United States

Paradigm is a software company transforming the way the residential construction and building product industries operate across the globe. We are looking for a Senior Director, Site Reliability Engineering & Cloud Operations to help revolutionize these industries.

We’re looking for a forward-thinking leader to drive the modernization, reliability, and automation of our Azure-based platform. The Senior Director of SRE & Cloud Operations will lead both the modernization of our cloud and container infrastructure and the day-to-day operational execution of our customer environments — ensuring service uptime, performance, and continuous improvement of SLAs and SLOs. This role sits at the intersection of platform engineering, DevOps, and reliability leadership, with accountability for keeping existing customer systems running flawlessly while evolving the company’s cloud architecture toward a fully automated, containerized, and observable future.

What You Will Do:

  • Lead the shift toward containerized, microservice, and cloud-native architectures using Azure Kubernetes Service (AKS) and complementary Azure services.
  • Redefine infrastructure as software — fully declarative, automated, and policy-driven (Terraform, GitOps).
  • Drive modernization of compute, networking, and deployment models to increase scalability, reliability, and developer autonomy.
  • Collaborate with Architecture, Security, and Product Engineering to embed reliability, automation, and infrastructure alignment into new service designs, ensuring they support business priorities.
  • Design and operate production-grade Kubernetes infrastructure supporting diverse workloads, including AI/ML models, data processing, and traditional applications.
  • Optimize clusters for mixed compute requirements: CPU-intensive, memory-intensive, and GPU-accelerated workloads.
  • Oversee operations and execution of existing customer infrastructure, ensuring SLAs are consistently met or exceeded and SLOs are continuously defined, tracked, and improved across key systems.
  • Drive a proactive approach to reliability, scaling, and performance across all hosted environments.
  • Implement robust incident, change, and problem management processes that balance stability with speed.
  • Establish KPIs for service health, uptime, and efficiency — measured and improved via automation and observability data.
  • Partner with customer success, engineering, and support teams to ensure seamless communication and resolution of production issues.
  • Build a platform layer that enables developers to deploy, scale, and monitor services through self-service tools.
  • Champion end-to-end automation — provisioning, deployment, scaling, failover, remediation, and reporting.
  • Ensure all infrastructure and operations are codified and repeatable, minimizing manual work and configuration drift.
  • Establish and enforce policy-as-code and compliance automation to maintain consistent governance across environments.
  • Lead a comprehensive observability transformation centered on Datadog, expanding its use across infrastructure, applications, and business metrics.
  • Build a unified telemetry strategy (metrics, traces, logs, RUM, synthetics) to enable proactive detection and intelligent incident prevention.
  • Develop meaningful dashboards, SLO-based alerting, and service-level insights for both engineering and leadership visibility.
  • Partner with product teams to make observability a core part of development workflows, ensuring every new service is instrumented from day one.
  • Use data from Datadog and other telemetry systems to inform reliability improvements, performance tuning, and capacity planning.
  • Embed SRE principles (SLOs, SLIs, error budgets, and blameless postmortems) across all product teams.
  • Lead incident response programs and post-incident review processes to identify systemic improvements and reduce MTTR significantly.
  • Implement chaos engineering and resilience validation practices across environments.
  • Foster a learning culture around reliability, automation, and service ownership.
  • Lead globally distributed teams across SRE, DevOps, and Cloud Operations disciplines.
  • Define and execute the multi-year roadmap for reliability, observability, and automation maturity.
  • Ensure the organization is always ready to support customer commitments at enterprise scale.
  • Develop a fully containerized and automated platform using IaC and GitOps principles.
  • Drive efforts to reduce manual operational interventions by 80%+ through automation and AI-driven observability.
  • Ensure clear visibility into service health, reliability trends, and customer impact metrics.

What You Need to Succeed:

  • 12+ years in SRE, DevOps, or Cloud Infrastructure leadership, with 5+ years leading large or distributed teams.
  • Deep operational experience in Azure environments — AKS, App Services, Service Bus, Event Grid, Azure Monitor, etc.
  • Proven track record overseeing customer-facing production infrastructure and achieving measurable reliability gains.
  • Experience driving modernization and automation in hybrid or multi-cloud environments.
  • Strong understanding of SRE frameworks, operational governance, and performance engineering.
  • Technical expertise in the following:
    • Kubernetes (AKS) and container orchestration at scale.
    • Terraform for infrastructure as code.
    • Azure-native observability tools and Datadog APM, RUM, and synthetic monitoring.
    • CI/CD systems such as GitHub Actions, Azure DevOps, or ArgoCD.
    • Advanced automation tooling — policy as code, GitOps workflows, and event-driven remediation.
  • Systems-level thinker with a bias for execution and measurable outcomes.
  • Strong communicator who can influence across engineering, product, and executive levels.
  • Deep belief in automation, data, and self-service as enablers of reliability.
  • Builder of high-performing teams and scalable operational frameworks.

Ready to join? Apply now at MyParadigm.com/careers/

#Paradigm