Senior Site Reliability Engineer/Platform Engineer

Technology · Remote

Description

We have a new urgent request from one of our strategic clients based in the U.S.


They are looking for a highly hands-on Senior Site Reliability Engineer/Platform Engineer to improve the reliability, scalability, and operational maturity of their technology environment.


This is not solely an infrastructure administration role. The person will combine strong systems and cloud/platform expertise with software engineering skills to design, build, and automate the services that support the engineering organization. A key part of the role will be connecting and enhancing observability systems across the environment: building integrations, automation, dashboards, alerting workflows, and reliability tooling that give teams actionable visibility into system health and performance.


You will partner closely with engineering and infrastructure stakeholders to strengthen platform reliability, reduce operational friction, improve incident response, and establish scalable foundations for future product and AI initiatives.


Key Responsibilities

  • Build, maintain, and automate platform capabilities that improve system reliability, scalability, and developer productivity.
  • Develop code, scripts, integrations, and internal tooling to connect observability, monitoring, alerting, and incident-management systems.
  • Design and evolve observability practices across logs, metrics, traces, dashboards, alerting, and service health reporting.
  • Improve CI/CD, deployment automation, environment consistency, and operational workflows through Infrastructure as Code and automation.
  • Own reliability-focused initiatives including incident response, root-cause analysis, capacity planning, disaster recovery, backup strategy, and service resilience.
  • Partner with software engineers to establish SRE standards, production readiness practices, and service-level objectives.
  • Support and modernize core infrastructure, including cloud, virtualized, networked, and on-premises environments where applicable.
  • Identify repetitive operational work and proactively replace it with scalable, maintainable automation.

Requirements

What We’re Looking For

  • Strong software engineering experience, ideally with Python, Go, JavaScript/TypeScript, or a similar language used for automation and integrations.
  • Deep experience with cloud/platform engineering, Infrastructure as Code, CI/CD, containers, and production operations.
  • Hands-on experience with observability tooling such as Datadog, Grafana, Prometheus, ELK/OpenSearch, New Relic, Splunk, or similar platforms.
  • Experience building monitoring integrations, alerting workflows, dashboards, operational APIs, or internal developer tools.
  • Strong understanding of SRE principles, incident management, reliability engineering, and operational best practices.
  • Ability to operate independently, set technical direction, and work across both infrastructure and software engineering teams.