HAYWARD HAWK is working with a long-standing client to find an SRE to join their expanding team.
Responsibilities:
Collaborate with the Network Automation Team to develop and deploy infrastructure for a new automation platform.
Implement Site Reliability Engineering (SRE) principles focusing on measurement (SLI/SLO/SLA), reduction of manual tasks, and reliability modeling.
Define and monitor metrics to drive data-based decisions aimed at improving availability, reliability, and operational speed.
Create, maintain, and refine SLO and SLI baselines for network, system, and application performance.
Support go/no-go decision-making processes, validation/verification, and review of current and upcoming products/services.
Conduct proactive data analysis and testing to ensure optimal performance of production applications and services.
Troubleshoot and resolve business-impacting issues in collaboration with internal stakeholders.
Manage escalations, incident response, and root cause analysis, and conduct blameless postmortems.
Participate in a 24x7 on-call rotation.
Qualifications:
Minimum of 3 years of experience with cloud-, web-, or CDN-scale infrastructure.
Proficiency in Python and Go; experience with C/C++ is advantageous.
In-depth knowledge of Linux systems, network programming, and protocols such as TCP, UDP, DNS, TLS/SSL, and HTTP.
Familiarity with BGP and Anycast routing is a plus.
Hands-on experience with DevOps practices and tools, including Infrastructure as Code (Ansible/SaltStack), CI/CD (GitLab, Jenkins, Git), and monitoring/visualization tools (Prometheus, Grafana).
Exposure to big data and data store technologies such as NoSQL and relational databases (RDBMS), Redis, Elasticsearch, and Kafka.
Experience with containerization and container orchestration (Docker, Kubernetes).
Skilled in building and analyzing telemetry data, models, data pipelines, and UI visualizations.
Proven experience in software development, troubleshooting, and monitoring of large-scale distributed systems.
Adherence to software engineering best practices, standards, and the software development life cycle.
Knowledgeable about Agile software development methodologies.
Strong collaboration, communication, and documentation skills, with a demonstrated ability to work across functional teams.
Bachelor's or Master's degree in computer science, engineering, or a related technical field, or equivalent experience.
For more information, please contact Alice Armstrong at Hayward Hawk.
Skills: Site Reliability Engineering, Python, CI/CD, Linux