Key Qualifications
• Working understanding of scaling, capacity planning and disaster recovery.
• Incident, change & problem management experience.
• Strong background in monitoring and logging for large-scale platforms, e.g. Nagios, Prometheus, Splunk, Icinga.
• Strong emphasis on SRE as an engineering function with a focus on architecture, design & automation.
• Familiarity with configuration and deployment management (scripting, visualisation, AWS, Unix, Java, databases, Kubernetes, Docker, etc.).
• Prior experience as an operations manager, SRE, system administrator, or in a similar role.
Job Description:
• This position forms part of a team operating global services, handling the requests of hundreds of millions of customers. This kind of scale presents unique challenges.
• SRE teams support the full infrastructure stack, from individual API performance to network traffic management. Responsibilities will be both broad and deep.
• This role will be predominantly operational, focused on improving & supporting front-line SRE operations. The focus will centre on operational readiness, resiliency & quality standards. In addition, there will be the opportunity to contribute to & define exciting, scalable reliability engineering projects.
• Are you ready to step into your next challenge? This position requires a mix of strategic engineering and design along with hands-on technical project work. You can help craft the future of how we build and run our services, with a huge impact on a global scale. The scope will also require the ability to go deep technically while keeping sight of higher-level business and product goals.
• This is an enthusiastic & highly driven team with a diverse set of experiences and skills. Our customers count on us to provide extraordinary availability, scalability, and security for services. Knowledge of SRE operations & what it takes to run services at enterprise scale with a high degree of operational sustainability will be required.