Our client is an exciting start-up formed by like-minded individuals who believe in a powerful combination of human principles and machine intelligence. They use automation to help customers manage critical issues: protecting their data and business, improving digital engagement with customers, and monitoring business risk.
As a Site Reliability Engineer, you will debug complex issues, dissect and solve problems quickly, and respond to and resolve incidents.
All in a day’s work
- Build new tools and techniques to automate human-intensive tasks.
- Be immersed in a culture of CI by delivering scalable workflow automation solutions.
- Write and configure code to build and maintain tools for your team.
- Change the code and configuration of services to improve application reliability.
- Maintain common components (like CI/CD, monitoring, IAM or VPC configurations) built on top of AWS, Azure & GCP.
- Operate distributed systems at scale and improve the reliability of applications and the business.
- Maintain shared services (such as Kubernetes clusters).
- Support and plan systems that have a reliability-oriented feature set.
- Respond to and stabilize incidents when they occur, and prevent them from recurring.
- Share best practices, tools, and tips.
Our ideal candidate
- A strong desire to develop and apply your knowledge to solve real-world problems.
- Passionate about building and running some of the largest and most complex platforms and systems.
- Deep systems knowledge with a focus on all infrastructure components, plus related skills such as software development, workflow optimisation, and system administration.
- Prior experience debugging code, operating a network, building hardware, etc.
- Experience designing and coding systems that can distribute a package in parallel to N servers.
- Experience working with AWS, GCP, Azure, Elastic Stack, Kafka, Kubernetes, Calico, Terraform, GitHub, GitLab, ArgoCD, Docker, Jenkins X or Thanos at scale.
- Ability to set up service monitoring that identifies problems before they start, and alerting that catches issues before you get a wave of customer support tickets.
- Strong awareness and the ability to think “outside the box”.
- Pays attention to detail while not losing focus on the bigger picture.
- Strong communication skills and a liking for work in a fast-paced environment with a tight-knit team.