A leading global financial firm is looking for a Senior Site Reliability Engineer who will be responsible for keeping all user-facing services and other production systems running smoothly.
The Satisfaction Engineering Team exists to ensure and deliver satisfaction across the company's solutions and services, and the reliability of our service offerings is the foundation of customer satisfaction and trust in our brand.
You will be a problem-solver, with the following responsibilities:
- Be on a PagerDuty rotation to respond to incidents and provide support for service engineers with customer incidents.
- Use your on-call shift to prevent incidents from happening.
- Document actions taken, so your findings turn into repeatable actions, and then into automation.
- Design, build and maintain core infrastructure pieces that enable the company to support hundreds of thousands of concurrent users.
- Debug production issues across services and levels of the stack.
- Mentor interns and intermediate SREs in all areas, and other SREs in your area of deep knowledge.
- Contribute improvements to the codebase to resolve issues.
- Identify significant projects that result in substantial cost savings or revenue.
- Identify changes for the product architecture from the reliability, performance and availability perspective with a data-driven approach.
- Proactively plan for efficiency and capacity to set clear requirements and reduce system resource usage, making assets cheaper to run for all our customers.
- Identify parts of the system that do not scale, apply immediate palliative measures and drive long-term resolution of these incidents.
- Identify Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.
- Know a domain really well and radiate that knowledge through recorded demos, discussions in DNA meetings, or Incident Reviews.
- Perform and run blameless RCAs on incidents and outages, aggressively looking for answers that can prevent the incident from ever happening again.
- Set an example for the team of SREs through positive and inclusive leadership and discussion of work.
- Be able to de-escalate conflicts inside the team.
【会社概要 | Company Details】
Our client is a global consulting firm that established a Japanese corporation in 2014. They have strengths in M&A and business strategies, boasting an error rate close to zero.
【就業時間 | Working Hours】
9:15 - 17:15（Mon - Fri）
【休日休暇 | Holidays】
Saturday, Sunday, and National Holidays, Year-end and New Year Holidays, Paid Holidays, Other Special Holidays
【待遇・福利厚生 | Services / Benefits】
Full social insurance (employees' pension insurance, health insurance, workers' accident compensation insurance, employment insurance), transportation allowance, retirement package, cafeteria plan, no smoking indoors in principle (outdoor smoking area available), etc.
- Experience in the IT sector
- Experience as a Cloud, DevOps or Reliability engineer
- Work closely with engineering teams to create and improve containerized technologies
- Able to collaborate in a global team environment, actively engage subject matter experts, and follow through on commitments
- Strong problem-solving (debugging) skills: the ability to dissect, divide and conquer platform problems and find the root cause
- Knowledge of Microsoft Azure and / or AWS and / or GCP is a must (Azure preferred)
- Scripting knowledge in PowerShell and / or Python
- Version control experience (Git)
- Knowledge of container orchestration technologies (Kubernetes)
- Knowledge of container technologies (Docker)
- CI/CD knowledge