Global FMCG Company is hiring a Principal Site Reliability Engineer. If you are tired of being one of the many, and would like to become one of the few, join them in their Digital Transformation journey. Apart from the excitement of architecting the future, as a Software Engineer, you will be responsible for providing a Product Team with technical knowledge, guidance and deliverables that relate to designing, delivering and supporting bleeding-edge applications and their components, and interfacing them with the external world.
The Site Reliability Engineer on the SRE team will be responsible for proactive and continuous monitoring and alerting across the full technology stack, as well as capacity and availability management, spanning all technologies and interdependencies of the supported digital platform. The role has a deep focus on continuity, ensuring appropriate disaster recovery practices are in place. It also involves helping other teams with root cause analysis for major incidents and ensuring that a preventative approach to reliability (e.g. Simian Army techniques) is in place to prove the resilience of the digital platform. Last but not least, the role covers the constant automation of the supported processes, queues, and technologies to ensure operational perfection, efficiency, and the team’s ability to invest in the future.
This position involves constant interaction with the other teams of the IS and Digital organization, so it is essential that the role holder is a highly collaborative individual capable of managing conflict, presenting to junior, mid-level and senior audiences, and driving results through others.
‐ Release and Change management to ensure the delivered solutions are ready for operation, by governing the test process, risk management and risk mitigation.
‐ Diving into the telemetry architecture, recommending improvements to it, and revising software designs to mitigate failures in any part of the stack.
‐ Designing and implementing the continuous monitoring and alerting processes in terms of both telemetry scope and toolbox.
‐ Identifying the resilience maturity of each product service, ensuring teams are focused on building resilience into the code.
‐ Building disaster recovery and continuity plans for each product, component and service (or aligning these depending on the vendor).
‐ Ensuring that any implementations of new systems and services take the telemetry standards and aspects into account.
‐ Applying quick automation practices to the key monitoring & alerting technologies to automate operational tasks and reduce the impact of mistakes.
‐ Defining and proactively monitoring the technology stack (top to bottom), looking for issues and fine-tuning tooling to catch issues before they happen.
‐ Driving adoption of a multi-cloud, multi-vendor strategy for delivering technology services.
‐ Implementing and maintaining the Simian Army practices to continuously detect and surface key issues.
Global FMCG Company. This is a great opportunity to work in a diverse and international environment in Japan.
9:00 - 17:30（Mon - Fri）
Saturday, Sunday, and National Holidays, Year-end and New Year Holidays, Paid Holidays, Other Special Holidays
【Services / Benefits】
Social insurance, Transportation Fee
‐ Self-starter with the ability to appropriately prioritise and plan complex work in a rapidly changing environment
‐ Good communication and problem-solving skills
‐ Ability to analyze technical architecture of complex and highly scalable solutions
‐ Good understanding of performance/availability monitoring principles (including proactive and predictive monitoring), with the ability to design monitoring layers for both live and new solutions
‐ Firm understanding of application resiliency principles
‐ Excellent knowledge of telemetry tools (New Relic, Datadog, Kibana, ELK, CloudWatch, Splunk, etc.)
‐ Experience operating application suites in a public cloud, preferably AWS
‐ Experience working within an Agile environment, delivering faster and at a higher quality level
‐ Willingness to travel