Sr Service Reliability Engineer - London, N1C 4AG

Universal Music Group
London


Job Summary:

We are UMG, the Universal Music Group. We are the world’s leading music company. In everything we do, we are committed to artistry, innovation and entrepreneurship. We own and operate a broad array of businesses engaged in recorded music, music publishing, merchandising, and audiovisual content in more than 60 countries. We identify and develop recording artists and songwriters, and we produce, distribute and promote the most critically acclaimed and commercially successful music to delight and entertain fans around the world.

As a key member of our Global Technical Operations team, you will be the ultimate escalation point and subject matter expert for all SRE operations. This is a senior technical role that requires a strategic mindset and deep expertise in System Reliability Engineering. By blending a software engineering mindset with operational expertise, you will engineer solutions that improve system reliability, automate complex processes, and reduce manual toil. You will not only resolve the most challenging technical issues but also drive the operational strategy for SRE implementation at UMG.

As a Site Reliability Engineer, you won't just be supporting systems; you'll be ensuring the services that connect artists and fans around the globe are always on.

Job Functions:

Key Responsibilities:

  • System Reliability & Performance:

      - Design, build, and maintain the availability, scalability, and performance of critical services.

      - Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution.

      - Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement.

  • Automation & Efficiency:

      - Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling.

      - Create and maintain scripts and custom code to support and enhance our operational toolset.

      - Support and optimize CI/CD pipelines to improve deployment speed and reliability.

  • Incident Management & Collaboration:

      - Participate in an on-call rotation to troubleshoot and mitigate production incidents.

      - Lead post-incident reviews and root cause analyses to implement lasting solutions.

      - Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle.

  • Act as the Final Escalation Point for SRE operations: Participate in resolving the most complex and critical incidents that other teams have been unable to resolve. Provide leadership during high-severity events, coordinating cross-functional teams to ensure rapid and effective resolution.

  • Develop Escalation Frameworks: Design, implement, and refine the escalation management process for the entire Global Technical Operations Center, ensuring that incidents are triaged, documented, and resolved efficiently.

  • Strategic Troubleshooting & Root Cause Analysis: Move beyond simple fixes to conduct deep-dive root cause analysis (RCA) for recurring, complex problems. Develop long-term solutions, including automation and architectural changes, to prevent future incidents.

  • Mentor & Uplevel the Team: Serve as a technical leader and mentor to junior engineers. Develop and lead training sessions on advanced security concepts, threat landscapes, and internal best practices to elevate the entire team's capabilities. Foster a culture of continuous learning and operational excellence within the team. Maintain and enhance knowledge of key technologies.

  • Architectural Collaboration: Partner with DevOps and Applications architects to influence and enforce standards. Ensure that new and existing systems are built on the principles of Infrastructure as Code and toil reduction.

  • Automation & Optimization: Identify opportunities for network automation, scripting, and tool development to streamline operational tasks and improve efficiency.

  • Documentation & Standards: Create and maintain comprehensive documentation for configurations, standard operating procedures (SOPs), and incident response protocols.

  • Communication & Stakeholder Management: Communicate effectively with technical and non-technical stakeholders, including senior management, regarding incident status, resolution plans, and identity or security issues. Build partnerships and trust with other information technology areas, vendor technical staff, and customers in the business units.

  • Make UMG the place to be: Mentor and genuinely lead the team in a way that attracts and retains the best talent. UMG is a place where everyone can bring themselves fully to work and thrive, and as a leader you are a key part of this.

  • Work outside standard business hours will occasionally be required.

Job Requirements:

Required Experience & Skills:

  • A strong background in systems administration (Linux/Windows) in a large-scale environment.

  • Proficiency in at least one programming language (e.g., Python, Go, Java).

  • Hands-on experience with a major cloud platform (AWS, GCP, or Azure), with a strong preference for AWS.

  • Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible).

  • Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace).

  • Proven analytical and problem-solving abilities with experience in a high-pressure environment.

  • Excellent communication skills and the ability to foster a collaborative team environment.

Preferred Experience & Skills:

  • Bachelor's degree in an IT-related field.

  • Experience managing large-scale, distributed systems for a global organization.

  • Familiarity with IT governance standards like ITIL.

  • Direct experience with ServiceNow for IT service management.

  • Knowledge of chaos engineering, resilience testing, and advanced capacity planning.

Posted 2025-12-24
