Service Reliability Eng - London, N1C 4AG, United Kingdom
Job Summary:
We are UMG, the Universal Music Group. We are the world’s leading music company. In everything we do, we are committed to artistry, innovation and entrepreneurship. We own and operate a broad array of businesses engaged in recorded music, music publishing, merchandising, and audiovisual content in more than 60 countries. We identify and develop recording artists and songwriters, and we produce, distribute and promote the most critically acclaimed and commercially successful music to delight and entertain fans around the world.
As a key member of our Global Technical Operations team, you will be responsible for the reliability, scalability, and performance of the critical systems that power a global enterprise. By blending a software engineering mindset with operational expertise, you will engineer solutions that improve system reliability, automate complex processes, and reduce manual toil. You will be an essential partner to our development, infrastructure, and security teams, driving a culture of resilience and continuous improvement across the organization.
As a Site Reliability Engineer, you won't just be supporting systems; you'll be ensuring the services that connect artists and fans around the globe are always on.
Job Functions:
Key Responsibilities:
System Reliability & Performance:
Design, build, and maintain critical services with a focus on availability, scalability, and performance.
Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution.
Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement.
Automation & Efficiency:
Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling.
Create and maintain scripts and custom code to support and enhance our operational toolset.
Support and optimize CI/CD pipelines to improve deployment speed and reliability.
Incident Management & Collaboration:
Participate in an on-call rotation to troubleshoot and mitigate production incidents.
Lead post-incident reviews and root cause analyses to implement lasting solutions.
Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle.
Job Requirements:
Required Experience & Skills:
A strong background in systems administration (Linux/Windows) in a large-scale environment.
Proficiency in at least one programming language (e.g., Python, Go, Java).
Hands-on experience with a major cloud platform (AWS, GCP, or Azure), with a strong preference for AWS.
Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible).
Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace).
Proven analytical and problem-solving abilities with experience in a high-pressure environment.
Excellent communication skills and the ability to foster a collaborative team environment.
Preferred Experience & Skills:
Bachelor's degree in an IT-related field.
Experience managing large-scale, distributed systems for a global organization.
Familiarity with IT governance standards like ITIL.
Direct experience with ServiceNow for IT service management.
Knowledge of chaos engineering, resilience testing, and advanced capacity planning.