Service Reliability Eng - London, N1C 4AG, United Kingdom
Job Summary:
We are UMG, the Universal Music Group. We are the world’s leading music company. In everything we do, we are committed to artistry, innovation and entrepreneurship. We own and operate a broad array of businesses engaged in recorded music, music publishing, merchandising, and audiovisual content in more than 60 countries. We identify and develop recording artists and songwriters, and we produce, distribute and promote the most critically acclaimed and commercially successful music to delight and entertain fans around the world.
As a key member of our Global Technical Operations team, you will be responsible for the reliability, scalability, and performance of the critical systems that power a global enterprise. By blending a software engineering mindset with operational expertise, you will engineer solutions that improve system reliability, automate complex processes, and reduce manual toil. You will be an essential partner to our development, infrastructure, and security teams, driving a culture of resilience and continuous improvement across the organization.
As a Site Reliability Engineer, you won't just be supporting systems; you'll be ensuring the services that connect artists and fans around the globe are always on.
Job Functions:
Key Responsibilities:
System Reliability & Performance:
Design, build, and maintain critical services for availability, scalability, and performance.
Develop and maintain robust monitoring, alerting, and observability systems (e.g., using AWS CloudWatch, Dynatrace) to ensure rapid issue detection and resolution.
Monitor infrastructure capacity and performance, providing analysis and suggestions for service delivery improvement.
Automation & Efficiency:
Drive the automation of repetitive operational tasks, including infrastructure provisioning, deployments, and scaling.
Create and maintain scripts and custom code to support and enhance our operational toolset.
Support and optimize CI/CD pipelines to improve deployment speed and reliability.
Incident Management & Collaboration:
Participate in an on-call rotation to troubleshoot and mitigate production incidents.
Lead post-incident reviews and root cause analyses to implement lasting solutions.
Partner with engineering and IT stakeholders to embed SRE best practices (SLOs, error budgets) into the design and development lifecycle.
Job Requirements:
Required Experience & Skills:
A strong background in systems administration (Linux/Windows) in a large-scale environment.
Proficiency in at least one programming language (e.g., Python, Go, Java).
Hands-on experience with a major cloud platform (AWS, GCP, or Azure), with a strong preference for AWS.
Solid understanding of networking, containers (Docker, Kubernetes), and Infrastructure as Code (e.g., Terraform, Ansible).
Experience with modern monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Splunk, Dynatrace).
Proven analytical and problem-solving abilities with experience in a high-pressure environment.
Excellent communication skills and the ability to foster a collaborative team environment.
Preferred Experience & Skills:
Bachelor's degree in an IT-related field.
Experience managing large-scale, distributed systems for a global organization.
Familiarity with IT governance standards like ITIL.
Direct experience with ServiceNow for IT service management.
Knowledge of chaos engineering, resilience testing, and advanced capacity planning.