Senior Observability & Telemetry Engineer - Radian Arc

Submer

London

Location & work modality: EMEA (remote)

Start: ASAP

Type of Contract: Permanent, full-time

About Radian Arc

Radian Arc, now part of InferX, Submer's AI cloud and GPU infrastructure platform, provides an infrastructure-as-a-service (IaaS) platform for running cloud gaming, artificial intelligence and machine learning applications inside telecommunication carrier networks. Our teams across the USA, Australia, Central Europe, Malaysia, Singapore and Japan offer telecom operators a GPU-based edge computing platform without the need for capital expenditure, facilitating low latency and improved economics for value-added services and the monetization of 5G investments.

What impact you will have

Mission: Design and build the observability platform that powers visibility, reliability, and performance insights for large-scale GPU cloud infrastructure as well as smaller edge deployments.

This role is responsible for designing and implementing key parts of the observability architecture across the platform, enabling engineering, operations, and customers to understand system behavior in real time across distributed AI workloads, GPU clusters, networking fabrics, storage systems, and edge inference environments.

You will design and operate low-latency, high-scale telemetry pipelines that collect, process, and analyze metrics, logs, and traces from infrastructure running across core datacenter clusters and smaller edge deployments. The platform you build will support internal operations, automated reliability mechanisms, and customer-facing observability experiences.

As a senior engineer, you will lead delivery of major observability initiatives, contribute to the evolution of telemetry standards and SLO implementation, and work with other teams to ensure observability is effectively integrated into the platform architecture from infrastructure to application layers.

You will collaborate closely with infrastructure, networking, storage, and platform engineering teams to provide clear visibility into performance bottlenecks, infrastructure degradation, and distributed workload behavior across both hyperscale GPU environments and smaller edge installations.

This role contributes directly to improving platform reliability by analyzing production telemetry, identifying systemic issues, and driving improvements in performance, efficiency, and operational stability across the stack.

What you’ll do

Observability Platform Architecture

Design and implement scalable telemetry pipelines for metrics, logs, and traces across distributed GPU infrastructure.

Architect observability systems capable of ingesting high-cardinality telemetry from thousands of nodes and services.

Build and operate telemetry storage systems optimized for large-scale time-series and event data.

Contribute to observability standards across services, including metrics, tracing instrumentation, logging, and SLO implementation.

Infrastructure and Platform Observability

Build visibility across compute, storage, and networking layers of the platform.

Instrument GPU clusters, inference workloads, and distributed training environments.

Detect infrastructure degradation such as:

GPU throttling,

Network congestion,

Storage latency,

Hardware degradation.

Implement telemetry pipelines for GPU, CPU, network, and storage performance metrics.

Customer-Facing Observability

Build dashboards and monitoring tools that expose system health and performance to both internal teams and customers.

Provide insights into workload performance including:

GPU utilization,

Storage throughput,

Network latency,

Distributed inference performance.

Develop performance analysis tools that help customers understand system bottlenecks.

Network and Infrastructure Telemetry

Develop and maintain network observability platforms.

Build telemetry collectors and exporters using Python or Go.

Ingest telemetry from infrastructure components including:

NVIDIA Cumulus Linux,

VyOS routers,

Citrix NetScaler / WAF.

Design telemetry ingestion pipelines using protocols such as:

gNMI,

SNMP,

Streaming telemetry.

Reliability Engineering

Design advanced alerting and anomaly detection systems.

Contribute to platform SLOs, SLIs, and reliability metrics.

Build automated detection of infrastructure anomalies.

Integrate observability signals with operational workflows and incident management systems.

Participate in on-call rotations supporting platform observability and telemetry infrastructure.

Cross-Team Collaboration

Partner with platform, networking, storage, and compute teams to instrument services.

Work closely with operations teams to improve monitoring and incident response.

Provide guidance and mentorship to engineers on observability best practices.

Promote good observability practices across teams and help engineers adopt effective instrumentation and monitoring patterns.

Technical Stack: Observability and telemetry technologies used across the platform include:

Observability Framework

Prometheus.

OpenTelemetry.

Grafana.

Distributed logging systems.

High-scale telemetry databases, such as ClickHouse or similar.

Hardware and Infrastructure Telemetry

Redfish / BMC telemetry.

IPMI.

Linux system metrics.

Hardware health monitoring and node lifecycle telemetry.

NVIDIA GPU Telemetry

NVIDIA DCGM.

DCGM Exporter.

NVML.

NVIDIA GPU Operator telemetry stack.

NVSwitch / NVLink telemetry.

AI Workload Telemetry

Distributed training telemetry.

Inference latency and throughput metrics.

NCCL communication health.

GPU synchronization latency.

KV-cache access latency for inference workloads.

Dataset loading and storage I/O performance.

Networking Telemetry

NVIDIA NetQ.

gNMI streaming telemetry.

SNMP.

Network flow telemetry.

RDMA / RoCE performance monitoring.

What you’ll need

Required Experience

Proven experience operating large distributed infrastructure platforms.

Strong background in observability systems and telemetry pipelines.

Experience building metrics, logging, tracing, alerting, and dashboards at production scale.

Strong programming skills in Go, Python, or Rust.

Experience with large-scale time-series data platforms.

Experience with large-scale GPU cloud platforms, HPC environments, or AI infrastructure.

Experience monitoring AI workloads such as training or inference clusters.

Infrastructure Knowledge

Deep understanding of distributed systems observability.

Familiarity with cloud-native infrastructure and practices, such as Kubernetes, automation, and CI/CD.

Experience operating observability systems for high-performance or large-scale environments.

Networking and Infrastructure Telemetry

Experience monitoring complex networking environments.

Familiarity with telemetry protocols such as gNMI, SNMP, and streaming telemetry.

Experience integrating network and system telemetry into centralized monitoring platforms.

Analytical Skills

Strong data analysis capabilities.

Ability to interpret complex telemetry signals and translate them into actionable insights.

Ability to diagnose performance issues across distributed systems.

What we offer

Attractive compensation package reflecting your expertise and experience.

A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.

You'll be part of a fast-growing scale-up with a mission to make a positive impact, offering an exciting career evolution.

Our job titles may span more than one job level. The actual base pay is dependent on a number of factors, such as transferable skills, work experience, business needs and market demands.

Our Inclusive Responsibility

Radian Arc is committed to creating a diverse and inclusive environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, veteran status, or any other protected category under applicable law.

Posted 2026-03-21
