2025 - Naveen Mareddy - Advanced Observability Strategies for Media Workflows at Netflix

Managing media workflows at Netflix scale is both thrilling and daunting. With millions of workflow executions across hundreds of types and over 500 million CPU hours consumed quarterly, costs can skyrocket, and encoding issues can disrupt the streaming experience. The challenge is immense: ensuring the timely delivery of high-resolution encodes, avoiding costly codec bugs, supporting last-moment redeliveries, and identifying bottlenecks before they drain compute resources. How do we navigate this complex system without running over budget or behind schedule? This is no longer just about fixing bugs faster; it is about observability driving real business value. Imagine instantly knowing the true cost of encoding each movie, or precisely tracking redelivery metrics that directly impact revenue.

We confronted these challenges directly and discovered that traditional observability tools, designed primarily for RPC-style services, were inadequate for media workflows. We needed observability at scale for asynchronous media workflows with long-running tasks. By embracing domain-specific events, distributed tracing, and consistent tagging, we achieved a comprehensive view of our users' workloads. We developed a stream-processing pipeline that consumes events from various parts of media workflows and collates them into actionable insights. This pipeline powers our observability platform, which handles billions of events in real time and supports rapid insights and on-the-fly aggregations.
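To make "domain-specific events" and "consistent tagging" concrete, here is a minimal Python sketch. It is not Netflix's implementation: the tag names, the `WorkflowEvent` type, and the `validate_tags` and `make_event` helpers are all hypothetical, chosen only to illustrate the idea of rejecting events whose tags fall outside an agreed taxonomy and carrying a trace ID so events from one workflow run can be correlated.

```python
from dataclasses import dataclass

# Hypothetical taxonomy: these tag names are illustrative assumptions.
REQUIRED_TAGS = {"workflow_type", "title_id", "team"}
ALLOWED_TAGS = REQUIRED_TAGS | {"codec", "resolution", "priority"}


def validate_tags(tags):
    """Return a list of taxonomy violations (empty if the tags conform)."""
    errors = []
    missing = REQUIRED_TAGS - set(tags)
    unknown = set(tags) - ALLOWED_TAGS
    if missing:
        errors.append("missing required tags: " + ", ".join(sorted(missing)))
    if unknown:
        errors.append("unknown tags: " + ", ".join(sorted(unknown)))
    return errors


@dataclass(frozen=True)
class WorkflowEvent:
    """A domain event; trace_id links all events of one workflow execution."""
    trace_id: str
    event_type: str
    tags: dict


def make_event(trace_id, event_type, tags):
    """Build an event, refusing tags that violate the taxonomy."""
    errors = validate_tags(tags)
    if errors:
        raise ValueError("; ".join(errors))
    return WorkflowEvent(trace_id, event_type, dict(tags))


event = make_event(
    "trace-123",
    "ENCODE_STARTED",
    {"workflow_type": "encode", "title_id": "title-A", "team": "encoding", "codec": "av1"},
)
```

Validating at the point of emission, rather than at query time, is one way to keep every downstream aggregation keyed on the same tag names.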

In this talk, we’ll cover the following aspects of how we built observability for long-running, distributed, and high-throughput systems, and how you can apply these learnings:

Near real-time insights: Learn how to process events promptly to meet the monitoring needs of low-latency encoding. Discover techniques to enable users to catch bugs sooner, limiting wasted compute on encodes known to fail.

Optimal rollup strategies: Explore how to consolidate millions of low-level events into hundreds of business insight events. We'll share techniques like pre-aggregation and event collapsing to minimize storage and efficiently support top queries.

Opinionated tagging taxonomy: Understand the importance of a defined tagging taxonomy and how it ensures all business metrics are expressed consistently within your observability platform.

Enabling ROI analysis for feature development: See how to facilitate long-pole analysis, gain insights into compute usage, and understand latency implications for better ROI analysis of your feature development.
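The rollup idea can be illustrated with a small Python sketch. This is an assumption-laden illustration, not the pipeline described in the talk: the `TaskEvent` fields, the status names, and the `collapse_events` and `rollup_by_title` functions are hypothetical. It shows event collapsing (keeping only the latest status per task while summing its CPU hours) followed by pre-aggregation (one summary row per title).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEvent:
    # All field names are hypothetical, chosen for illustration.
    workflow_id: str
    title_id: str
    task_id: str
    status: str       # e.g. "STARTED", "RETRYING", "SUCCEEDED", "FAILED"
    cpu_hours: float  # CPU hours reported by this event
    timestamp: int


def collapse_events(events):
    """Event collapsing: one event per task, with the latest status
    and the total CPU hours across all of that task's events."""
    latest = {}
    cpu = defaultdict(float)
    for e in sorted(events, key=lambda ev: ev.timestamp):
        key = (e.workflow_id, e.task_id)
        latest[key] = e          # later events overwrite earlier ones
        cpu[key] += e.cpu_hours
    return [
        TaskEvent(e.workflow_id, e.title_id, e.task_id, e.status, cpu[k], e.timestamp)
        for k, e in latest.items()
    ]


def rollup_by_title(collapsed):
    """Pre-aggregation: one business-level summary per title."""
    summary = defaultdict(lambda: {"tasks": 0, "failed": 0, "cpu_hours": 0.0})
    for e in collapsed:
        s = summary[e.title_id]
        s["tasks"] += 1
        s["failed"] += int(e.status == "FAILED")
        s["cpu_hours"] += e.cpu_hours
    return dict(summary)


events = [
    TaskEvent("wf1", "title-A", "t1", "STARTED", 0.0, 1),
    TaskEvent("wf1", "title-A", "t1", "RETRYING", 1.5, 2),
    TaskEvent("wf1", "title-A", "t1", "SUCCEEDED", 2.0, 3),
    TaskEvent("wf1", "title-A", "t2", "STARTED", 0.0, 1),
    TaskEvent("wf1", "title-A", "t2", "FAILED", 0.5, 4),
]
collapsed = collapse_events(events)
report = rollup_by_title(collapsed)
```

Here five low-level events collapse to two task records and then to a single per-title row, which is the shape a "cost per title" query would read directly instead of scanning raw events.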

By the end of this session, attendees will have concrete strategies to implement effective observability, transforming operations from reactive firefighting to proactive decision-making. Get ready to move from panic to clear, actionable insights, bringing clarity and control to your own large-scale systems!