SLOs That Product Managers Actually Understand

Building Nova AI Ops — The AI-native platform replacing 12 monitoring tools. Deep-dives on alert fatigue, runbook automation, post-mortems, Kubernetes, and everything operational engineers actually deal with at 3 AM.

The SLO Translation Problem

You define an SLO: 99.95% availability with p99 latency under 200ms. Engineering loves it. Product managers' eyes glaze over.

The problem isn't the SLO. It's how we communicate it.

Speaking Product Language

Translate technical SLOs into business impact:

Technical SLO:                  Product translation:
───────────────                 ──────────────────────
99.95% availability             "22 minutes of downtime per month max"
p99 latency < 200ms             "99 out of 100 requests finish in under 0.2s"
99.9% error-free transactions   "For every 1000 purchases, at most 1 fails"

Suddenly, the product manager can make informed tradeoffs.
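
The arithmetic behind these translations is small enough to script. Here is a minimal sketch, assuming a 30-day month; the function names are illustrative, not part of any library.

```python
def downtime_minutes(availability_pct, window_days=30):
    """Maximum downtime, in minutes, that an availability target allows."""
    window_minutes = window_days * 24 * 60
    return (1 - availability_pct / 100) * window_minutes


def failures_per(success_pct, attempts=1000):
    """Maximum failed attempts out of `attempts` that a success-rate target allows."""
    return (1 - success_pct / 100) * attempts


print(round(downtime_minutes(99.95), 1))  # 21.6, quoted as "22 minutes" above
print(round(failures_per(99.9), 1))       # 1.0
```

Rounding up to whole minutes is deliberate: "22 minutes" is easier to remember in a planning meeting than "21.6".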

The SLO Negotiation Framework

SLOs should be negotiated between engineering and product. Here's my framework:

Step 1: Measure Current Performance

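def current_performance(service, window_days=30):
    # query_prometheus is assumed to be a helper that runs an instant query
    # and returns the single result as a float between 0 and 1.
    # sum() collapses per-instance series so the ratio is one value per service.
    metrics = query_prometheus(f'''
        avg_over_time(
            (1 - sum(rate(http_errors_total{{service="{service}"}}[5m]))
             / sum(rate(http_requests_total{{service="{service}"}}[5m])))
            [{window_days}d:1h]
        )
    ''')
    minutes_per_month = 30 * 24 * 60
    return {
        'availability': f"{metrics * 100:.3f}%",
        # Downtime is normalized to a 30-day month regardless of window_days.
        'monthly_downtime_minutes': round((1 - metrics) * minutes_per_month, 1)
    }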

# Example output:
# {'availability': '99.847%', 'monthly_downtime_minutes': 66.1}

Step 2: Present the Cost-Reliability Tradeoff

Reliability Level | Monthly Downtime | Eng Investment | Feature Impact
──────────────────┼──────────────────┼────────────────┼───────────────
99.5%  (current)  | 3.6 hours        | Baseline       | None
99.9%  (good)     | 43 minutes       | +1 SRE         | -10% velocity
99.95% (great)    | 22 minutes       | +2 SREs        | -20% velocity
99.99% (amazing)  | 4.3 minutes      | +4 SREs        | -40% velocity

This makes the cost explicit. In my experience, most product teams choose a target between 99.9% and 99.95%.
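
The downtime column can be generated rather than hand-computed, along with how much downtime each target would require you to eliminate compared with today. This is a rough sketch assuming a 30-day month; `tradeoff_rows` and its field names are made up for illustration, and the staffing and velocity columns still have to come from your own estimates.

```python
WINDOW_MINUTES = 30 * 24 * 60  # 30-day month, matching the table above


def downtime_minutes(availability_pct):
    """Downtime in minutes per 30-day month at a given availability."""
    return (1 - availability_pct / 100) * WINDOW_MINUTES


def tradeoff_rows(current_pct, candidate_pcts):
    """For each candidate target, report allowed downtime and the cut needed versus today."""
    current = downtime_minutes(current_pct)
    rows = []
    for target in candidate_pcts:
        allowed = downtime_minutes(target)
        rows.append({
            "target": target,
            "allowed_minutes": round(allowed, 1),
            "minutes_to_eliminate": round(max(current - allowed, 0), 1),
        })
    return rows


for row in tradeoff_rows(99.5, [99.9, 99.95, 99.99]):
    print(row)
```

Framing the gap as "minutes we need to stop losing each month" gives product a concrete number to weigh against the engineering cost.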

Step 3: Define SLIs That Map to User Journeys

Don't define SLOs per service. Define them per user journey:

slo_definitions:
  - name: "Checkout Success"
    description: "Users can complete a purchase"
    sli: |
      successful_checkouts / total_checkout_attempts
    target: 99.9%
    window: 30 days
    owner: payments-team
    product_owner: "@sarah"  # quoted: YAML reserves a leading @

  - name: "Search Responsiveness"  
    description: "Search results appear quickly"
    sli: |
      search_requests{latency < 500ms} / total_search_requests
    target: 99.5%
    window: 30 days
    owner: search-team
    product_owner: "@mike"

  - name: "Login Reliability"
    description: "Users can log into their accounts"
    sli: |
      successful_logins / total_login_attempts
    target: 99.99%  # Higher because login blocks everything
    window: 30 days
    owner: identity-team
    product_owner: "@lisa"
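
Each of these SLIs is a ratio of good events to total events, so checking a journey against its target takes only a few lines. The sketch below is an assumption about how you might evaluate one definition; `evaluate_slo` and the event counts are hypothetical.

```python
def evaluate_slo(good_events, total_events, target_pct):
    """Compare a journey's measured SLI with its target and report error-budget usage."""
    if total_events == 0:
        raise ValueError("no events in the window; the SLI is undefined")
    sli_pct = 100 * good_events / total_events
    allowed_bad = (1 - target_pct / 100) * total_events  # the error budget, in events
    actual_bad = total_events - good_events
    # A 100% target leaves no budget, so any usage counts as unbounded.
    budget_consumed = actual_bad / allowed_bad if allowed_bad else float("inf")
    return {
        "sli_pct": round(sli_pct, 3),
        "met": sli_pct >= target_pct,
        "budget_consumed": round(budget_consumed, 2),
    }


# Hypothetical counts for the "Checkout Success" journey over 30 days.
print(evaluate_slo(good_events=199_850, total_events=200_000, target_pct=99.9))
```

With these made-up numbers the journey meets its 99.9% target while having spent three quarters of its error budget, which is the kind of nuance a plain pass/fail hides.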

Step 4: The Monthly SLO Review

We run a 30-minute monthly meeting with engineering leads AND product managers:

Agenda:
1. SLO status dashboard review (5 min)
   - Which SLOs are healthy? (green)
   - Which are at risk? (yellow)
   - Which were breached? (red)

2. Error budget impact (10 min)
   - Error budget consumed per SLO
   - Projected budget at current burn rate
   - Feature freeze triggers

3. Tradeoff decisions (15 min)
   - Feature X requires relaxing SLO Y — approve?
   - Incident Z consumed 40% of budget — invest in fix?
   - New service launching — what SLO target?
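
For the budget part of the agenda, a simple linear projection is usually enough to start the conversation. The sketch below assumes the burn rate stays constant for the rest of the window and that a freeze triggers once projected consumption reaches 100%; both are assumptions, and `budget_projection` is an illustrative name.

```python
def budget_projection(budget_consumed, days_elapsed, window_days=30, freeze_threshold=1.0):
    """Project error-budget use to the end of the window at the current burn rate.

    budget_consumed is the fraction of the window's budget already spent (0.4 = 40%).
    """
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    burn_per_day = budget_consumed / days_elapsed
    projected = burn_per_day * window_days
    if burn_per_day > 0:
        days_until_exhausted = round(max((1 - budget_consumed) / burn_per_day, 0), 1)
    else:
        days_until_exhausted = None  # nothing spent yet, so no depletion date
    return {
        "projected_consumption": round(projected, 2),
        "days_until_exhausted": days_until_exhausted,
        "freeze": projected >= freeze_threshold,
    }


# 40% of the budget gone after 10 days of a 30-day window.
print(budget_projection(budget_consumed=0.4, days_elapsed=10))
```

In this example the budget is on track to be overspent by the end of the window, so the freeze question belongs on the agenda now rather than after the breach.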

The Dashboard That Changed Everything

We built a single-page SLO dashboard with three views:

  1. Executive view: Traffic lights per user journey. Green/Yellow/Red.
  2. Product view: Error budget remaining + projected depletion date.
  3. Engineering view: Burn rate charts + contributing incidents.

Same data, different lens. Everyone gets what they need.
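
To make "same data, different lens" concrete, here is a small sketch that renders one SLO record three ways. The 25% "at risk" cut-off, the function names, and the sample values are all assumptions for illustration; a burn rate of 1.0x here means the budget would be exactly spent at the end of the window.

```python
def traffic_light(budget_remaining, at_risk_below=0.25):
    """Map the fraction of error budget left to a status colour.

    The 25% "at risk" cut-off is an assumed threshold, not a standard.
    """
    if budget_remaining <= 0:
        return "red"      # budget exhausted: the SLO was breached
    if budget_remaining < at_risk_below:
        return "yellow"   # still within target, but at risk
    return "green"


def render_views(journey, budget_remaining, days_until_exhausted, burn_rate):
    """Render one SLO record three ways, one per audience."""
    return {
        "executive": f"{journey}: {traffic_light(budget_remaining)}",
        "product": (
            f"{journey}: {budget_remaining:.0%} of error budget left, "
            f"about {days_until_exhausted:.0f} days until it runs out"
        ),
        "engineering": f"{journey}: burn rate {burn_rate:.1f}x",
    }


views = render_views("Checkout Success", budget_remaining=0.6,
                     days_until_exhausted=15, burn_rate=1.2)
for audience, text in views.items():
    print(f"{audience:12} {text}")
```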

Key Insight

SLOs are a communication tool first, a technical tool second. If only engineers understand your SLOs, they're not working.

If you want SLOs that automatically track, alert, and report in plain language, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo BSc · MSc · MBA · PhD Founder & CEO, Nova AI Ops. https://novaaiops.com
