3.1 Visibility, Debugging & Alerting

In the previous chapters, we focused on how payments move through the system and how different payment patterns can be modeled using Hyperswitch APIs. Once payments are live in production, however, processing them correctly is only half the story. The other half is knowing what is happening at all times: at the business level, at the application level, and at the infrastructure level.

This chapter introduces the three core pillars of Observability & Monitoring in Hyperswitch:

Visibility
Debugging
Custom Alerting

Together, these ensure that merchants can monitor performance, investigate issues, and respond proactively to system or business anomalies.

Visibility

Visibility answers a fundamental question: What is the current health of my payments and my system?

At a business level, Hyperswitch enables high-level as well as granular visibility across organization, merchant account, and profile levels. Merchants can monitor metrics such as authorization rate, capture rate, success rate, volume trends, connector performance, payment failure reasons, retry effectiveness, and revenue distribution.
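As an illustration of what a metric such as authorization rate per connector measures, here is a minimal sketch. It assumes one common definition (authorized attempts divided by total attempts). The record fields, status values, and connector names are hypothetical and are not the Hyperswitch analytics schema.

```python
from collections import defaultdict

# Hypothetical payment-attempt records; field names and statuses are
# illustrative assumptions, not the Hyperswitch analytics schema.
ATTEMPTS = [
    {"connector": "stripe", "status": "authorized"},
    {"connector": "stripe", "status": "failed"},
    {"connector": "adyen", "status": "authorized"},
    {"connector": "adyen", "status": "authorized"},
    {"connector": "stripe", "status": "authorized"},
]


def authorization_rate_by_connector(attempts):
    """Return {connector: authorized_attempts / total_attempts}."""
    totals = defaultdict(int)
    authorized = defaultdict(int)
    for attempt in attempts:
        totals[attempt["connector"]] += 1
        if attempt["status"] == "authorized":
            authorized[attempt["connector"]] += 1
    return {name: authorized[name] / totals[name] for name in totals}


rates = authorization_rate_by_connector(ATTEMPTS)
for connector, rate in sorted(rates.items()):
    print(f"{connector}: {rate:.1%}")
```

Grouping by connector is what makes a per-connector degradation visible even when the overall rate looks healthy.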

At an infrastructure level, visibility extends to system health and performance characteristics. Pod health metrics such as CPU utilization, memory saturation, request latency, and service availability are tracked through OpenTelemetry-based metrics collection and Vector aggregation within Hyperswitch. These metrics are then exported to Prometheus and visualized through Grafana dashboards. This provides real-time insight into system stability and supports capacity planning.

In addition to metrics, distributed traces collected via OpenTelemetry can be routed to Grafana Tempo, enabling end-to-end visibility of how a payment request traverses internal services from Router to Vault to Connectors.

Together, business metrics, infrastructure telemetry, and distributed tracing provide a complete operational picture of both payment performance and platform health.

Debugging

In Hyperswitch, debugging operates at multiple layers (application logs, structured events, and distributed traces), allowing teams to perform both first-level and deep technical investigations.

At the application layer, logs generated by Hyperswitch services are captured and routed to Grafana Loki. These logs allow developers to inspect request payloads (masked where necessary), connector interactions, transformation steps, error codes, retries, and internal state transitions. Using structured log queries, engineers can quickly filter by payment_id, payment_attempt_id, connector name, or error type to isolate the root cause.

Beyond logs, payment lifecycle events are emitted as structured events and streamed via Kafka. These events represent state transitions (for example, processing → requires_capture → succeeded). For debugging, they provide a chronological audit trail of what happened to a payment.
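To show how a chronological audit trail can be rebuilt from such events, the following sketch filters a batch of events by payment and orders them by timestamp. The event shape (`payment_id`, `status`, `at`) is an assumption made for illustration, not the actual Hyperswitch event schema, and the events are hard-coded rather than read from Kafka.

```python
from datetime import datetime

# Hypothetical lifecycle events as a consumer might receive them (out of order).
EVENTS = [
    {"payment_id": "pay_123", "status": "requires_capture", "at": "2024-01-01T10:00:02+00:00"},
    {"payment_id": "pay_123", "status": "processing", "at": "2024-01-01T10:00:00+00:00"},
    {"payment_id": "pay_456", "status": "processing", "at": "2024-01-01T10:00:01+00:00"},
    {"payment_id": "pay_123", "status": "succeeded", "at": "2024-01-01T10:00:05+00:00"},
]


def audit_trail(events, payment_id):
    """Return the statuses of one payment in chronological order."""
    own = [event for event in events if event["payment_id"] == payment_id]
    own.sort(key=lambda event: datetime.fromisoformat(event["at"]))
    return [event["status"] for event in own]


trail = audit_trail(EVENTS, "pay_123")
print(" -> ".join(trail))
# processing -> requires_capture -> succeeded
```

Sorting by event time rather than arrival order matters because events for the same payment can be consumed out of order.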

For deeper system-level debugging, OpenTelemetry traces are exported through the OpenTelemetry Collector and visualized in Grafana Tempo. Distributed tracing makes it possible to follow a single payment request across internal services (Router, Vault, Decision Engine, Connectors), helping identify latency bottlenecks or failure points across service boundaries.
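The idea of locating a latency bottleneck in a trace can be sketched as follows. The code computes each span's self time (its duration minus the durations of its direct children) and attributes it to a service. The span fields and values are hypothetical, and the calculation assumes child spans run one after another rather than in parallel.

```python
# Hypothetical spans of a single trace; durations are in milliseconds.
SPANS = [
    {"span_id": "a", "parent_id": None, "service": "router", "duration_ms": 480},
    {"span_id": "b", "parent_id": "a", "service": "vault", "duration_ms": 40},
    {"span_id": "c", "parent_id": "a", "service": "connector", "duration_ms": 390},
]


def self_time_by_service(spans):
    """Sum, per service, each span's duration minus its direct children's durations.

    Assumes children of a span do not overlap in time.
    """
    child_total = {}
    for span in spans:
        parent = span["parent_id"]
        if parent is not None:
            child_total[parent] = child_total.get(parent, 0) + span["duration_ms"]

    result = {}
    for span in spans:
        self_ms = span["duration_ms"] - child_total.get(span["span_id"], 0)
        result[span["service"]] = result.get(span["service"], 0) + self_ms
    return result


self_times = self_time_by_service(SPANS)
slowest = max(self_times, key=self_times.get)
print(self_times, "slowest:", slowest)
```

In this made-up trace the router's own work accounts for 50 ms of its 480 ms span, so the time is dominated by the connector call rather than by the router itself.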

By combining logs, lifecycle events, and distributed traces, Hyperswitch provides a layered debugging framework that supports both operational troubleshooting and long-term system analysis.

Custom Alerting

Monitoring becomes truly powerful when it shifts from reactive to proactive. This is where alerting plays a critical role.

Using Grafana’s alerting capabilities (backed by Prometheus metrics and business analytics), teams can configure alerts across multiple dimensions:

Infrastructure Alerts: Pod down, memory saturation, CPU spikes, high latency.
Application Alerts: Increase in 5xx errors, connector timeouts, webhook delivery failures, scheduler issues.
Business Alerts: Sudden dip in authorization rate, capture failures exceeding threshold, abnormal retry behavior, success rate degradation by connector.

Alerts can be routed to Slack, email, or webhook endpoints, ensuring operational teams are notified immediately when thresholds are breached.
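The logic behind a business alert such as a dip in authorization rate can be sketched as below. This is not Grafana alert-rule configuration. It only illustrates the threshold check and a simple `{"text": ...}` JSON body of the kind Slack incoming webhooks accept. The function names, the 85% threshold, and the minimum-volume guard are assumptions chosen for the example.

```python
import json


def authorization_rate_alert(authorized, total, threshold=0.85, min_volume=50):
    """Return an alert message if the rate is below threshold, else None.

    threshold and min_volume are illustrative defaults, not recommended values.
    """
    if total < min_volume:
        return None  # too few payments in the window to judge reliably
    rate = authorized / total
    if rate >= threshold:
        return None
    return (
        f"Authorization rate {rate:.1%} is below threshold "
        f"{threshold:.0%} ({authorized}/{total})"
    )


def slack_payload(message):
    """Serialize a message as a minimal Slack-style webhook body."""
    return json.dumps({"text": message})


message = authorization_rate_alert(authorized=160, total=200)
if message is not None:
    print(slack_payload(message))
```

The minimum-volume guard reflects a common design choice: during low-traffic periods a handful of failures can swing the rate sharply, so the check stays silent until the window contains enough payments.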

Observability & Monitoring in Hyperswitch is not a single dashboard. It is a layered system that connects application services to logs (Loki), metrics (Prometheus), traces (Tempo), events (ClickHouse), and visualization/alerting (Grafana), with business dashboards surfaced through the Control Center.


