4.1 Fault, Capacity and Change
After understanding payments, lifecycle states, and observability/monitoring, the next critical dimension of a production-grade payment system is how it behaves under stress, failure, and adversarial conditions.
Payments are not just about successful authorization; they are about reliability, scalability, and strong security guarantees. In this chapter, we explore how Hyperswitch is designed to operate safely at scale while maintaining high availability and protecting sensitive financial data.
The architecture is structured to ensure continuous availability, predictable performance, and safe failure handling, even under unpredictable traffic spikes, dependency outages, deployment updates, or distributed attacks.
The services that sit on the critical path of payment processing are engineered for extremely high reliability while keeping the design simple and resilient. The system supports instant failover and graceful degradation, meaning that even when something goes wrong, payments continue to flow wherever possible. Failures are carefully contained by introducing boundaries and limits between subsystems, reducing the blast radius and preventing one issue from cascading across the entire platform.
Broadly, system reliability can be challenged by three key factors:
Fault - unexpected failures caused by natural issues (like network instability) or malicious activity
Capacity - exhaustion of compute, memory, or artificial system limits
Change - disruptions introduced through code releases, configuration updates, or infrastructure modifications
Let’s now take a closer look at the architectural blueprint and understand how Hyperswitch addresses reliability challenges from an infrastructure perspective.

The production deployment blueprint of Hyperswitch is structured as a layered architecture spanning the Web Layer, Application Layer, and Storage Layer. At its core are the primary application components: the Router (within the Hyperswitch App), the Producer, the Consumer, the Control Center, the Web SDK, and the Card Vault, supported by dedicated services such as the encryption service.
Surrounding these core components are several non-functional but critical infrastructure services. These include monitoring services, event and log management systems, storage services (with master–replica configurations), cache services, encryption services, and load-balancing layers. While we previously discussed these components functionally in earlier chapters, we now examine them from an infrastructure and reliability standpoint.
The critical path of the Hyperswitch payment system extends through select components across the three layers:
Web layer: Firewall, External Load Balancer, Inbound Proxy, and Internal Load Balancer.
Application layer: The three critical applications are (i) hyperswitch-app (router), which is responsible for payment processing, (ii) hyperswitch-web, a JavaScript SDK delivering the payment experience, and (iii) hyperswitch-encryption-service, a lightweight, performant service for data encryption/decryption, key management, and key rotation.
Storage layer: The app storage service and the cache service for persistent and temporary storage.

Let us examine how Hyperswitch handles faults, capacity shifts, and change events in a structured and predictable way.
Fault
Functional critical path with partial degradation
Hyperswitch is designed around a clearly defined functional critical path — the minimal set of components required to successfully process a payment. These include the core payment processing services that directly handle authorization, routing, and state transitions.
To prevent complete system failure in the event of individual component outages, Hyperswitch implements a partial degradation strategy. The system includes a kill-switch mechanism that can automatically disable non-critical services when instability is detected, ensuring that core payment processing continues uninterrupted.
The components considered non-critical to the immediate payment authorization flow include:
hyperswitch-card-vault
hyperswitch-control-center
Monitoring services
Queuing services
Event and log management services
If any of these components experience failure or instability, they can be gracefully isolated without bringing down the core transaction processing path. This significantly reduces blast radius and prevents cascading failures.
Additionally, system behavior under fault scenarios is regularly validated through chaos testing, ensuring that failure modes are well understood and that the system responds predictably under stress conditions.
This approach ensures that even during partial outages, Hyperswitch prioritizes what matters most: continued payment processing with controlled degradation rather than complete downtime.
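To make the idea concrete, below is a minimal, self-contained Rust sketch of how a kill-switch might gate non-critical side effects while leaving the authorization path untouched. The type and function names are illustrative and are not the actual Hyperswitch implementation.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical kill-switch registry; the flags shown are illustrative.
struct KillSwitches {
    events_enabled: AtomicBool,
}

impl KillSwitches {
    fn new() -> Self {
        Self { events_enabled: AtomicBool::new(true) }
    }

    /// Flipped when instability is detected in the event pipeline.
    fn disable_events(&self) {
        self.events_enabled.store(false, Ordering::Relaxed);
    }

    fn events_enabled(&self) -> bool {
        self.events_enabled.load(Ordering::Relaxed)
    }
}

fn process_payment(switches: &KillSwitches) {
    // Core authorization work happens unconditionally: it is on the critical path.
    authorize();

    // Non-critical side effects are gated; skipping them degrades observability,
    // not payment processing.
    if switches.events_enabled() {
        publish_event("payment_authorized");
    }
}

fn authorize() { /* connector call, state transition, persistence */ }
fn publish_event(_name: &str) { /* push to the event pipeline */ }

fn main() {
    let switches = KillSwitches::new();
    switches.disable_events(); // e.g. the event pipeline is unhealthy
    process_payment(&switches); // the payment still completes
}
```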
Malicious intent
For malicious intent (such as a DDoS attack), Hyperswitch leverages AWS Shield Advanced as a real-time, ML-based defense system, which uses a combination of device fingerprinting, behavioural analysis, request patterns, and IP reputation to detect and block malicious sources attempting to flood the system with requests.
While the existing setup guarantees a 99.999% uptime for the critical path payment capabilities, the multi-region active-active setup is expected to provide the same uptime without degradation.
Capacity
Rate limiting and resource limiting
Every merchant account is provisioned with a predefined RPS (requests per second) limit, enforced at the layer above the application. This prevents the system from being overloaded by an unexpected surge of requests from a particular source.
In case of a surge of asynchronous events such as webhooks or batch refunds, the requests are automatically added to queues to avoid overloading the resources that should remain available for the critical path.
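As an illustration of the per-merchant RPS idea, the sketch below implements a simple fixed-window limiter keyed by merchant account. In the real deployment this enforcement sits above the application (for example in the proxy or load-balancing tier), and the limits used here are assumed values.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative fixed-window limiter keyed by merchant id.
struct MerchantRateLimiter {
    limits: HashMap<String, u32>,             // allowed requests per second
    windows: HashMap<String, (Instant, u32)>, // (window start, count in window)
}

impl MerchantRateLimiter {
    fn new(limits: HashMap<String, u32>) -> Self {
        Self { limits, windows: HashMap::new() }
    }

    fn allow(&mut self, merchant_id: &str) -> bool {
        let limit = *self.limits.get(merchant_id).unwrap_or(&100); // default RPS (assumed)
        let now = Instant::now();
        let entry = self.windows.entry(merchant_id.to_string()).or_insert((now, 0));
        if now.duration_since(entry.0) >= Duration::from_secs(1) {
            *entry = (now, 0); // start a new one-second window
        }
        if entry.1 < limit {
            entry.1 += 1;
            true // request proceeds to the router
        } else {
            false // request is shed before reaching the critical path
        }
    }
}

fn main() {
    let mut limiter =
        MerchantRateLimiter::new(HashMap::from([("merchant_abc".to_string(), 2)]));
    for i in 0..4 {
        println!("request {i}: allowed = {}", limiter.allow("merchant_abc"));
    }
}
```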
Capacity monitoring, auto scaling & alerts
Hyperswitch is deployed on AWS cloud managed infrastructure. Capacity is modeled, provisioned, and auto-scaled dynamically. The system has granular capacity monitoring of the compute and storage resources to provision for the required scale in an ongoing manner.
The application is deployed through AWS's managed Kubernetes service. In case of traffic surges, the system scales horizontally without manual intervention through the Horizontal Pod Autoscaler (HPA).
Managing 10x traffic spikes
Hyperswitch uses relational databases (a PostgreSQL instance) for recording real-time payment information. These accesses usually hit the disk, which is generally a slower and more expensive operation. During peak traffic loads (10x the typical steady-state volume), the volume of database accesses - especially the writes - could become a bottleneck and stretch the relational database system. While vertical scaling and increasing the number of connections could be one solution, it can hit limits very quickly for a system operating at unexpected scale.
Hence, Hyperswitch was designed to place a key-value store (a Redis layer) atop the database. New writes are captured as table objects in Redis, keyed by the table's primary key identifier, to be copied to the database later. Reads are checked in Redis first and fall back to the database only if the result isn't found in Redis. The high throughput of Redis allows it to be used as a buffer against traffic spikes of up to 3000 TPS.
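The sketch below captures this write-to-KV-first and read-with-fallback pattern, using in-memory stand-ins for Redis and PostgreSQL; the types and field names are illustrative rather than the actual Hyperswitch schema.

```rust
use std::collections::HashMap;

/// Illustrative payment record; the real tables carry many more fields.
#[derive(Clone, Debug)]
struct PaymentIntent { id: String, status: String }

struct KvStore { map: HashMap<String, PaymentIntent> }   // Redis stand-in
struct Database { rows: HashMap<String, PaymentIntent> } // PostgreSQL stand-in

fn insert_payment(kv: &mut KvStore, intent: PaymentIntent) {
    // The hot-path write lands in the KV layer only; a background process
    // copies it to the database later (see the drainer sketch below).
    kv.map.insert(intent.id.clone(), intent);
}

fn find_payment(kv: &KvStore, db: &Database, id: &str) -> Option<PaymentIntent> {
    // KV first, database only on a miss.
    kv.map.get(id).cloned().or_else(|| db.rows.get(id).cloned())
}

fn main() {
    let mut kv = KvStore { map: HashMap::new() };
    let db = Database { rows: HashMap::new() };
    insert_payment(
        &mut kv,
        PaymentIntent { id: "pay_123".into(), status: "requires_capture".into() },
    );
    println!("{:?}", find_payment(&kv, &db, "pay_123"));
}
```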
In parallel, Hyperswitch also uses Redis streams to record the new data in Redis with the appropriate tags identifying the target table in the database and the actual operation to be executed. A background process called the 'drainer' then reads these Redis streams and executes them against the database at a constant rate. A valve-like control determines the draining rate and ensures the database is never overwhelmed.
The Redis layer thus acts as a shock absorber, damping any impulses to the database even as the overall payments traffic spikes.
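The drainer loop can be sketched as follows, with a queue standing in for the Redis stream and a print statement standing in for the database write; the batch size and tick interval are assumed values that play the role of the valve.

```rust
use std::collections::VecDeque;
use std::thread::sleep;
use std::time::Duration;

/// Illustrative drainer loop; the real drainer reads Redis streams and replays
/// the encoded operations against PostgreSQL.
fn drainer(stream: &mut VecDeque<String>) {
    const BATCH_SIZE: usize = 100;                      // entries drained per tick (assumed)
    const TICK: Duration = Duration::from_millis(200);  // valve controlling the drain rate

    while !stream.is_empty() {
        for _ in 0..BATCH_SIZE {
            match stream.pop_front() {
                Some(entry) => apply_to_database(&entry),
                None => break,
            }
        }
        // The pause keeps the write rate to the database constant regardless of
        // how fast entries arrive on the stream side.
        sleep(TICK);
    }
}

fn apply_to_database(entry: &str) {
    println!("replaying to PostgreSQL: {entry}");
}

fn main() {
    let mut stream: VecDeque<String> =
        (0..250).map(|i| format!("INSERT payment_intent pay_{i}")).collect();
    drainer(&mut stream);
}
```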
Optimized critical path
The web layer components (load balancers, inbound and outbound proxies) and core infrastructure components (Kubernetes nodes and clusters) are provisioned with abundant redundancy by leveraging the Multi-AZ setup provided by AWS.
The application components (such as the router) parallelize work wherever possible, ensuring that multiple DB operations can be executed simultaneously. Delays associated with fetching configuration data for each transaction are avoided by caching all merchant and processor-related configurations in memory.
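A simplified sketch of both ideas follows, using OS threads and a process-wide map as stand-ins for the router's async tasks and its in-memory configuration cache; the config keys and values are made up.

```rust
use std::collections::HashMap;
use std::sync::OnceLock;
use std::thread;

/// Illustrative process-wide configuration cache, populated once and then read
/// for every transaction without touching the database.
static CONFIG_CACHE: OnceLock<HashMap<String, String>> = OnceLock::new();

fn merchant_config(merchant_id: &str) -> Option<&'static String> {
    let cache = CONFIG_CACHE.get_or_init(|| {
        HashMap::from([("merchant_abc".to_string(), "3ds=preferred".to_string())])
    });
    cache.get(merchant_id)
}

fn load_payment_intent(id: &str) -> String { format!("intent:{id}") }
fn load_payment_attempt(id: &str) -> String { format!("attempt:{id}") }

fn main() {
    // Two independent reads needed for the same payment run concurrently.
    let intent = thread::spawn(|| load_payment_intent("pay_123"));
    let attempt = thread::spawn(|| load_payment_attempt("pay_123"));
    let (intent, attempt) = (intent.join().unwrap(), attempt.join().unwrap());

    println!("{intent} / {attempt} / config: {:?}", merchant_config("merchant_abc"));
}
```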
Change
Config/Code staggering and A/B testing
An A/B testing framework enables parallel evaluation of old vs. new configurations, or old vs. new code changes.
Hyperswitch tracks anomalies in metrics such as auth success rate, latency, and error codes between cohorts defined by payment dimensions and merchants. If a new configuration or code change degrades these metrics, the release is flagged and rolled back automatically.
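A minimal sketch of the cohort comparison and automatic rollback decision is shown below; the 1% degradation threshold and the metric structure are assumptions for illustration only.

```rust
/// Hypothetical per-cohort metrics; only the auth success rate is modelled here.
struct CohortMetrics {
    attempts: u64,
    successes: u64,
}

impl CohortMetrics {
    fn auth_success_rate(&self) -> f64 {
        if self.attempts == 0 { 0.0 } else { self.successes as f64 / self.attempts as f64 }
    }
}

/// Returns true when the new configuration / code path should be rolled back.
fn should_roll_back(control: &CohortMetrics, treatment: &CohortMetrics) -> bool {
    const MAX_DEGRADATION: f64 = 0.01; // allow at most a 1% absolute drop (assumed)
    control.auth_success_rate() - treatment.auth_success_rate() > MAX_DEGRADATION
}

fn main() {
    let control = CohortMetrics { attempts: 10_000, successes: 9_200 };
    let treatment = CohortMetrics { attempts: 10_000, successes: 8_900 };

    if should_roll_back(&control, &treatment) {
        println!("anomaly detected: flagging release and rolling back");
    } else {
        println!("treatment cohort healthy: continuing staged rollout");
    }
}
```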
CI/CD checks
The application is built on type-safe programming languages (Rust and ReScript). The application leverages compiler plugins and code-commit checks to prevent errors at an early stage, before they propagate to production. Further, regression tests are automated and executed daily in the CI/CD environment.
Isolation
Considering the nature of the product, the majority of high-frequency changes within Hyperswitch happen in the connector integration layer. To prevent these high-frequency changes from creating failure scenarios for the system, the integration layer (within the router) has been modelled as a stateless service and an independent Rust crate. Any unforeseen bug or compatibility issue is therefore naturally contained to the specific connector, without affecting the core payment system. This allows integrations to be pushed in an agile manner while ensuring the reliability of the critical-path system.
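The sketch below shows the shape of such an isolation boundary: connectors implement a common trait and surface failures as errors that the router can contain. The trait and type names are simplified for illustration and do not mirror the actual connector interfaces.

```rust
/// Illustrative request and error types for the integration boundary.
#[derive(Debug)]
struct AuthorizeRequest { amount_minor: i64, currency: &'static str }

#[derive(Debug)]
enum ConnectorError { RequestFailed(String) }

/// Each connector is stateless and implements a common trait, so a faulty
/// implementation can only fail its own transactions.
trait Connector {
    fn name(&self) -> &'static str;
    fn authorize(&self, req: &AuthorizeRequest) -> Result<String, ConnectorError>;
}

struct GoodConnector;
struct BuggyConnector;

impl Connector for GoodConnector {
    fn name(&self) -> &'static str { "good_psp" }
    fn authorize(&self, req: &AuthorizeRequest) -> Result<String, ConnectorError> {
        Ok(format!("authorized {} {}", req.amount_minor, req.currency))
    }
}

impl Connector for BuggyConnector {
    fn name(&self) -> &'static str { "buggy_psp" }
    fn authorize(&self, _req: &AuthorizeRequest) -> Result<String, ConnectorError> {
        // An incompatibility surfaces as an error for this connector only.
        Err(ConnectorError::RequestFailed("unexpected response schema".into()))
    }
}

fn route_payment(connector: &dyn Connector, req: &AuthorizeRequest) {
    // The router treats every connector uniformly; a failure here is contained
    // and can trigger a retry on another connector instead of a system fault.
    match connector.authorize(req) {
        Ok(ok) => println!("{}: {ok}", connector.name()),
        Err(err) => println!("{}: contained failure: {err:?}", connector.name()),
    }
}

fn main() {
    let req = AuthorizeRequest { amount_minor: 1099, currency: "USD" };
    route_payment(&GoodConnector, &req);
    route_payment(&BuggyConnector, &req);
}
```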