On-Premises HA - Two-Location Cold-Standby

Overview

This page describes a two-location, high-availability (HA) architecture for a Public Key Infrastructure (PKI) platform. The architecture provides redundancy across two geographically separated sites using a cold-standby, active/passive failover model.

Figure: Overview Diagram, Two-Location Setup

Architecture Components

This architecture spans two geographically separated locations and consists of the following core components:

  1. Two CARA VMs running independently and unaware of each other.

  2. Two HSMs, clustered with each other and accessed by the CARA VMs via PKCS#11.

  3. Two CLM VMs running independently and unaware of each other.

  4. Two Database nodes forming an active/passive cluster.

The design also supports network segmentation to isolate different components, such as CLM (RA), CARA (CA), and the Database.

CARA VMs

The CARA VMs consist of the following core components:

  • Webserver / Reverse Proxy → Performs TLS termination and forwards traffic to the local CARA services.

  • CARA Services → Run locally behind the reverse proxy.

HSMs

The HSMs are clustered, with each HSM connected only to the CARA instance at its own location. The specifics of cluster configuration and replication are vendor-dependent and are not covered on this page.

CLM VMs

The core components of the CLM VMs are the following:

  • Webserver / Reverse Proxy → Performs TLS termination and forwards traffic to the local CLM services.

  • CLM Services → Run locally behind the reverse proxy.

Database Nodes

Two database nodes, one per location, form an active/passive cluster with manual failover. The cluster hosts the databases for both the CLM and CARA applications. If further isolation of CLM and CARA application data is desired, two separate clusters can be deployed instead.
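The manual-failover property of the active/passive cluster can be modelled as a small state check. The following is a minimal sketch, not a database product's API: the class names, the site labels "site-a"/"site-b", and the `old_primary_fenced` flag are hypothetical. It shows the one invariant that matters, namely that exactly one node is writable and that promotion happens only on an operator decision after the old primary has been fenced.

```python
from dataclasses import dataclass


@dataclass
class DbNode:
    """A database node at one location (site names are hypothetical)."""
    site: str
    role: str  # "primary" (active) or "standby" (passive)


class ActivePassiveCluster:
    """Two-node active/passive cluster in which only an operator promotes."""

    def __init__(self, primary: DbNode, standby: DbNode) -> None:
        self.nodes = {primary.site: primary, standby.site: standby}

    def primary(self) -> DbNode:
        # Exactly one node may hold the primary (writable) role at a time.
        primaries = [n for n in self.nodes.values() if n.role == "primary"]
        if len(primaries) != 1:
            raise RuntimeError("cluster must have exactly one primary")
        return primaries[0]

    def promote(self, site: str, old_primary_fenced: bool) -> None:
        """Manually promote the standby node at `site`.

        Refuses to act unless the operator confirms that the current primary
        is fenced (stopped or isolated), so two writable primaries never coexist.
        """
        target = self.nodes[site]
        if target.role == "primary":
            raise ValueError(f"{site} is already the primary")
        if not old_primary_fenced:
            raise RuntimeError("fence the current primary before promoting")
        # Demote the old primary; real systems typically need it re-synchronised
        # before it can act as a standby again.
        self.primary().role = "standby"
        target.role = "primary"


cluster = ActivePassiveCluster(DbNode("site-a", "primary"), DbNode("site-b", "standby"))
cluster.promote("site-b", old_primary_fenced=True)
print(cluster.primary().site)  # site-b
```

The fencing precondition is the point of the sketch: without it, a manual promotion while the old primary is still reachable would recreate the split-brain situation this architecture is designed to avoid.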

Data Flow and General Considerations

Clients and administrators connect directly to the CLM and CARA instances. Traffic can be directed to the active site, for example, via static BGP routing or by switching CNAME records in DNS.
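A CNAME-based switch can be pictured as repointing stable service names from one site's hosts to the other's. The sketch below uses an in-memory dictionary in place of a real DNS zone; the record names under `example.com` and the function `switch_active_site` are hypothetical, and an actual change would go through whatever DNS management tooling is in use.

```python
from __future__ import annotations

# Hypothetical record names: one service name per application, pointing at a
# site-specific host. Real zone names and DNS tooling will differ.
SITE_TARGETS = {
    "site-a": {
        "clm.pki.example.com": "clm.site-a.example.com",
        "cara.pki.example.com": "cara.site-a.example.com",
    },
    "site-b": {
        "clm.pki.example.com": "clm.site-b.example.com",
        "cara.pki.example.com": "cara.site-b.example.com",
    },
}


def switch_active_site(zone: dict[str, str], site: str) -> dict[str, str]:
    """Return a copy of `zone` whose CNAME records all point at `site`."""
    if site not in SITE_TARGETS:
        raise KeyError(f"unknown site: {site}")
    updated = dict(zone)
    updated.update(SITE_TARGETS[site])
    return updated


zone = dict(SITE_TARGETS["site-a"])        # normal operation: site A is active
zone = switch_active_site(zone, "site-b")  # manual failover to site B
print(zone["clm.pki.example.com"])         # clm.site-b.example.com
```

Both the CLM and CARA records are switched together so that clients never reach a CLM at one site and a CARA at the other. Resolvers may keep serving the old target until the record's TTL expires, so the TTL bounds how quickly a DNS-based switch takes effect.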

  • Each CARA VM connects only to the HSM at its own location.

  • Each CLM VM connects only to the CARA VM at its own location.

  • Both CARA and CLM VMs connect directly to the database node at their own location.
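The connectivity rules above, combined with the network segmentation between CLM (RA), CARA (CA), HSM, and database, can be expressed as an allow-list. The following minimal sketch, with a hypothetical `Endpoint` type and site labels, checks that a flow goes between permitted components and stays within one location. Client and administrator traffic, as well as the vendor-specific HSM and database replication between sites, are outside the scope of this check.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    component: str  # "CLM", "CARA", "HSM" or "DB"
    site: str       # hypothetical location label, e.g. "site-a"


# (source, destination) pairs permitted by the data-flow rules.
ALLOWED_FLOWS = {
    ("CLM", "CARA"),
    ("CARA", "HSM"),
    ("CLM", "DB"),
    ("CARA", "DB"),
}


def is_allowed(src: Endpoint, dst: Endpoint) -> bool:
    """Permit a flow only between allowed components at the same location."""
    return (src.component, dst.component) in ALLOWED_FLOWS and src.site == dst.site


print(is_allowed(Endpoint("CLM", "site-a"), Endpoint("CARA", "site-a")))  # True
print(is_allowed(Endpoint("CLM", "site-a"), Endpoint("CARA", "site-b")))  # False
print(is_allowed(Endpoint("CLM", "site-a"), Endpoint("HSM", "site-a")))   # False
```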

In this architecture, failures must be detected by a monitoring system and handled manually by the operations team. Failover to the standby site is a conscious decision and requires manual action.
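The split between detection and decision can be made explicit in tooling: monitoring only raises alerts, and the failover steps are released only after an operator confirms. The sketch below is illustrative; the probe names, the functions `evaluate_health` and `failover_plan`, and the step list are assumptions. The step order is derived from the data flow on this page (CLM depends on CARA, and both depend on the database), not from a documented runbook.

```python
from __future__ import annotations

# Hypothetical runbook order, derived from the data-flow dependencies:
# CLM depends on CARA, and both depend on the database.
FAILOVER_STEPS = [
    "Confirm the primary site is down or isolated",
    "Promote the standby database node",
    "Verify CARA at the standby site reaches its local HSM and database",
    "Verify CLM at the standby site reaches its local CARA and database",
    "Switch traffic (BGP route or CNAME records) to the standby site",
]


def evaluate_health(probes: dict[str, bool]) -> list[str]:
    """Turn probe results into alerts; this function never triggers failover."""
    return [f"ALERT: {name} check failed" for name, ok in sorted(probes.items()) if not ok]


def failover_plan(operator_confirmed: bool) -> list[str]:
    """Return the ordered steps only after an explicit operator decision."""
    return list(FAILOVER_STEPS) if operator_confirmed else []


alerts = evaluate_health({"cara.site-a": True, "db.site-a": False})
print(alerts)                # ['ALERT: db.site-a check failed']
print(failover_plan(False))  # []
```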

Because two locations alone cannot form a proper quorum, an automatic hot-standby failover mechanism across only two locations carries a high inherent risk of split-brain and is strongly discouraged.
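The quorum argument can be checked with the usual strict-majority rule, where a partition may act only if it holds more than half of all votes. This is a generic sketch of that rule with one vote per location assumed, not the quorum logic of any particular clustering product. After a network split between two locations, each side sees only one of two votes, so neither side can safely promote itself; an automatic mechanism that ignored quorum could let both sides promote, which is split-brain. With a third location, the side that still sees two of three votes keeps quorum.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """Strict-majority rule: a partition may act only with more than half the votes."""
    return reachable_votes > total_votes // 2


# Two locations: after a network split each side sees 1 of 2 votes.
print(has_quorum(1, 2))  # False: neither side may safely promote itself
# Three locations: the side that still sees 2 of 3 votes keeps quorum.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
```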

Conclusion

This two-location cold-standby architecture provides a georedundant design for a PKI platform, ensuring that a standby site is available to recover operations in case of a failure at the primary site.

Traffic can be directed to the sites via static BGP routes or CNAME records. However, failover requires manual intervention by the operations team. Proper monitoring is essential for detecting failures and coordinating recovery.

Overall, this architecture provides resilience against site-level failures in deployments where a third location for automatic failover is not feasible. Because failover is a deliberate manual decision, this architecture favors data integrity and predictable traffic flows.