Multi-Tier High Availability Architecture for On-Premises PKI Platform

Overview

This page describes a multi-tier, high-availability (HA) architecture for a Public Key Infrastructure (PKI) platform. The design combines redundant sites, automatic failover, and database clustering so that the platform keeps operating when a single location fails.

Architecture Components

The architecture consists of three locations:

- Location A (Primary)
- Location B (Secondary)
- Location C (Arbitrator)

Each location has distinct roles to provide a fault-tolerant PKI environment.

Location A (Primary)

Frontend Network:
- Keepalived (VRRP) for high-availability networking.
- Keycloak for authentication and identity management.
- ACME for automated certificate issuance.
- OCSP / HTTP CRL for certificate revocation services.
- EST and SCEP for certificate enrollment, and CLM for certificate lifecycle management.
- Runs on a single Virtual Machine (VM).
Backend Network:
- Keepalived (VRRP) for HA networking.
- CARA Admin and CARA for certificate authority and administration.
- MariaDB Galera Node for database clustering.
- Runs on a single VM.
HSM Network:
- Hardware Security Module (HSM) for key management and cryptographic operations.
- HSM synchronization between locations.
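
Because the backend VM runs both Keepalived and a Galera node, one common pattern is to let Keepalived track the health of the local database node. The sketch below is an assumption about how such a check could look, not a description of the deployed configuration. The wsrep_* names are standard Galera status variables; the decision rule and the function names are hypothetical.

```python
from typing import Mapping

# Galera status values that indicate a node is safe to serve traffic.
# The variable names are standard; treating exactly these three as the
# health criteria is an assumption made for this sketch.
REQUIRED_STATUS = {
    "wsrep_ready": "ON",
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
}


def node_is_healthy(status: Mapping[str, str]) -> bool:
    """Return True when every required wsrep variable has its expected value."""
    return all(status.get(name) == expected
               for name, expected in REQUIRED_STATUS.items())


def exit_code(status: Mapping[str, str]) -> int:
    """Map health to a process exit code: 0 for healthy, 1 for unhealthy."""
    return 0 if node_is_healthy(status) else 1
```

A tracking script built on this would read the values from SHOW STATUS LIKE 'wsrep_%' and exit with the returned code, so that an unhealthy database node can cause the virtual IP to move to the other site.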

Location B (Secondary)

- Mirrors the setup of Location A to provide failover capability.
- Participates in the HA clustering for both frontend and backend services.
- Maintains database replication with the primary site via MariaDB Galera Cluster.
- HSM is also present and synchronized with the primary site.

Location C (Arbitrator)

- Ensures quorum for the MariaDB Galera Cluster.
- Runs Keepalived (VRRP) to participate in HA networking.
- Acts as either a full MariaDB Galera node or a Galera Arbitrator (garbd), giving the cluster a third vote to prevent split-brain scenarios.
- Runs on a single VM.
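
The role of the third location comes down to majority arithmetic. The following is a minimal sketch, assuming each location carries one equal vote; the site names and the helper functions are illustrative only.

```python
# One vote per location; equal weights are an assumption of this sketch.
VOTES = {"A": 1, "B": 1, "C": 1}


def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition keeps quorum only with a strict majority of all votes."""
    return 2 * reachable_votes > total_votes


def partition_has_quorum(partition) -> bool:
    """Check whether the given set of mutually reachable sites holds a majority."""
    total = sum(VOTES.values())
    reachable = sum(VOTES[site] for site in partition)
    return has_quorum(reachable, total)
```

With three votes, any two locations that can still reach each other form a majority. With only two sites and no arbitrator, has_quorum(1, 2) is false for both halves of a network split, so neither side could safely continue.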

High-Availability Mechanisms

VRRP (Keepalived): Provides floating virtual IP addresses for frontend and backend services, so traffic moves to a surviving node automatically when the active node fails.
MariaDB Galera Cluster: Multi-master database cluster that replicates writes synchronously between nodes, keeping data consistent and the database available if the primary or secondary location fails.
HSM Synchronization: Ensures cryptographic operations remain consistent across locations.
Quorum-based Failover: The arbitrator node prevents split-brain situations by participating in voting for database cluster integrity.
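
The VRRP behaviour described above can be sketched as a simple election: among the nodes that are still alive, the one with the highest priority holds the virtual IP, and ties are broken by the higher address. This is a simplified model of the protocol rule, not Keepalived's implementation. The node names, priorities, and addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Iterable, Optional


@dataclass(frozen=True)
class VrrpNode:
    name: str             # hypothetical node label
    priority: int         # higher priority wins the election
    address: IPv4Address  # used only as a tie-breaker
    alive: bool = True


def elect_master(nodes: Iterable[VrrpNode]) -> Optional[VrrpNode]:
    """Return the live node that should hold the virtual IP, or None if all are down."""
    candidates = [n for n in nodes if n.alive]
    if not candidates:
        return None
    # Highest priority first; the higher IP address breaks ties.
    return max(candidates, key=lambda n: (n.priority, n.address))
```

Giving Location A the higher priority in such a scheme makes it the normal holder of the virtual IP, with Location B taking over only when A stops advertising.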

Data Flow and Failover Scenarios

Normal Operation:
- Location A handles primary PKI operations.
- Location B remains in sync, ready to take over.
- Location C acts as an arbitrator for database quorum.
Failure at Location A:
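- Location B automatically takes over through VRRP failover: its Keepalived instances claim the virtual IP addresses.
- Database operations remain available because Locations B and C still hold two of the three quorum votes in the Galera Cluster.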
Failure at Location B:
- Location A continues normal operations.
- Locations A and C together retain database quorum, so the cluster stays writable.
Failure at Location C:
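- No immediate impact: Locations A and B still form a majority, and the database keeps replicating between them.
- Redundancy is reduced until Location C returns. If Location A or B then fails abruptly, the surviving node cannot form a majority on its own and stops accepting writes until quorum is restored.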
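
The scenarios above can be checked with a small simulation of successive site failures. It is a simplified model under stated assumptions: every site has one equal vote, failures are abrupt rather than graceful shutdowns, and the surviving group must hold a strict majority of the membership that existed just before the failure. The function name and the output format are illustrative.

```python
def simulate(failures, members=("A", "B", "C")):
    """Apply abrupt site failures in order.

    Returns a list of (failed_site, surviving_sites, still_has_quorum) tuples
    and stops at the first failure that costs the survivors their quorum.
    """
    current = set(members)
    history = []
    for site in failures:
        survivors = current - {site}
        # Survivors need a strict majority of the previous membership.
        primary = 2 * len(survivors) > len(current)
        history.append((site, sorted(survivors), primary))
        if not primary:
            break
        # The surviving group becomes the new baseline membership.
        current = survivors
    return history


print(simulate(["A"]))       # [('A', ['B', 'C'], True)]
print(simulate(["C", "A"]))  # [('C', ['A', 'B'], True), ('A', ['B'], False)]
```

The second run shows why losing Location C is harmless on its own but leaves the cluster unable to tolerate a further failure.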

Conclusion

This architecture provides a robust HA design for a PKI platform, ensuring minimal downtime, database integrity, and secure cryptographic operations. By leveraging VRRP, MariaDB Galera Cluster, and HSM synchronization, the system achieves redundancy, failover readiness, and secure key management across multiple locations.