Hi,
We have two NetHSMs at two different physical locations for redundancy. Each serves an application that runs locally, but the two NetHSMs are also kept in sync so that one can step in if the other fails.
At the moment we make changes on one of them (mostly creating key pairs and uploading certificates), back it up, and then restore that backup to the other NetHSM to keep them in sync.
This works, but it severely limits how we can use them. For instance, we can only generate new key pairs, load certificates, etc. on one of them and then sync these changes to the other instance. We cannot do it both ways.
The scenario
In our scenario we want to produce a large number of key pairs (a few million) at both sites at the same time. Currently we would have to do this outside the NetHSM, which is far from optimal. The scheme would be something like this:
- Generate key pair outside NetHSM
- Generate random AES key
- Encrypt private key using AES key
- Encrypt AES key using RSA-OAEP and an RSA key stored in the NetHSM
- Store encrypted AES key so we can recover private key later
- Zero out memory that held the private key
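The steps above could look roughly like this in Python. This is only a sketch under assumptions: it uses the `cryptography` package, P-256 for the generated key pairs and AES-256-GCM for the private key encryption (none of which are fixed by the scheme), and the function names `wrap_new_key_pair` and `unwrap_private_key` are made up for illustration. The RSA key is passed in as an ordinary key object so the sketch runs on its own; in the real setup only the public half would leave the NetHSM and the OAEP decryption would be performed by the NetHSM.

```python
import os

from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import ec, padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# RSA-OAEP parameters (SHA-256 is an assumption; both sides must agree).
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def wrap_new_key_pair(hsm_public_key):
    """Generate a key pair locally and return it in wrapped form."""
    # Generate the key pair outside the NetHSM.
    private_key = ec.generate_private_key(ec.SECP256R1())
    private_der = private_key.private_bytes(
        serialization.Encoding.DER,
        serialization.PrivateFormat.PKCS8,
        serialization.NoEncryption(),
    )
    public_der = private_key.public_key().public_bytes(
        serialization.Encoding.DER,
        serialization.PublicFormat.SubjectPublicKeyInfo,
    )

    # Generate a random AES key and encrypt the private key with it.
    aes_key = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)
    encrypted_private_key = AESGCM(aes_key).encrypt(nonce, private_der, None)

    # Encrypt the AES key with RSA-OAEP under the HSM-held key's public half.
    wrapped_aes_key = hsm_public_key.encrypt(aes_key, OAEP)

    # Dropping the references is the most Python can do here: its immutable
    # bytes objects cannot be reliably zeroed, so real zeroisation would need
    # a language with explicit memory control.
    del private_key, private_der, aes_key

    # This record is what gets stored so the private key can be recovered.
    return {
        "public_key": public_der,
        "nonce": nonce,
        "encrypted_private_key": encrypted_private_key,
        "wrapped_aes_key": wrapped_aes_key,
    }


def unwrap_private_key(hsm_private_key, record):
    """Recover the private key; the first step stands in for a NetHSM decrypt."""
    aes_key = hsm_private_key.decrypt(record["wrapped_aes_key"], OAEP)
    private_der = AESGCM(aes_key).decrypt(
        record["nonce"], record["encrypted_private_key"], None
    )
    return serialization.load_der_private_key(private_der, password=None)
```

For a local test, `rsa.generate_private_key(public_exponent=65537, key_size=2048)` can stand in for the NetHSM key. The sketch also makes the weakness visible: the private key exists in application memory in the clear, which is exactly what we would like to avoid.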
Since we need to generate these keys at both sites at the same time, we can’t use backup-based synchronisation, as there is no way to merge differences. We could, of course, have both sites use the same NetHSM, and periodically back up that primary and restore it to a secondary HSM, but since the two sites are physically separate and there may be network partitions, this would not be practical.
A proposal
I think there is a way to solve this that doesn’t require complex two-way synchronisation, and that could probably be implemented in a series of discrete, incremental steps.
Backup and restore of namespace contents
The first step would be to make it possible to back up individual namespaces rather than the entire NetHSM state. If we could back up the contents of a namespace, copy the backup to the other NetHSM and restore it there, that would let us copy a single namespace from one NetHSM to another.
Make NetHSM aware of primary and replica status
The second step might be to make the NetHSM aware of whether it is the primary or a replica for a namespace, so that you can configure it to deny mutations on the replica while still allowing it to be used as a read replica (enabling use of the keys, but not creation, modification or deletion).
Of course, this would have to be configurable so that, if you lose a primary, you can promote the replica to become a primary until you have a new primary in place (which would then be populated from the replica and configured as the new primary).
The application would still perform the backup from the primary and the restore to the replica (which in some scenarios might be desirable if the primary and the replica are isolated from each other).
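To make the intended behaviour concrete, here is a toy in-memory model of a namespace with a role. Nothing in it is NetHSM API: `Namespace`, `Role`, `ReadOnlyNamespaceError` and all method names are hypothetical, and keys are plain values in a dict.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    REPLICA = "replica"


class ReadOnlyNamespaceError(Exception):
    """Raised when a mutation is attempted on a replica."""


class Namespace:
    def __init__(self, name, role):
        self.name = name
        self.role = role
        self.keys = {}

    def _require_primary(self):
        if self.role is not Role.PRIMARY:
            raise ReadOnlyNamespaceError(f"{self.name} is a replica")

    def create_key(self, key_id, material):
        self._require_primary()  # mutations are denied on a replica
        self.keys[key_id] = material

    def delete_key(self, key_id):
        self._require_primary()
        del self.keys[key_id]

    def use_key(self, key_id):
        # Using a key is allowed in both roles (read replica).
        return self.keys[key_id]

    def backup(self):
        return dict(self.keys)

    def restore(self, snapshot):
        # Application-driven sync: restoring is the one way a replica changes.
        self.keys = dict(snapshot)

    def promote(self):
        # Used when the primary is lost.
        self.role = Role.PRIMARY
```

In this model the application calls `backup()` on the primary and `restore()` on the replica, and `promote()` is the manual step taken after losing the primary.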
Scheduled replication
The third step might be to make replication part of how namespaces can be configured. That is, you could configure a namespace on the NetHSM to know that it is a primary and that it has one or more replicas to which it should replicate the state of the namespace, for instance on a schedule.
Streaming replication
The fourth step might be to make it possible to replicate changes to individual keys or key pairs (in a namespace), so that any changes made on the primary get replicated to the replica instance(s) in near real-time (except in the case of a network partition).
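One common way to get this behaviour, including catching up after a partition, is an append-only change log with sequence numbers. The sketch below is a toy model of that idea and not a description of how NetHSM works or should work internally; the `Primary` and `Replica` classes and their methods are hypothetical.

```python
class Primary:
    def __init__(self):
        self.keys = {}
        self.log = []  # append-only list of (seq, op, key_id, material)

    def _record(self, op, key_id, material):
        self.log.append((len(self.log) + 1, op, key_id, material))

    def put_key(self, key_id, material):
        self.keys[key_id] = material
        self._record("put", key_id, material)

    def delete_key(self, key_id):
        del self.keys[key_id]
        self._record("delete", key_id, None)

    def changes_since(self, seq):
        # Sequence numbers start at 1, so entry number n sits at index n - 1.
        return self.log[seq:]


class Replica:
    def __init__(self):
        self.keys = {}
        self.applied_seq = 0  # highest sequence number applied so far

    def apply(self, changes):
        for seq, op, key_id, material in changes:
            if seq <= self.applied_seq:
                continue  # already applied, so redelivery is harmless
            if op == "put":
                self.keys[key_id] = material
            else:
                self.keys.pop(key_id, None)
            self.applied_seq = seq
```

While the link is up, each change is sent as it happens. After a partition the replica asks for `changes_since(applied_seq)` and replays what it missed.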
Single key backup
Another possibility would be for the NetHSM to support backing up a single key, which can then be loaded onto the replica NetHSM. This might be easier for you to implement, as it would offload the entire burden of the replication logic to the application.
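With such a feature, the application-side logic could be as simple as copying whatever the other side is missing. In the sketch below each NetHSM is modelled as a dict from key ID to an opaque key backup blob, and the assignment stands in for a hypothetical "back up one key here, restore it there" pair of calls. It assumes key IDs never collide between sites, for example because each site uses its own prefix.

```python
def sync_missing_keys(source, target):
    """Copy keys present on source but absent on target; return the copied IDs."""
    copied = []
    for key_id in sorted(source.keys() - target.keys()):
        # Stands in for a single-key backup on source and restore on target.
        target[key_id] = source[key_id]
        copied.append(key_id)
    return copied


site_a = {"a-0001": "blob-a1", "a-0002": "blob-a2"}
site_b = {"b-0001": "blob-b1"}

# Running the copy in both directions leaves both sites with the union.
sync_missing_keys(site_a, site_b)
sync_missing_keys(site_b, site_a)
```

Since keys are only ever added in our scenario, running this in both directions would give us the two-way sync that whole-device backups cannot.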