HSM intermittent failures - How to debug properly?

nk_hsm · March 27, 2025, 9:18am

We are making use of the Nitrokey HSM to issue leaf certificates for a PKI setup. For this we call openssl to work with the pkcs11 engine that can communicate with the HSM.

This works fine for the majority of time, but if we do a lot of requests in short succession, it seems that the key (or pcsd daemon?) gets into a state were it no longer can communicate, throwing a vague error:

Failed to enumerate slots
PKCS11_get_private_key returned NULL
Could not read CA private key from org.openssl.engine:pkcs11:0:10
40E7B621687F0000:error:41800005:PKCS#11 module:ERR_CKR_error:General Error:p11_slot.c:445:
40E7B621687F0000:error:40000067:pkcs11 engine:ERR_ENG_error:invalid parameter:eng_back.c:603:
40E7B621687F0000:error:13000080:engine routines:ENGINE_load_private_key:failed loading private key:../crypto/engine/eng_pkey.c:79:

The PCKS11/PKCS15 tools also fail to enumerate the devices at this point.

Did anyone on the forum encounter this before (or know a way how we can better debug this to find out what might be causing the problem)?

A full reboot + an replug of the Nitrokey HSM fixes the issue every time up to now, but that’s not a nice solution if we encounter this on a remote system…

saper · March 31, 2025, 9:51pm

I don’t see this and I have some HSM’s plugged into USB ports and the USB sharing device for few years already.

Can you check if you get some reports from the USB stack that the device is gone/disconnected?

nk_hsm · April 1, 2025, 7:10am

As far as we can see, the device does not disconnect:

We’re using the latest versions of libp11 (0.4.13) and opensc (0.26.1) to no avail (we upgraded from a previous version to be sure it wasn’t caused by a known bug).

Debug output came from running the PCSCD daemon with verbose logging and running the tools with debug verbosity.

saper · April 5, 2025, 8:31pm

I’ve seen something like this - during the attempt of the parallel use of the card, on in short succession, the card is “busy” and not available. Which PKCS#11 driver are you using?

nk_hsm · April 7, 2025, 7:05am

The one from opensc: opensc-pkcs11.so; signing of certificates goes over OpenSSL commandline, and is normally not in parallel (but we do have a cron job requesting hsm info over the pkcs11-tool).

We are indeed seeing this happen every now and then when doing multiple requests in short time, but did not find a reproduction scenario yet where it can be triggered every time. As stated in the original post, the annoying thing is that if we get into this state, only a reboot and re-plug of the device seem to be working to bring it back to a normal state.

That’s why I’m inquiring on how I could possibly debug this so we can report an issue (or find a workaround)

saper · April 7, 2025, 6:05pm

This could be it.

As far as the debugging goes, pcscd -adf debug logs are great (you probably have them already - please no screenshots…)

OpenSC has also great debug logs in the opensc.conf

app default {
 debug = 9;
 debug_file = /tmp/opensc-debug.log;
}

but I believe your problem is a low level one.

Card ATR CE 00 00 is certainly incorrect one. Maybe this is a leftover from the parallel operation. Maybe the card could not be reset fully or some other operation was running in parallel (like your cron check).