Command and Control of Clusters
Previous incarnations of this Frontend relied on the frontend webserver instance being able to SSH directly to clusters in order to perform command and control (C2) operations. When clusters were created, an admin user was set up that would accept a public SSH key for which the webserver owned the private key. This was largely straightforward, and worked quite well. The clusters were also able to make HTTP API queries to the webserver.
This works well in the case where webserver and clusters all have public IP addresses, and are able to receive inbound requests, but it breaks down in the case where a user may wish to have the compute clusters not be directly exposed to the public internet.
The new approach is based around Google PubSub. The webserver and clusters now no longer directly communicate via SSH and HTTP, but rather send messages via Google Cloud. This offers the advantage of supporting clusters that are not publicly accessible, and removes the reliance on SSH connections between webserver and clusters.
PubSub details
During deployment of a new Frontend system, a new Google PubSub Topic is created. This centralized topic is used for all command and control traffic between the Frontend webserver and the client clusters. Individual Subscriptions are created for each cluster controller, as well as for the Frontend itself. The Frontend creates a new subscription for each cluster, and informs the cluster that it is to use the newly-created subscription.
Message Delivery
Each subscription is filtered based on a message attribute. Messages to each Cluster MUST have an attribute target=cluster_X where X is the Cluster's unique ID. Messages to the Frontend MUST NOT have a target attribute.
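The filtering rule above can be sketched as a pair of subscription filter expressions. This is a minimal illustration, not the Frontend's actual code: the helper names `cluster_filter`, `FRONTEND_FILTER`, and `matches` are hypothetical, and the filter strings assume the standard Pub/Sub subscription filter syntax for attributes.

```python
def cluster_filter(cluster_id):
    """Filter for a cluster's subscription: only messages targeted at that cluster."""
    return f'attributes.target = "cluster_{cluster_id}"'


# Filter for the Frontend's subscription: only messages with no target attribute.
FRONTEND_FILTER = "NOT attributes:target"


def matches(filter_owner, attributes):
    """Local emulation of the two filters above, handy for unit tests.

    filter_owner is a cluster ID, or None for the Frontend (hypothetical convention).
    """
    if filter_owner is None:
        return "target" not in attributes
    return attributes.get("target") == f"cluster_{filter_owner}"
```

With these filters, a message carrying target=cluster_7 is delivered only to cluster 7's subscription, and a message with no target attribute is delivered only to the Frontend's subscription.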
By using filtering in this way, and having a 1:1 mapping between Subscription and Recipient, we guarantee that the messages being sent are received by only the intended recipient.
Message Schema
Beyond the filtering attribute requirements previously discussed, the form of the messages is as follows:
Common Attributes
command - The command being sent as part of the message
source - The identity of the sender - corresponds to the target attribute. This is how the Frontend identifies which cluster sent the message
Common Message Data
ackid - A UUID generated by the system to identify a command/response pair.
Each command will have additional data in the Message Data, specific to that command's requirements.
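As a rough sketch of this schema, the helper below assembles the attributes and message data for one message. Several details are assumptions made for illustration only: the document does not say how the message data is encoded (JSON is assumed here), the function name `build_message` is hypothetical, and so is the "frontend" source value used in the usage example.

```python
import json
import uuid


def build_message(command, source, target=None, ackid=None, **extra):
    """Return (data, attributes) for one C2 message.

    target is omitted for messages addressed to the Frontend.
    ackid is generated for a new command, or passed in when responding to one.
    extra holds command-specific fields (names are up to each command).
    """
    attributes = {"command": command, "source": source}
    if target is not None:
        attributes["target"] = target
    payload = {"ackid": ackid or str(uuid.uuid4())}
    payload.update(extra)
    # Assumption: message data is UTF-8 encoded JSON.
    return json.dumps(payload).encode("utf-8"), attributes


# Frontend -> cluster 3: a new command gets a fresh ackid.
data, attrs = build_message("SYNC", source="frontend", target="cluster_3")

# Cluster 3 -> Frontend: the response reuses the ackid and carries no target.
ackid = json.loads(data)["ackid"]
reply_data, reply_attrs = build_message("ACK", source="cluster_3", ackid=ackid)
```

Reusing the ackid in the response is what lets the sender pair it with the original command.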
Commands
ACK - Acknowledges a previous command, signals that the command is complete
UPDATE - Acknowledges a previous command, but signals that the command is not yet complete. Can be sent in response to other commands multiple times, to be finally followed by an ACK
PING, PONG - Testing commands. Not typically used
CLUSTER_STATUS - Cluster command to Frontend to indicate a change in the status of the cluster. For example, to signal that the cluster has finished initialization and is ready for jobs.
SYNC - Command to cluster to synchronize with the Frontend, including updating Log Files, and potentially other activities in the future (such as setting user permissions).
SPACK_INSTALL - Install a Spack package
RUN_JOB - Submit a job on behalf of a user to SLURM
REGISTER_USER_GCS - Begin the process to register a user's GCS credentials with gsutil.
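The UPDATE/ACK pattern implies that the sender keeps track of outstanding commands by ackid until the final ACK arrives. The sketch below shows one way that bookkeeping might look. The class name `PendingCommands`, its methods, and the callback shape are hypothetical and not taken from the Frontend's code.

```python
class PendingCommands:
    """Tracks commands that have been sent but not yet fully acknowledged."""

    def __init__(self):
        self._callbacks = {}  # ackid -> (on_update, on_done)

    def register(self, ackid, on_update, on_done):
        """Remember a sent command until its final ACK arrives."""
        self._callbacks[ackid] = (on_update, on_done)

    def handle_response(self, command, payload):
        """Route an UPDATE or ACK to its callbacks. Returns True if handled."""
        ackid = payload["ackid"]
        callbacks = self._callbacks.get(ackid)
        if callbacks is None:
            return False  # unknown or already completed command
        on_update, on_done = callbacks
        if command == "UPDATE":
            on_update(payload)  # command still in progress, keep it pending
        elif command == "ACK":
            del self._callbacks[ackid]  # command complete, stop tracking it
            on_done(payload)
        else:
            return False
        return True
```

Any number of UPDATE messages can arrive for the same ackid; only the ACK removes the entry.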
Cluster C2 Daemon
During startup of a cluster, a Daemon is installed which creates a Streaming Pull thread to subscribe to the Cluster's Subscription. This daemon is responsible for responding to C2 messages and carrying out each message's request, including submitting jobs to SLURM to install Spack packages and to run users' jobs.
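A daemon like this typically reduces to a dispatch table keyed on the command attribute. The following is a minimal sketch under stated assumptions, not the daemon's actual implementation: the handler names and the `reply` callable are hypothetical, message data is assumed to be JSON, and answering PING with PONG is an assumption based on the command names. In a real daemon, `dispatch` would be called from the Streaming Pull callback with the received message's attributes and data.

```python
import json


def handle_ping(payload, reply):
    # Assumption: PING is answered with PONG carrying the same ackid.
    reply("PONG", {"ackid": payload["ackid"]})


def handle_sync(payload, reply):
    # A real daemon would synchronize log files here before acknowledging.
    reply("ACK", {"ackid": payload["ackid"]})


# Hypothetical dispatch table: command attribute -> handler.
HANDLERS = {"PING": handle_ping, "SYNC": handle_sync}


def dispatch(attributes, data, reply):
    """Decode one C2 message and run its handler. Returns True if handled."""
    handler = HANDLERS.get(attributes.get("command"))
    if handler is None:
        return False  # unknown command, nothing to do
    handler(json.loads(data), reply)
    return True
```

Long-running commands such as SPACK_INSTALL or RUN_JOB would fit the same shape, with the handler calling `reply("UPDATE", ...)` as the SLURM job progresses and `reply("ACK", ...)` once it finishes.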
Security
The C2 topic is created at deployment time, as well as the subscription for the Frontend. Topic creation permission is then no longer required by the Service Accounts of the Frontend or the Clusters.
When a Cluster is created, a new Service Account is created for that Cluster. This Service Account is then granted pubsub.publisher permissions to the C2 Topic, and pubsub.subscriber permissions to that cluster's own Subscription. The Service Accounts for the clusters are created without any Google PubSub IAM permissions, so these policy bindings on the topic and subscription are the only PubSub IAM permissions granted to the cluster's service accounts.
Unfortunately, the Frontend's Service Account must have either the role of pubsub.admin or a custom role, set at the project level. This is because creating subscriptions and setting IAM Policy Bindings are actions done at the project level, rather than attached to the topic.
By setting IAM policy bindings, we are able to grant permissions to service accounts which are associated with clusters which are in projects other than the base project where the Frontend resides.
For example, if the Frontend is in GCP Project Alpha, the C2 Topic will also be in Project Alpha. If a cluster is then created in project Beta, the Frontend will grant the cluster's Service Account IAM permissions within the Alpha project.
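The cross-project case can be sketched as plain data: the resources live in the Frontend's project, while the member is a service account from the cluster's project. This is an illustration only. The function name `c2_bindings` and the topic, subscription, and service account names are hypothetical, and the role placement follows the usual Pub/Sub convention (publisher on the topic, subscriber on the subscription).

```python
def c2_bindings(frontend_project, topic_id, subscription_id, cluster_sa_email):
    """Map each C2 resource to the IAM binding granted to one cluster."""
    member = f"serviceAccount:{cluster_sa_email}"
    topic = f"projects/{frontend_project}/topics/{topic_id}"
    subscription = f"projects/{frontend_project}/subscriptions/{subscription_id}"
    return {
        # The cluster publishes its responses to the shared C2 topic.
        topic: {"role": "roles/pubsub.publisher", "members": [member]},
        # The cluster pulls only from its own subscription.
        subscription: {"role": "roles/pubsub.subscriber", "members": [member]},
    }


# Frontend in project "alpha", cluster service account in project "beta".
bindings = c2_bindings(
    "alpha", "c2-topic", "c2-cluster-3",
    "cluster-3@beta.iam.gserviceaccount.com",
)
```

Both resource names sit under projects/alpha, even though the service account belongs to beta.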
Data Storage
Clusters automatically upload job logs to a GCS bucket, which is specified at cluster creation time. The Cluster's Service Account is granted ObjectAdmin permissions in order to create and update Log files in the GCS bucket.
The Frontend webserver displays log files from the GCS bucket.