An easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications.
Nomad is a simple and flexible workload orchestrator for deploying and managing containers (Docker, Podman), non-containerized applications (executables, Java), and virtual machines (QEMU) at scale across on-premises and cloud environments.
Nomad provides several key features:
- Deploy Containers and Legacy Applications: Nomad’s flexibility as an orchestrator enables an organization to run containers, legacy, and batch applications together on the same infrastructure. Through pluggable task drivers, Nomad brings core orchestration benefits to legacy applications without requiring them to be containerized.
- Simple & Reliable: Nomad runs as a single binary and is entirely self-contained, combining resource management and scheduling into a single system. Nomad does not require any external services for storage or coordination. Nomad automatically handles application, node, and driver failures. Nomad is distributed and resilient, using leader election and state replication to provide high availability in the event of failures.
- Device Plugins & GPU Support: Nomad offers built-in support for GPU workloads such as machine learning (ML) and artificial intelligence (AI). Nomad uses device plugins to automatically detect and utilize resources from hardware devices such as GPUs, FPGAs, and TPUs.
- Federation for Multi-Region, Multi-Cloud: Nomad was designed to support infrastructure at a global scale. Nomad supports federation out-of-the-box and can deploy applications across multiple regions and clouds.
- Proven Scalability: Nomad is optimistically concurrent, which increases throughput and reduces latency for workloads. Nomad has been proven to scale to clusters of 10K+ nodes in real-world production environments.
- HashiCorp Ecosystem: Nomad integrates seamlessly with Terraform, Consul, and Vault for provisioning, service discovery, and secrets management, respectively.
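The "optimistically concurrent" scheduling mentioned under Proven Scalability can be illustrated with a small, generic Go sketch (Go being the language Nomad is written in). It is not Nomad's scheduler code: the `nodeState` type, its single `freeCPU` resource, and the `place`/`commit` helpers are hypothetical. The idea shown is that a scheduler plans against a snapshot without holding a lock, then commits only if the state has not changed since that snapshot, and re-plans on conflict.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errConflict reports that the state changed after a plan was computed.
var errConflict = errors.New("plan conflicts with newer state")

// nodeState is a hypothetical, simplified view of one node's free CPU,
// versioned by an index that increases on every committed plan.
type nodeState struct {
	mu      sync.Mutex
	index   uint64
	freeCPU int
}

// snapshot returns the index and free CPU a scheduler plans against.
func (n *nodeState) snapshot() (index uint64, freeCPU int) {
	n.mu.Lock()
	defer n.mu.Unlock()
	return n.index, n.freeCPU
}

// commit applies a plan only if no other plan was committed since the
// snapshot it was computed from (a compare-and-swap on the index).
func (n *nodeState) commit(seenIndex uint64, cpu int) error {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.index != seenIndex {
		return errConflict
	}
	if cpu > n.freeCPU {
		return errors.New("insufficient CPU")
	}
	n.freeCPU -= cpu
	n.index++
	return nil
}

// place plans without holding the lock and retries when it loses a race.
func place(n *nodeState, cpu int) error {
	for {
		idx, free := n.snapshot()
		if cpu > free {
			return errors.New("insufficient CPU")
		}
		err := n.commit(idx, cpu)
		if !errors.Is(err, errConflict) {
			return err // nil on success, or a non-retryable error
		}
		// Another scheduler committed first; re-plan against fresh state.
	}
}

func main() {
	n := &nodeState{freeCPU: 1000}
	var wg sync.WaitGroup
	// Ten concurrent "schedulers" each try to place a 100-unit task.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := place(n, 100); err != nil {
				fmt.Println("placement failed:", err)
			}
		}()
	}
	wg.Wait()
	_, free := n.snapshot()
	fmt.Println("free CPU after placements:", free)
}
```

Because planning happens outside the lock, many schedulers can work in parallel and only the short commit step is serialized. A production scheduler may re-check only the resources a plan touches rather than requiring an unchanged global index; the strict index check here just keeps the sketch small.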
SECURITY:
security: Upgrade tooling to Go 1.25.8 [GH-27653]
IMPROVEMENTS:
acl (Enterprise): Added a sentinel policy block to allow managing Sentinel policies without a management token [GH-27556]
acl: Added fine-grained ACL capabilities for saving snapshots and reading the Enterprise license [GH-27525]
acl: Added fine-grained ACL capability for rotating the keyring [GH-27526]
agent: Added agent.tls.cert.expiration_seconds and agent.tls.ca.expiration_seconds telemetry data points to track TLS certificate expiration. [GH-27538]
cli: Added autocompletions for ACL auth method, binding rule, policy, and token subcommands [GH-27505]
cli: Improved option autocompletions for various commands [GH-27506]
cli: Reduced server overhead when dispatching jobs or forcing periodic jobs from the CLI [GH-27631]
cli: Truncated results when job commands return a large set of jobs that match the provided ID prefix [GH-27631]
consul (Enterprise): Added the ability to specify cluster-specific Consul tokens with environment variables [GH-27574]
events: Added a Deleted flag to the JobDeregistered event type to differentiate between stopped and deleted jobs [GH-27614]
BUG FIXES:
acl: Fixed a bug where a bearer-token authenticated request could panic the handler for checking claims [GH-27550]
artifact: Fixed artifact inspection when using file mode [GH-27552]
config: Fixed a bug where the keyring block could only be specified a maximum of two times [GH-27579]
config: Fixed parsing of Vault and Consul blocks as JSON that included objects such as task_identity [GH-27595]
consul: Fixed a bug where clients were passing the node token to the Connect Envoy container, causing ACL not found errors [GH-27574]
core: Fixed system jobs being rescheduled after a node is drained and marked eligible again [GH-27499]
deployments: Fixed a bug where a task group dropped from a system job could cause deployment state to be overwritten incorrectly [GH-27604]
deployments: Fixed a bug where system job canary state could be incorrectly changed after a promotion [GH-27497]
deployments: Fixed a bug where system job deployments would not be marked healthy even though all allocations were healthy [GH-27497]
drivers: Passed through the error when one is included in a fingerprint response [GH-27537]
dynamic host volumes: Fixed a bug with sticky volumes where replacement allocations would not use the previous volume claim [GH-27613]
http: Ensure the correct HTTP protocol version is set on event stream responses [GH-27586]
job status: Fixed a regression in setting job status when jobs have a matching prefix [GH-27516]
keyring (Enterprise): Fixed a bug where, in mixed-version clusters with pre-1.9 servers, a keyring rotation that returns an error for an unavailable KMS could prevent future server restarts [GH-27581]
scheduler: Fixed a potential panic in the system scheduler when deploying jobs with multiple task groups and infeasible nodes that become feasible [GH-27571]
scheduler: Fixed a bug where system deployments would not complete on clusters with pre-1.11.0 nodes [GH-27605]
state: Fixed a potential state store corruption bug in the service/batch scheduler and deployment watcher [GH-27548]
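Two entries above concern job IDs that share a prefix: CLI result truncation when many jobs match a prefix, and the job status regression with matching prefixes. The generic Go sketch below shows prefix-based ID resolution that prefers an exact match and caps the number of reported candidates. It is an illustration only. The function `resolveJobID` and the `maxMatches` limit are hypothetical, and the changelog does not specify Nomad's actual matching rules or truncation limit.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// maxMatches is a hypothetical cap on how many ambiguous candidates are reported.
const maxMatches = 3

// resolveJobID selects a single job ID by prefix. An exact match wins even when
// other IDs share the prefix (e.g. "web" vs "web-canary"). If the prefix is
// ambiguous, it returns up to maxMatches sorted candidates and reports whether
// the list was truncated. No match yields an empty ID and no candidates.
func resolveJobID(ids []string, prefix string) (id string, candidates []string, truncated bool) {
	var matches []string
	for _, candidate := range ids {
		if candidate == prefix {
			return candidate, nil, false
		}
		if strings.HasPrefix(candidate, prefix) {
			matches = append(matches, candidate)
		}
	}
	if len(matches) == 1 {
		return matches[0], nil, false
	}
	sort.Strings(matches)
	if len(matches) > maxMatches {
		return "", matches[:maxMatches], true
	}
	return "", matches, false
}

func main() {
	ids := []string{"web", "web-canary", "web-api", "web-batch", "web-cron", "db"}

	id, _, _ := resolveJobID(ids, "web")
	fmt.Println(id) // web

	_, candidates, truncated := resolveJobID(ids, "web-")
	fmt.Println(candidates, truncated) // [web-api web-batch web-canary] true
}
```

Checking for an exact match before treating the input as a prefix avoids picking, or refusing, the wrong job when one ID is a prefix of another, and capping the candidate list keeps output bounded when a short prefix matches many jobs.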