GOVSTAR

AI Cloud Platform Engineer (Remote)

💰 $140,000 - $160,000 🌍 Delaware; North Carolina; Virginia; Washington, District of Columbia; Maryland; Arlington, Virginia 📅 06/22/2026

Job Description

Looking for PURPOSE? At GovStar, we found it. We love the work we do, the
people we do it with, and the impact we have. We're looking for talented
technologists and creatives with the character, courage, and commitment to
achieve greatness and grow together.

We're a tight group of individuals coming together to produce Legendary Teams.
We are using technology and creativity to unlock the future of government and
shape a brighter future for the country we love.

Discover us @ [www.govstar.us](http://www.govstar.us)

GovStar seeks an **AI Cloud Platform Engineer** to design, secure, and operate
the cloud foundation for mission-scale AI/LLM workloads. You will own the
cloud landing zones, networking, identity, security, and compliance that
enable reliable, cost-effective LLM serving and data services.

### **Responsibilities**

Responsibilities include, but are not limited to:

* Design and operate secure, multi-account/tenant cloud landing zones for AI workloads (Azure primary, AWS secondary), including network segmentation, private connectivity, and egress controls.
* Provision and manage GPU-optimized compute and storage for training and inference (AKS/EKS GPU node pools, VM scale sets, blob/S3 object storage).
* Implement Infrastructure as Code with Terraform for all environments; enforce policy-as-code and guardrails.
* Establish identity, secrets, and access controls (Entra ID/AWS IAM, RBAC, Key Vault/KMS, HSM/PKI).
* Build observability and SRE practices (metrics, logs, tracing, alerting, runbooks, incident response, SLIs/SLOs).
* Harden environments for compliance (e.g., FedRAMP/FISMA-style controls), including vulnerability management and continuous compliance monitoring.
* Enable CI/CD for infrastructure and configuration (GitOps, automated drift detection, change management).
* Partner with AI Kubernetes Engineers to support reliable LLM deployment strategies (blue/green, canary, rollout/rollback) and capacity planning.
* Optimize cloud spend for GPU and storage workloads; forecast capacity and implement reservations/savings plans.
* Document architectures, standards, and operational procedures; contribute to knowledge sharing and training.

### **Minimum Requirements**

* Must be a U.S. citizen.
* Must currently reside in the United States.
* Ability to obtain and maintain a Public Trust clearance.
* BA/BS in Computer Science or a related field.
* At least 10 years of relevant professional experience.
* Strong hands-on cloud engineering experience in one or more of Azure, AWS, or GCP, including networking, identity, security, and automation.
* Expert Terraform skills; proven experience managing production infrastructure via IaC.
* Experience operating production Kubernetes clusters (AKS/EKS) supporting AI/LLM workloads.
* Proficiency with containers (Docker) and CI/CD.
* Solid scripting/programming abilities in Python or similar.
* Experience implementing observability (metrics, logging, tracing) and on-call operations.

### **Preferred Requirements**

* Experience with Azure services supporting AI solutions (e.g., Azure OpenAI, Document Intelligence, Azure App Service, Azure Machine Learning).
* Experience with AWS AI/ML services (e.g., Bedrock).
* GPU platform operations (NVIDIA drivers/CUDA, MIG, NCCL, multi-node scheduling).
* Data services for AI applications (PostgreSQL, Cosmos DB, Redis), and object storage lifecycle design.
* Network design for secure workloads (VNets/VPCs, Private Link/Private Endpoints, ExpressRoute/Direct Connect, WAF).
* Security tooling (Azure Policy, Defender for Cloud, AWS Security Hub), zero-trust patterns, and secrets management (Vault).
* Experience supporting U.S. federal environments and working within regulated cloud controls.

### **Relevant Technologies**

* Cloud: Azure (AKS, VNets, Private Link, ExpressRoute, AML, ACR, Key Vault, Azure Monitor), AWS (EKS, VPC, PrivateLink, CloudWatch, ECR, KMS, Bedrock)
* IaC: Terraform
* Containers/Orchestration: Kubernetes, Helm
* Observability: Prometheus, Azure Monitor, CloudWatch
* Data: PostgreSQL, Cosmos DB, Redis, Blob/S3
* CI/CD and Version Control: GitLab, GitLab CI/CD