About the Role
You will build and maintain the infrastructure and pipelines that enable production AI systems. You will design CI/CD workflows for model training, evaluation, and deployment, automate model versioning and approval workflows, and implement compliance and observability tooling. You will integrate and evaluate state-of-the-art LLM and agent tools, deploy scalable model serving, monitor cost, latency, and performance, and run offline and online evaluations, including human-in-the-loop processes. You will provide reproducible sandboxes and dashboards so researchers and engineers can iterate quickly and reliably.
Requirements
Write high-quality, maintainable software, primarily in Python
Experience with containerization and orchestration, such as Docker and Kubernetes
Experience with infrastructure-as-code and deployment tooling, such as Terraform and CI/CD pipelines
Experience with monitoring and logging frameworks, such as Datadog, Prometheus, and OpenTelemetry
Implement MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
Experience with scalable model and agent serving infrastructure, such as vLLM, Triton, and BentoML
Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance and capturing traces
Strong ownership, pragmatism, and ability to balance infrastructure elegance with iterative delivery
Responsibilities
Build reusable CI/CD workflows for model training, evaluation, and deployment
Automate model versioning, approval workflows, and compliance checks
Build modular and scalable AI infrastructure, including vector database, feature store, model registry, and observability tooling
Embed AI models and agents into real-time applications and workflows
Continuously evaluate and integrate state-of-the-art AI tools
Drive AI reliability, governance, and uptime
Ensure data accuracy, consistency, and reliability for training and inference
Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
Provide sandboxes, dashboards, and reproducible environments for researchers
Benefits
Equity plan eligibility