About the Role
You will design, build, and operate the infrastructure that supports large-scale AI and agent systems. You will create reusable CI/CD workflows for training, evaluation, and deployment; automate model versioning, approvals, and compliance checks; and assemble modular stacks including vector databases, feature stores, and model registries. You will integrate and evaluate cutting-edge LLM tools, instrument observability and monitoring, and deploy online and offline evaluation pipelines with regression testing, cost monitoring, and human-in-the-loop workflows. You will collaborate with engineers and data scientists to embed models and agents into real-time applications, provide sandboxes and reproducible environments for researchers, and continuously improve model performance, reliability, and governance.
Requirements
Write high-quality, maintainable software, primarily in Python
Strong background in scalable infrastructure, including Docker and Kubernetes
Experience with infrastructure as code and deployment tools such as Terraform and CI/CD pipelines
Familiarity with monitoring and logging frameworks such as Datadog, Prometheus, and OpenTelemetry
Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
Experience with scalable model and agent serving infrastructure such as vLLM, Triton, and BentoML
Experience deploying and maintaining LLM and agentic workflows in production, including cost, latency, and performance monitoring
Experience capturing traces for analysis, debugging, and optimizing prompt-response flows with real-time data access
Strong ownership, pragmatism, and ability to balance infrastructure design with iterative delivery
Responsibilities
Build reusable CI/CD workflows for model training, evaluation, and deployment
Automate model versioning, approval workflows, and compliance checks
Build modular, scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
Partner with engineering and data science to embed AI models and agents into real-time applications and workflows
Continuously evaluate and integrate state-of-the-art AI tools and frameworks
Drive AI reliability and governance to ensure compliance, security, and uptime
Ensure data accuracy, consistency, and reliability for model training and inference
Deploy infrastructure to support offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
Enable researchers with sandboxes, dashboards, and reproducible environments
Improve AI and ML model performance
Benefits
Remote work
Eligibility to participate in TRM's equity plan