Metagov
About the Role

You will be the technical owner of the platform's operational backbone. You will harden the platform for major launches, perform load testing, and build fallback routing and per-agent monitoring. You will implement end-to-end observability and integrated trace analysis across heterogeneous infrastructure, ship downtime warnings and fallback behavior, and implement routing transparency and endpoint provenance so users can verify which backend served their inference. You will improve performance of public endpoints, integrate programmatic infrastructure interfaces such as an MCP server, and make the utility more transparent and contributable. You will set priorities autonomously, operate production inference and ML serving infrastructure, and coordinate with cloud providers, HPC centers, and other infrastructure partners. Occasional travel for team workshops may be required.
Requirements

- Significant experience operating production inference or ML serving infrastructure (vLLM, model routing, multi-region deployments, GPU-backed services)
- Strong distributed systems and SRE instincts, including observability, incident response, fallback design, and capacity planning
- Comfort working across heterogeneous infrastructure partners, including cloud providers and HPC centers
- Experience orchestrating many stacks and integrating open-source projects
- Maintainer and integrator experience, with pride in operational excellence
- Ability to work autonomously in a small team and travel occasionally for workshops
Responsibilities

- Harden platform for launches
- Perform load testing
- Build fallback routing
- Set up per-agent monitoring
- Build end-to-end observability across stacks
- Ship downtime warnings and fallback behavior
- Implement routing transparency and endpoint provenance
- Improve production service performance
- Integrate MCP server or programmatic infrastructure interfaces
- Make infrastructure transparent and contributable
- Operate and maintain production inference and ML serving infrastructure
- Coordinate with heterogeneous infrastructure partners
- Orchestrate and integrate multiple open-source stacks