Location: New York, United States • London, United Kingdom (Office)
On-Site | Full-time
Compensation: Competitive
Our client is a high-growth software development organization and a key contributor to one of the largest and fastest-growing decentralized crypto social networks globally. The platform has achieved massive scale, generating significant revenue and global attention since its inception.
To support this rapid expansion and ensure the continuous uptime of its high-stakes, high-throughput environment, our client is seeking a battle-tested Site Reliability Engineering (SRE) Expert. This individual will be handed ambiguous, critical infrastructure challenges and will be trusted to navigate them end-to-end—scoping solutions, making sound architectural trade-offs, and executing with precision.
Key Responsibilities
- Own Foundation & Architecture: Design, scale, and maintain highly available cloud infrastructure, including multi-region or active-active patterns.
- Incident Response & Reliability: Lead critical incident response efforts, participate in real on-call rotations, and drive comprehensive, blameless post-mortems to continuously harden the system.
- Automation & Tooling: Write clean, production-grade automation code (Python, Go, or similar) for infrastructure tooling, operators, and seamless systems integration.
- Risk & Security Management: Exercise sharp judgment regarding system risks, balancing rapid deployment velocity with robust infrastructure safety and stability.
- Operational Excellence: Raise the engineering and operational bar across the organization through the implementation of rigorous standards, modern tooling, and technical mentorship.
Required Qualifications
- Core SRE & Infrastructure Focus: Deep expertise in infrastructure-as-code (Terraform/OpenTofu), network topology, high-availability architecture, and system internals.
- Proven Track Record: Experience building foundational infrastructure (ideally from 0→1) and running high-availability environments where reliability is treated with financial-system levels of seriousness.
- Cloud-Native Fluency: Advanced proficiency with modern cloud providers (AWS, GCP) and container orchestration platforms (Kubernetes).
- Pragmatic Problem Solver: Strong capacity to operate independently in high-stakes environments, deciding when to gather consensus versus when to execute autonomously.
Preferred Qualifications
- Security & Compliance: Experience with infrastructure security hardening, IAM architecture, or compliance mapping (e.g., SOC 2, ISO).
- Data & Streaming Infrastructure: Hands-on experience managing and scaling high-throughput, low-latency data backbones and event streaming systems (Kafka, Redpanda, PostgreSQL).
- Digital Asset Exposure: A working understanding of Web3/crypto infrastructure patterns and comfort operating within them.