About the Role

As a Senior Network Engineer, you will design, deploy, maintain, optimize, and troubleshoot high-performance AI infrastructure networks that support GPU compute clusters, storage fabrics, and large-scale AI workloads. You will focus on High-Speed Ethernet, InfiniBand, EVPN/VXLAN fabrics, BGP routing, GPU cluster networking, storage integration, and secure multi-tenant AI environments. You will work with hyperscale or HPC networking, AI cluster deployments, data center operations, and advanced troubleshooting across compute, storage, and network fabrics. You will coordinate with colleagues across the compute, storage, and network domains to ensure reliable and scalable AI infrastructure.

Requirements

5+ years of enterprise or data center networking experience
2+ years supporting AI, HPC, or GPU cluster environments
Strong experience with BGP, EVPN/VXLAN, VLANs, MLAG, High-Speed Ethernet (100G/200G/400G/800G), InfiniBand, and Cumulus
Experience with NVIDIA UFM and AI fabric management
Strong Linux administration skills
Experience troubleshooting GPU cluster communication issues
Experience with enterprise firewalls and network segmentation
Understanding of AI workload traffic patterns and storage networking
Experience with WEKA, RoCEv2, NCCL, Kubernetes, Slurm, and CUDA environments
Experience deploying SMDCBBS (Supermicro Datacenter Building Block Solutions)
Familiarity with liquid-cooled GPU infrastructure
Experience with AI inference and training environments
Scripting or automation experience with Python, Bash, Ansible, and API integrations

Responsibilities

Design and maintain High-Speed Ethernet and InfiniBand fabrics for GPU clusters
Deploy and manage EVPN/VXLAN spine-leaf architectures
Configure and maintain BGP routing, VRFs, VLANs, MLAG, and high-availability networking
Design resilient AI infrastructure supporting multi-tenant environments and large-scale GPU workloads
Optimize east-west traffic flows for AI training and inference workloads
Support NVIDIA GPU cluster networking, including NCCL optimization, GPUDirect RDMA, RoCEv2, and InfiniBand subnet management
Troubleshoot cluster communication issues, including link flapping, congestion, latency, and throughput bottlenecks
Execute and validate NCCL performance testing
Deploy and support high-performance storage platforms, including WEKA and distributed AI storage systems
Configure storage VLANs, passthrough networking, and bonded interfaces
Optimize storage throughput and low-latency communication between compute and storage environments
Configure and maintain enterprise firewalls, including NAT, VIPs, VPN, IPsec, traffic shaping, and security segmentation
Implement secure multi-tenant access controls
Assist with AI governance and controlled AI integration environments
Develop infrastructure automation for network provisioning, firewall policy deployment, VLAN assignments, and server imaging
Implement monitoring and alerting through Grafana and telemetry systems
Support API-driven infrastructure management and orchestration
Assist with deployment and operational planning for AI data center infrastructure
Support Tier III resiliency planning and redundancy validation
Coordinate with facilities, power, mechanical, and external ISP providers

Benefits

Medical
Dental
Vision
Life insurance
Short-term disability
Long-term disability
Paid time off

Senior Network Engineer
