Software Engineer - Distributed Systems
GW435
Posted: 29/04/2026
- $250,000-$270,000
- San Francisco, CA
- Permanent
About the job
We’re working with a well-funded Series A company building a new class of cloud infrastructure for AI. They’re tackling a fundamental problem: today’s AI systems are tightly coupled to specific hardware, creating limits in cost, scale, and efficiency.
Their approach decouples workloads from hardware, dynamically partitioning and scheduling them across heterogeneous compute (GPUs, other accelerators, multi-generation systems). This is deep, production-grade distributed systems work operating at real scale.
What you’ll do
- Own core distributed systems from design through build, deployment, and operation
- Design scheduling, routing, and resource management systems across thousands of nodes
- Build production-grade control planes and APIs for workload orchestration
- Make explicit tradeoffs around performance, reliability, and efficiency at scale
- Debug complex distributed failures and continuously improve system behaviour
What makes this interesting
- High ownership: you’re building foundational infrastructure, not abstracted layers
- Real scale: systems designed for large, multi-cluster / datacenter environments
- Hard problems: concurrency, scheduling, failure modes, and resource allocation
- Heterogeneous compute: working beyond standard cloud abstractions
- Early-stage: opportunity to shape architecture with real production constraints
We’re looking for
- Engineers who have built or operated distributed systems in production
- Strong fundamentals in concurrency, systems design, and failure handling
- Evidence of ownership over meaningful systems (not just contributions)
- Comfort reasoning about tradeoffs in large-scale environments
- Ability to clearly explain design decisions and system behaviour
It’s not necessary, but it’s great if you have:
- Experience with Kubernetes or similar systems beyond basic usage
- Background in scheduling, queues, or resource management systems
- Experience designing service-oriented architectures (RPC, async systems)
- Systems-level programming experience (e.g. Go, C++, Python)
Anna Heneghan
Senior ML Research & Engineering Recruiter