Cambridge Residency Programme: Next-Generation AI Datacentre Networking
Microsoft, Newtown, Cambridge
- Full time
- Temporary
- Onsite working
Posted: 8 Jun
Closing date: not specified
Job ref: fedc52b8b9e14055894a4ccc6e3cb6a7
Location ref: Newtown, Cambridge
Full Job Description
Track A - Modelling & Simulation
Best suited to candidates whose primary strength is analytical reasoning, performance modelling, or simulation.
- Design and analyse novel network architectures (e.g., hybrid optical-electrical, reconfigurable topologies) tailored for AI communication patterns.
- Develop analytical models and simulators to quantify the performance, cost, and energy trade-offs of proposed designs.
- Study architectural trade-offs involving topology, transport, collective communication, and emerging optical/networking hardware.
- Collaborate with systems researchers to compare model predictions with testbed measurements.
- Evolve existing evaluation tools and frameworks to address new research questions and scenarios relevant to product teams.
Track B - Systems Implementation & Experimental Validation
Best suited to candidates whose primary strength is building and evaluating real systems on experimental platforms.
- Implement and evaluate network protocols, transport mechanisms, and collective communication schemes on experimental hardware testbeds featuring modern GPUs, optical circuit switches, and RDMA interconnects.
- Build and run communication-intensive workloads (e.g., collective algorithm benchmarks, distributed training/inference jobs) to stress-test new network designs.
- Co-design and validate new protocols and algorithms with modelling collaborators.
- Drive experimental validation on the group's testbed and contribute to its continued evolution.
- Expand existing tools and prototypes to address scenarios relevant to both research and product teams.
In both tracks, you will publish findings at top-tier academic venues and contribute to Microsoft's long-term AI infrastructure strategy.
- PhD in Computer Science, Computer Engineering, Electrical Engineering, Applied Mathematics, Operations Research, or a related field.
- Evidence of independent research, such as first-author publications, strong thesis work, or impactful prototypes.
- Ability to communicate research clearly through papers, talks, and cross-functional collaboration.
- Strength in at least one of the following areas:
  - Modelling & simulation (Track A): Demonstrated experience in analytical modelling, simulation, or performance evaluation of networks or distributed systems (e.g., queueing models, flow-level simulation, stochastic models, LP-based analysis, or alpha-beta models).
  - Systems implementation (Track B): Strong systems programming skills in C++/CUDA/Python, with hands-on experience building or evaluating networked systems, distributed systems, or AI training/inference infrastructure.
- Experience with datacentre network architectures, transport protocols, or collective communication.
- Familiarity with circuit-switched or optical networking concepts (e.g., optical circuit switches, co-packaged optics).
- Understanding of AI/ML workload communication patterns (e.g., all-reduce, MoE routing, pipeline parallelism).
- Experience building simulators, evaluation frameworks, or experimental prototypes.
- Proficiency in Python and familiarity with scientific computing libraries (NumPy, SciPy, pandas).
- Experience in one or more of the following systems areas:
  - High-performance networking: RDMA (RoCEv2, InfiniBand), transport protocol implementation, or congestion control.
  - GPU and distributed ML communication: CUDA programming, NCCL, or experience with ML training/inference systems (e.g., PyTorch, Megatron, vLLM).
  - Experimental infrastructure: Building or managing hardware testbeds, measurement and profiling.
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft Research Cambridge is hiring two researchers for its Cambridge Residency Programme (two-year postdoctoral positions) to advance the design and evaluation of next-generation datacentre networks for AI workloads. We are seeking to hire a collaborative pair of researchers with complementary profiles: one focused on analytical modelling and simulation, the other on systems implementation and experimental validation.