About Pageshift
Pageshift is a research lab committed to pushing the frontier of AI storytelling and creativity. We envision a world in which most entertainment is personalized and AI-generated, and our goal is to build the underlying story engine that powers it all. To get there, we are not afraid to explore new approaches and create novel categories of model capability.

About the role
You will build and maintain a cluster-scale ML training codebase for ultra-long-context language model training, spanning supervised fine-tuning (SFT) and reinforcement learning (RL). You will implement and iterate on custom training loops, improve stability and performance in distributed environments, and modify existing LLM architectures to better support long-context training.
You will work hands-on across debugging, profiling, and experimentation: identifying bottlenecks, fixing scaling issues, and turning new ideas into reliable training runs. You will independently design, implement, run, and evaluate smaller experiments end-to-end, and use the results to guide subsequent training and architecture changes.

You are a good fit if:
- Passion for entertainment and storytelling
- Willingness to work on tough problems instead of easy, hyped ones
- Good understanding of ML and LLM fundamentals (Transformers, attention, tokenizers, GPT training objectives)
- Experience training or fine-tuning LLMs
- Experience with either JAX or PyTorch
- Generally up to date with current models in the ML space

Nice to have:
- Example project to show off (API prompting projects do not count)
- Experience working with or implementing distributed training systems

Your responsibilities:
- Implement and maintain a cluster-scale ML training codebase for SFT and RL training
- Implement custom training loops
- Modify existing LLM architectures
- Run smaller experiments
