CSCI 7000 - Transformers for Robotics
Deep neural networks based on self-attention are revolutionizing robotics with their ability to perform "open world" reasoning across multiple modalities, including text and images, and to generate multi-modal data ranging from text and images to robot trajectories. This class starts with an introduction to the transformer architecture, using large language models as an example. We then introduce vision transformers and contrastive language-image pretraining (CLIP), which combines text and image information in a single model, and finally show how diffusion models can generate robot trajectories directly from a text prompt and an image of the visual scene. After implementing your own CLIP model and reviewing recent work in the robot-learning literature, you will explore the concepts from this class in a team-based research project.
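To give a flavor of the implementation work involved, the following is a minimal sketch of the symmetric contrastive objective at the heart of CLIP-style training. It assumes image and text embeddings have already been produced by their respective encoders; the function name, batch size, and embedding dimension are illustrative, not course-provided code.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric contrastive loss over a batch of matching image-text pairs.

        image_emb, text_emb: (batch, dim) tensors where row i of each tensor
        comes from the same image-text pair. Shapes are illustrative.
        """
        # Normalize so dot products become cosine similarities.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (batch, batch) similarity matrix; diagonal entries are the true pairs.
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)

        # Cross-entropy in both directions: image-to-text and text-to-image.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)

    # Toy usage with random embeddings standing in for encoder outputs.
    if __name__ == "__main__":
        img = torch.randn(8, 512)
        txt = torch.randn(8, 512)
        print(clip_contrastive_loss(img, txt).item())

In the course project you would replace the random tensors with the outputs of a vision encoder and a text encoder trained jointly on your own image-caption data.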
Goals of this class
- Develop a fundamental understanding of transformer-based models, their applications to robotics, and their limitations
- Learn about complementary tools, in particular symbolic planning and probability
- Advance the current state of the art through an independent research project
Grading
- Attendance: 16 sessions x 2.5% each -> 40%
- Implement a CLIP model on your own data and write a blog article -> 20%
- Paper review and blog article -> 10%
- Final project -> 30%
Meetings
- Mondays, 10:10am - 12:40pm in ECEE 283