CSCI 7000 - Transformers for Robotics
Deep neural networks based on self-attention are revolutionizing robotics with their ability to perform "open world" reasoning across multiple modalities, including text and images, and to generate multi-modal data ranging from text and images to robot trajectories. This class starts with an introduction to the transformer architecture, using large language models as an example. We then introduce vision transformers and contrastive language-image pretraining (CLIP), which combines text and image information in a single model, and finally show how diffusion models can generate robot trajectories directly from a text prompt and an image of the visual scene. After implementing your own CLIP model and reviewing recent work in the robot-learning literature, you will explore the concepts from this class in a team-based research project.
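To give a flavor of the implementation work involved, the following is a minimal sketch of the symmetric contrastive objective at the heart of CLIP-style training. It assumes image and text embeddings have already been produced by their respective encoders; the function name, batch size, and embedding dimension are illustrative, not course-provided code.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric contrastive loss over a batch of matching image-text pairs.

        image_emb, text_emb: (batch, dim) tensors where row i of each tensor
        comes from the same image-text pair. Shapes are illustrative.
        """
        # Normalize so dot products become cosine similarities.
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (batch, batch) similarity matrix; diagonal entries are the true pairs.
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)

        # Cross-entropy in both directions: image-to-text and text-to-image.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)

    # Toy usage with random embeddings standing in for encoder outputs.
    if __name__ == "__main__":
        img = torch.randn(8, 512)
        txt = torch.randn(8, 512)
        print(clip_contrastive_loss(img, txt).item())

In the course project you would replace the random tensors with the outputs of a vision encoder and a text encoder trained jointly on your own image-caption data.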
Goals of this class
- Develop a fundamental understanding of transformer-based models, their applications to robotics, and their limitations
- Learn about complementary tools, in particular symbolic planning and probability
- Advance the current state of the art through an independent research project
Grading
- Attendance: 16 sessions x 2.5% each -> 40%
- Implement a CLIP model on your own data and write a blog article -> 20%
- Paper review and blog article -> 10%
- Final project -> 30%
Meetings
- Mondays, 10:10am - 12:40pm in ECEE 283