Sr. Machine Learning Infrastructure Engineer, Optimus

Tesla
Palo Alto, California
Full Time

As a Software Engineer on the Optimus team, you will build the tools and infrastructure used to make and measure improvements to neural network architectures, visualize data, export and deploy neural networks to the bot, and evaluate experimental results. You will help us automate the entire training, validation, and production workflow for Optimus. Most importantly, you will see your work repeatedly shipped to and used by thousands of humanoid robots in real-world applications.

Build and improve our Python training infrastructure for more stable and faster training
Build the tooling and infrastructure for reporting and visualizing model metrics and performance
Build the pipelines to run and validate our PyTorch models
Manage, analyze, and visualize our training and test datasets
Coordinate with the team managing the hardware cluster to maintain high availability and job throughput for machine learning workloads
Build and improve tooling to deploy trained neural nets to Tesla hardware

Practical experience programming in Python and/or C++
Proficiency in system-level software, particularly hardware-software interactions and resource utilization
Understanding of modern machine learning concepts and state-of-the-art deep learning
Experience working with training frameworks, ideally PyTorch
Demonstrated experience scaling neural network training jobs across clusters of GPUs
Optional: Previous experience in deep learning deployment
Optional: Experience profiling and optimizing CPU-GPU interactions (pipelining compute and transfers, etc.)