Job Details

Senior SRE Engineer, AI Infrastructure

Santa Clara, California, United States
We are looking for a highly motivated Senior Software Engineer, Developer Productivity - Engineering Effectiveness, to join our team in the AI infrastructure group. This is an excellent opportunity to architect and drive the service reliability strategy for a Deep Learning platform powering development of the next generation of Perception for autonomous vehicles at NVIDIA! If you are passionate about SRE, Kubernetes, improving the productivity of your colleagues, and working with new technologies and trends, you will find a broad range of tools, APIs, and platforms in use here.

What you'll be doing:

* As part of the Engineering Effectiveness team, you will propose and craft new ways to monitor and improve the reliability of the AI platform services.

* Drive improvements in engineering effectiveness across AV development teams.

What we need to see:

* BS or MS in CS, CE, or EE, or equivalent experience.

* 10+ years of Engineering Effectiveness experience in the cloud and on-prem.

* Versatility in at least one programming language, such as Go or Python.

* Complete understanding of Kubernetes and Docker internals.

* Ability to deal with scale by applying Terraform and Ansible.

* Experience deploying and managing services in the cloud and in the datacenter.

* Deep knowledge of networking fundamentals.

* Expertise in problem solving and complexity analysis for large distributed systems.

* Proficiency with the Linux environment.

* Excellent written and verbal interpersonal skills.

Ways to stand out from the crowd:

* Help our team to develop a better stack of Go automation APIs.

* Be a meticulous organizer with an ever positive, can-do attitude.

* Demonstrate out-of-the-box thinking to find creative solutions to highly sticky problems.

* Be a fun and hardworking teammate who enjoys a challenge and celebrates success.

* Previous experience leading Engineering Effectiveness work on GPU clusters of 100+ nodes.

* System design experience for distributed or cloud compute.

* Knowledge of middleware and backend systems like Redis, MongoDB, Kafka, and Elasticsearch.

* Extensive experience with monitoring and alerting systems such as Prometheus.

* Proficiency in crafting highly available applications and in modern deployment strategies like CI/CD, canary releases, and blue/green deployments.

For two decades, we have pioneered visual computing, the art and science of computer graphics. With our invention of the GPU - the engine of modern visual computing - the field has expanded to encompass video games, movie production, product design, medical diagnosis and scientific research. Today, we stand at the beginning of the new AI computing era, ignited by a new computing model, GPU deep learning. This new model - where deep neural networks are trained to recognize patterns from extensive amounts of data - has proven to be deeply effective at solving some of the most ambitious problems in everyday life.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
