Last updated: 2025-10-24
163 Site Reliability Engineering jobs in San Francisco.
Baseten
Join our dynamic team at Baseten, where we’re revolutionizing AI deployment with cutting-edge inference infrastructure. Backed by premier investors such as IVP…
San Francisco
- Skills: Site Reliability Engineer, Kubernetes, Scalable Infrastructure, Infrastructure-as-Code, CI/CD Tools, Project Management, Collaboration, Mentorship, Performance Optimization, Machine Learning
- Level: mid
- Type: full_time
Hive
Hive is the leading provider of cloud-based AI solutions to understand, search, and generate content, and is trusted by hundreds of the world's largest and mos…
San Francisco
- Skills: cloud-based AI solutions, machine learning, DevOps, Site Reliability, automation, enterprise SaaS, distributed computing, high performance computing, hybrid infrastructure, GPU integration
- Level: mid
- Type: full_time
NewsBreak
NewsBreak is redefining the way users interact with local news and their communities. By bridging local users, local content creators, and local businesses, ou…
Mountain View
- Skills: AWS, Kubernetes (EKS), EMR (Elastic MapReduce), service reliability, fault-tolerant architectures, Infrastructure-as-Code (IaC), CI/CD pipelines, monitoring tools (Prometheus, Grafana), high-availability strategies, incident response
- Level: mid
- Type: full_time
Luma AI
Palo Alto
- Skills: Site Reliability Engineer, SRE, Infrastructure, GPU clusters, H100 GPUs, Monitoring tools, Management tools, Performance problems, Maintenance problems, Data Processing
- Level: mid
- Type: full_time
Replit
Replit is the fastest way to turn ideas into software. With our powerful AI-powered Agent and Assistant, anyone can create and launch apps from natural languag…
Foster City
- Skills: Site Reliability Engineering, SRE, Infrastructure Automation, Monitoring Solutions, Infrastructure as Code, CI/CD Pipelines, Incident Management, Performance Optimization, Distributed Systems, Cloud-native Technologies
- Level: mid
- Type: full_time
Twitter
Twitter is a social media platform that allows users to post and interact with messages known as tweets.
San Francisco
- Skills: site reliability engineering, team leadership, engineering collaboration, technical design, reliability practices, coaching, team empowerment, personal development, cross-team communication, system scalability
- Level: mid
- Type: full_time
Coupang
Coupang is a leading force in South Korean commerce, known for its exceptional customer service and innovative approach to retail and e-commerce. The company b…
Mountain View
- Skills: observability solutions, monitoring, alerting, logging, tracing, Kubernetes, DevOps, SRE practices, cloud-based infrastructure, performance indicators
- Level: mid
- Type: full_time
Palo Alto Networks
Palo Alto Networks is a cybersecurity company that offers advanced firewalls and cloud-based security services to secure the digital transformation.
Santa Clara
- Skills: DevOps, Site Reliability Engineering, Cortex, Security, Engineering Management, Cloud, Platforms, Production Operations, AI, Software Development
- Level: mid
- Type: full_time
Orb
Orb is on a mission to revolutionize billing infrastructure for the modern era of AI and software. We empower businesses to align their monetization with produ…
San Francisco
- Skills: infrastructure, reliability, observability, scalability, performance-critical, event processing, cloud, AWS, resiliency, mentorship
- Level: mid
- Type: full_time
Celonis
Celonis helps some of the world’s largest and most esteemed brands make processes work for people, companies, and the planet. With over 5,000 enterprise custom…
Redwood City
- Skills: Site Reliability Engineering, SRE principles, observability, automation, incident prevention, cloud platforms, Java, Python, Kubernetes, error budgets
- Level: senior
- Type: full_time
Loft Orbital
Loft Orbital is revolutionizing access to space by building reliable, shareable satellites that drastically reduce the time and complexity traditionally requir…
San Francisco
- Skills: Site Reliability Engineering, Cloud Infrastructure, DevOps, satellites, space operations, integration, delivery, reliability, automated infrastructure, SatDevOps
- Level: mid
- Type: full_time
Crusoe Energy Systems
Crusoe is building the World’s Favorite AI-first Cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solution…
San Francisco
- Skills: Site Reliability Engineering, AI infrastructure, automation, monitoring, incident response, system performance, network programming, security best practices, CI/CD, cloud infrastructure
- Level: senior
- Type: full_time
Crusoe
Crusoe is building the World’s Favorite AI-first Cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solution…
San Francisco
- Skills: Site Reliability Engineering, AI infrastructure, production systems, system reliability, automation, monitoring, Unix/Linux, Cloud, Kubernetes, CI/CD
- Level: mid
- Type: full_time
Speak
Speak is on a journey to fix the language learning experience by creating AI-powered conversational tools to help billions gain fluency.
San Francisco
- Skills: reliability, infrastructure, Kubernetes, GCP, Node.js, PostgreSQL, Redis, observability, incident response, scalability
- Level: senior
- Type: full_time
Rubrik
Rubrik (NYSE: RBRK) is on a mission to secure the world’s data. With Zero Trust Data Security™, we help organizations achieve business resilience against cyber…
Palo Alto
- Skills: Site Reliability Engineering, Relational Databases, SQL, Kubernetes, Golang, Python, Java, Scalability, Disaster Recovery, FedRAMP
- Level: mid
- Type: full_time
Genmo
We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI.
San Francisco
- Skills: GPU clusters, Kubernetes operations, Infrastructure-as-Code, GitOps workflows, CI/CD pipelines, observability stack, high-performance networking, NVIDIA DCGM, containerized GPU stacks, distributed training
- Level: mid
- Type: full_time
Figma
Figma helps entire product teams brainstorm, design and build better products — from start to finish. Whether it’s consolidating tools, simplifying workflows, …
San Francisco
- Skills: Infrastructure, AWS, Distributed Systems, Telemetry, Scale, Reliability, Durability, Performance, Engineering, Software Development
- Level: mid
- Type: full_time
Anomali
Anomali is headquartered in Silicon Valley and is the Leading AI-Powered Security Operations Platform that is modernizing security operations. At the center of…
Redwood City
- Skills: Kubernetes, Terraform, CI/CD, AWS, New Relic, Python, Golang, EKS, Automation, Infrastructure as Code
- Level: mid
- Type: full_time
Column Tax
We’re building the next generation of tax software. Our mission is to make it possible for every taxpayer to file confidently in just one click. As the fastest…
San Francisco
- Skills: Infrastructure, AWS, GCP, Terraform, CI/CD, Observability, Cloud operations, On-call, System performance, DevOps
- Level: mid
- Type: full_time
Crusoe Energy Systems
Crusoe is building the World’s Favorite AI-first Cloud infrastructure company, pioneering vertically integrated, purpose-built AI infrastructure solutions powe…
San Francisco
- Skills: Site Reliability Engineering, cloud storage, automation, distributed systems, Infrastructure as Code, Linux internals, Kubernetes, object storage, performance tuning, data replication
- Level: mid
- Type: full_time