Last updated: 2025-10-24
93 Site Reliability Engineering jobs in San Jose.
NewsBreak
NewsBreak is redefining the way users interact with local news and their communities. By bridging local users, local content creators, and local businesses, ou…
Mountain View
- Skills: AWS, Kubernetes (EKS), EMR (Elastic MapReduce), service reliability, fault-tolerant architectures, Infrastructure-as-Code (IaC), CI/CD pipelines, monitoring tools (Prometheus, Grafana), high-availability strategies, incident response
- Level: mid
- Type: full_time
Luma AI
Palo Alto
- Skills: Site Reliability Engineer, SRE, Infrastructure, GPU clusters, H100 GPUs, Monitoring tools, Management tools, Performance problems, Maintenance problems, Data Processing
- Level: mid
- Type: full_time
Replit
Replit is the fastest way to turn ideas into software. With our powerful AI-powered Agent and Assistant, anyone can create and launch apps from natural languag…
Foster City
- Skills: Site Reliability Engineering, SRE, Infrastructure Automation, Monitoring Solutions, Infrastructure as Code, CI/CD Pipelines, Incident Management, Performance Optimization, Distributed Systems, Cloud-native Technologies
- Level: mid
- Type: full_time
Coupang
Coupang is a leading force in South Korean commerce, known for its exceptional customer service and innovative approach to retail and e-commerce. The company b…
Mountain View
- Skills: observability solutions, monitoring, alerting, logging, tracing, Kubernetes, DevOps, SRE practices, cloud-based infrastructure, performance indicators
- Level: mid
- Type: full_time
Palo Alto Networks
Palo Alto Networks is a cybersecurity company that offers advanced firewalls and cloud-based security services to secure the digital transformation.
Santa Clara
- Skills: DevOps, Site Reliability Engineering, Cortex, Security, Engineering Management, Cloud, Platforms, Production Operations, AI, Software Development
- Level: mid
- Type: full_time
NetApp
NetApp is the intelligent data infrastructure company, turning a world of disruption into opportunity for every customer. No matter the data type, workload or …
San Jose
- Skills: Cloud, Software Engineering, SRE, Incident Management, Observability, Application Security, Python, Golang, DevSecOps, Virtualization
- Level: mid
- Type: full_time
Celonis
Celonis helps some of the world’s largest and most esteemed brands make processes work for people, companies, and the planet. With over 5,000 enterprise custom…
Redwood City
- Skills: Site Reliability Engineering, SRE principles, observability, automation, incident prevention, cloud platforms, Java, Python, Kubernetes, error budgets
- Level: senior
- Type: full_time
Rubrik
Rubrik (NYSE: RBRK) is on a mission to secure the world’s data. With Zero Trust Data Security™, we help organizations achieve business resilience against cyber…
Palo Alto
- Skills: Site Reliability Engineering, Relational Databases, SQL, Kubernetes, Golang, Python, Java, Scalability, Disaster Recovery, FedRAMP
- Level: mid
- Type: full_time
Anomali
Anomali is headquartered in Silicon Valley and is the leading AI-powered security operations platform that is modernizing security operations. At the center of…
Redwood City
- Skills: Kubernetes, Terraform, CI/CD, AWS, New Relic, Python, Golang, EKS, Automation, Infrastructure as Code
- Level: mid
- Type: full_time
Glean
Glean is an innovative AI-powered knowledge management platform designed to help organizations quickly find, organize, and share information across their teams.
Palo Alto
- Skills: Site Reliability Engineering, Cloud Infrastructure, Automation, Monitoring, Docker, Kubernetes, Google Cloud Platform, AWS, Terraform, Performance Optimization
- Level: senior
- Type: full_time
Ridgeline
Ridgeline is the industry cloud platform for investment management. Founded by visionary tech entrepreneur Dave Duffield to solve operational business challeng…
San Ramon
- Skills: Site Reliability Engineering, cloud-native, FinOps, AI-assisted automation, observability infrastructure, Infrastructure-as-Code, CI/CD systems, incident triage, zero-downtime deployments, cost visibility
- Level: senior
- Type: full_time
Sunnyvale
- Skills: Site Reliability Engineering, software development, large-scale systems, distributed systems, automation, performance, scalability, team management, technical leadership, complex challenges
- Level: mid
- Type: full_time
Contextual AI
Mountain View
- Skills: Machine Learning, Infrastructure, Distributed Systems, Cloud Infrastructure, Observability, Python, Kubernetes, Terraform, CI/CD, Reliability Engineering
- Level: mid
- Type: full_time
xAI
xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motiv…
Palo Alto
- Skills: site reliability engineering, exascale storage systems, data management, Kubernetes, security measures, Rust, Go, cloud infrastructure, Terraform, AI research
- Level: mid
- Type: full_time
Celonis
Celonis, the global leader in Process Mining technology, aims to unlock productivity by placing data and intelligence at the core of business processes.
Redwood City
- Skills: Process Mining, Site Reliability Engineering, Cloud-based Applications, Kubernetes, Java, Python, AWS, Azure, GCP, SRE Principles
- Level: mid
- Type: full_time
Crusoe
Crusoe is building the World’s Favorite AI-first Cloud infrastructure company. We’re pioneering vertically integrated, purpose-built AI infrastructure solution…
Sunnyvale
- Skills: Site Reliability Engineer, Cloud Infrastructure, Distributed Storage Systems, Automation, Performance Tuning, Fault-tolerant Systems, I/O Subsystems, Kubernetes, Infrastructure as Code, AI Workloads
- Level: mid
- Type: full_time
Tenable, Inc.
Tenable® is the Exposure Management company. 44,000 organizations around the globe rely on Tenable to understand and reduce cyber risk. Our global employees su…
San Jose
- Skills: Site Reliability Engineering, Terraform, FedRAMP, AWS, Kubernetes, Docker, Agile, Cloud Infrastructure, Microservices, Security
- Level: mid
- Type: full_time
Hippocratic AI
Hippocratic AI has developed a safety-focused Large Language Model (LLM) for healthcare. The company believes that a safe LLM can dramatically improve healthca…
Palo Alto
- Skills: GCP, Kubernetes, infrastructure automation, Docker, Terraform, monitoring, Jenkins, cloud platforms, DevOps, security compliance
- Level: mid
- Type: full_time
Sustainable Talent
Sustainable Talent is partnering with Nvidia, a global leader in transforming computer graphics, PC gaming, and accelerated computing for over 25 years.
Santa Clara
- Skills: Platform Reliability, Lab Support, Cloud Infrastructure, Data Centers, DevOps, Software Validation, Unix, Windows, Networking, Automation
- Level: mid
- Type: full_time
Etched
Etched is building AI chips that are hard-coded for individual model architectures. Our first product (Sohu) only supports transformers, but has an order of ma…
San Jose
- Skills: ASIC, HPC, Infrastructure-as-Code, CI/CD, Telemetry, Prometheus, Kubernetes, Cloud, Artificial Intelligence, Observability
- Level: mid
- Type: full_time