Site Reliability Engineer

Company: Lawrence Berkeley National Laboratory
Location: Berkeley
Posted on: May 3, 2025

Job Description:

Lawrence Berkeley National Lab's (LBNL) NERSC Division has an opening for a Site Reliability Engineer to join the team.

The National Energy Research Scientific Computing Center (NERSC) is inviting applications for the position of Site Reliability Engineer. NERSC's mission is to accelerate scientific discovery through high performance computing and data analysis for the DOE Office of Science programs. NERSC provides critical HPC and data systems and support for NERSC's 10,000 users researching alternative energy sources, climate science, energy efficiency, environmental science and other DOE mission areas. As a Site Reliability Engineer in the Operations Group, you will be a member of a 24x7 team that uses our state-of-the-art OMNI data collection and monitoring system to help ensure that NERSC is accessible, reliable, secure, and available to our scientific users.

What You Will Do at Level 2:

Work 5 shifts per week to monitor the NERSC HPC Facility, which includes 2 - 3 OWL (midnight - 8am) shifts. Some days may be onsite, some may be offsite. The schedule will be determined by staffing needs.
Review and respond to alerts from computer systems, storage, network, and other data center/facility related systems by triaging or calling appropriate on-call staff.

Create solutions that improve processes, prevent issue recurrence, and automate the response to all routine service conditions.
Identify issues and propose solutions that will improve the ability to monitor or provide better automation for monitoring or triage.

Respond to alerts from OMNI to ensure that the system continues to collect data 24x7 to provide real time information for diagnoses.
Develop and maintain tools within the monitoring pipeline in collaboration with the Operations Team.

Create new software programs to provide alerts and notifications from the HPC system APIs and into the monitoring pipeline.
Create new software configurations and solve technical issues to enable programs to scale to more dense data or to deliver at scale reliably.
Collaborate with other groups at NERSC to ensure that communication and workflows are clearly understood. Assign technical tasks to other Operations monitoring team members to ensure that the system is being monitored according to agreed upon standards.

Work closely with other NERSC groups to coordinate center-wide maintenance activities and to manage diagnostic and notification software during maintenance.
Provide accurate information in the trouble ticketing system for outages, maintenance updates, and other incidents such that the workflow and protocols can be appropriately tracked by others.
Work on and resolve problems of diverse scope where analysis of data requires evaluation of identifiable factors.

In Addition to Above, What You Will Do at Level 3:

Provide leadership in developing OMNI monitoring and alerting pipelines for all aspects of the data center, documentation, and software development.
Contribute to the design and deployment of the OMNI cluster.
Work closely with other groups and OMNI to help build a better monitoring experience.
Work on and resolve complex issues where analysis of situations or data requires an in-depth evaluation of variable factors.
Determine methods and procedures on new assignments; may coordinate activities of other personnel.

What is Required at Level 2:

Typically requires a minimum of 5 years of related experience with a Bachelor's degree; or 3 years and a Master's degree; or equivalent work experience.
Strong hands-on knowledge of the Linux shell and working in a command-line (e.g., SSH) environment.
Experience developing tools in programming languages such as C, C++, Perl, Java, or Python, or in a scripting language, with knowledge of standard software development practices.
Knowledge of and ability to work on large data communications networks and IT infrastructure supporting highly available systems and applications.
Motivated self-starter who can learn technologies that improve data center management in areas like Kubernetes, Prometheus/VictoriaMetrics, Alertmanager, building management software, evaporative cooling, and power utilization.
Strong communication skills and ability to work effectively across multiple technical teams.
Experience working in a 24/7 onsite team managing large data centers or other large installations.
Experience with network security: configuring and maintaining ACLs, and knowledge of firewalls.
Understanding of networks and network protocols.
A certification in a system administration area (platforms or software), or other advanced education in the computing science area.

In Addition to Above, What is Required at Level 3:

Typically requires a minimum of 8 years of related experience with a Bachelor's degree; or 6 years and a Master's degree; or equivalent experience.
Expertise in a programming language such as C, C++, Perl, Java, or Python.
Demonstrated excellence in any of the tools mentioned in this listing.
Experience leading technical projects.
The ability to respond proactively to problems and issues.

Notes:

This is a full-time, career appointment, exempt (monthly paid) from overtime pay.
Shift: Owl shift 12AM to 8AM (on-site).
Level 2: The full salary range of this position is $109,152 to $184,200 per year, and the position is expected to pay within a targeted range of $122,784 to $150,096 per year, depending upon the candidate's full skills, knowledge, and abilities, including education, certifications, and years of experience.
Level 3: The full salary range of this position is $129,948 to $219,276 per year, and the position is expected to pay within a targeted range of $146,184 to $178,668 per year, depending upon the candidate's full skills, knowledge, and abilities, including education, certifications, and years of experience.
This position is subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
This position requires substantial on-site presence, but a hybrid schedule may be considered. Hybrid work is a combination of performing work on-site at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA and some telework. Individuals working a hybrid schedule must reside within 150 miles of Berkeley Lab. Work schedules are dependent on business needs. In rare cases, full-time telework or remote work modes may be considered.

Want to learn more about working at Berkeley Lab? Please visit: careers.lbl.gov

Equal Employment Opportunity Employer: The foundation of Berkeley Lab is our Stewardship Values: Team Science, Service, Trust, Innovation, and Respect; and we strive to build community with these shared values and commitments. Berkeley Lab is an Equal Opportunity and Affirmative Action Employer. We heartily welcome applications from all who could contribute to the Lab's mission of leading scientific discovery, inclusion, and professionalism. In support of our rich global community, all qualified applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, age, or protected veteran status.

Misconduct Disclosure Requirement: As a condition of employment, the finalist will be required to disclose if they are subject to any final administrative or judicial decisions within the last seven years determining that they committed any misconduct, are currently being investigated for misconduct, left a position during an investigation for alleged misconduct, or have filed an appeal with a previous employer.
