Site Reliability Engineer
Company: Lawrence Berkeley National Laboratory
Location: Berkeley
Posted on: May 3, 2025
Job Description:
Lawrence Berkeley National Lab's (LBNL) NERSC Division has an
opening for a Site Reliability Engineer to join the team.
The National Energy Research Scientific Computing Center (NERSC) is
inviting applications for the position of Site Reliability
Engineer. NERSC's mission is to accelerate scientific discovery
through high performance computing and data analysis for the DOE
Office of Science programs. NERSC provides critical HPC and data
systems and support for NERSC's 10,000 users researching
alternative energy sources, climate science, energy efficiency,
environmental science and other DOE mission areas. As a Site
Reliability Engineer in the Operations Group, you will be a member
of a 24x7 team that uses our state-of-the-art OMNI data collection
and monitoring system to help ensure that NERSC is accessible,
reliable, secure and available to our scientific users.
What You Will Do at Level 2:
Work 5 shifts per week monitoring the NERSC HPC Facility,
including 2 to 3 OWL (midnight to 8am) shifts. Some days may be
onsite, some may be offsite. The schedule will be determined by
staffing needs.
Review and respond to alerts from computer systems, storage,
network, and other data center/facility related systems by triaging
or calling appropriate on-call staff.
Create solutions that improve processes, prevent issue recurrence,
and automate the response to all routine service conditions.
Identify issues and propose solutions that will improve the ability
to monitor or provide better automation for monitoring or
triage.
Respond to alerts from OMNI to ensure that the system continues to
collect data 24x7 and provide real-time information for
diagnosis.
Develop and maintain tools within the monitoring pipeline in
collaboration with the Operations Team.
Create new software programs that feed alerts and notifications
from the HPC system APIs into the monitoring pipeline.
Create new software configurations and solve technical issues to
enable programs to handle denser data and to deliver reliably at
scale.
Collaborate with other groups at NERSC to ensure that communication
and workflows are clearly understood. Assign technical tasks to
other Operations monitoring team members to ensure that the system
is being monitored according to agreed upon standards.
Work closely with other NERSC groups to coordinate center-wide
maintenance activities and to manage diagnostic and notification
software during maintenance.
Provide accurate information in the trouble ticketing system for
outages, maintenance updates, and other incidents such that the
workflow and protocols can be appropriately tracked by others.
Work on and resolve problems of diverse scope where analysis of
data requires evaluation of identifiable factors.
In Addition to Above, What You Will Do at Level 3:
Provide leadership in developing OMNI monitoring and alerting
pipelines for all aspects of the data center, documentation, and
software development.
Contribute to the design and deployment of the OMNI cluster.
Work closely with other groups and OMNI to help build a better
monitoring experience.
Work on and resolve complex issues where analysis of situations or
data requires an in-depth evaluation of variable factors.
Determine methods and procedures on new assignments and, as needed,
coordinate activities of other personnel.
What is Required at Level 2:
Typically requires a minimum of 5 years of related experience with
a Bachelor's degree; or 3 years and a Master's degree; or
equivalent work experience.
Strong hands-on knowledge of the Linux shell and working in a
command-line (e.g., SSH) environment.
Experience developing tools in programming languages such as C,
C++, Perl, Java, or Python, or in a scripting language, with
knowledge of standard software development practices.
Knowledge of and ability to work on large data communications
networks and IT infrastructure supporting highly available systems
and applications.
Motivated self-starter who can learn technologies that improve
data center management in areas like Kubernetes,
Prometheus/VictoriaMetrics, Alertmanager, building management
software, evaporative cooling, and power utilization.
Strong communication skills and ability to work effectively across
multiple technical teams.
Experience working in a 24/7 onsite team managing large data
centers or other large installations.
Experience with network security: configuring/maintaining ACLs and
knowledge of firewalls.
Understanding of networks and network protocols.
A certification in system administration (platforms or software),
or other advanced education in computer science.
In Addition to Above, What is Required at Level 3:
Typically requires a minimum of 8 years of related experience with
a Bachelor's degree; or 6 years and a Master's degree; or
equivalent experience.
Expertise in a programming language such as C, C++, Perl, Java, or
Python.
Demonstrated excellence in any of the tools mentioned in this
listing.
Experience leading technical projects.
The ability to respond proactively to problems and issues.
Notes:
This is a full-time, career appointment, exempt (monthly paid) from
overtime pay.
Shift: Owl shift 12AM to 8AM (on-site).
Level 2: The full salary range of this position is $109,152 to
$184,200 per year, and the position is expected to pay within a
targeted range of $122,784 to $150,096 per year depending upon candidates'
full skills, knowledge, and abilities, including education,
certifications, and years of experience.
Level 3: The full salary range of this position is $129,948 to
$219,276 per year, and the position is expected to pay within a
targeted range of $146,184 to $178,668 per year depending upon candidates'
full skills, knowledge, and abilities, including education,
certifications, and years of experience.
This position is subject to a background check. Any convictions
will be evaluated to determine if they directly relate to the
responsibilities and requirements of the position. Having a
conviction history will not automatically disqualify an applicant
from being considered for employment.
This position requires substantial on-site presence, but a hybrid
schedule may be considered. Hybrid work is a
combination of performing work on-site at Lawrence Berkeley
National Lab, 1 Cyclotron Road, Berkeley, CA and some telework.
Individuals working a hybrid schedule must reside within 150 miles
of Berkeley Lab. Work schedules are dependent on business needs. In
rare cases, full-time telework or remote work modes may be
considered.
Want to learn more about working at Berkeley Lab? Please visit:
careers.lbl.gov
Equal Employment Opportunity Employer: The foundation of Berkeley
Lab is our Stewardship Values: Team Science, Service, Trust,
Innovation, and Respect; and we strive to build community with
these shared values and commitments. Berkeley Lab is an Equal
Opportunity and Affirmative Action Employer. We heartily welcome
applications from all who could contribute to the Lab's mission of
leading scientific discovery, inclusion, and professionalism. In
support of our rich global community, all qualified applicants will
be considered for employment without regard to race, color,
religion, sex, sexual orientation, gender identity, national
origin, disability, age, or protected veteran status.
Misconduct Disclosure Requirement: As a condition of employment,
the finalist will be required to disclose if they are subject to
any final administrative or judicial decisions within the last
seven years determining that they committed any misconduct, are
currently being investigated for misconduct, left a position during
an investigation for alleged misconduct, or have filed an appeal
with a previous employer.