The Global Cloud Site Reliability Engineering (SRE) team keeps Deltek Cloud Products working well and available to our customers 24 hours a day, 7 days a week.
A consistent leader in the Gartner Magic Quadrant for ERP solutions, Deltek Cloud Products have a rapidly growing customer base and, with that, an ever-evolving cloud environment.
Our goal is to constantly improve the availability, reliability, and daily operations of our Deltek Cloud Products, and we are seeking Senior Cloud Site Reliability Engineers to join our Product Operations team.
As an SRE for Product Operations, you should have a strong passion for creating solutions that address complex and recurring issues,
put a strong focus on customer success, and aim to deliver a true SaaS experience.
Troubleshoot complex problems, provide software fault diagnosis, and resolve operational issues and performance bottlenecks.
Collaborate with Global SRE, Product Delivery, Product Engineering, and Customer Care teams in delivering a true SaaS experience to our customers 24x7
Provide consistent service availability by monitoring environment stability and performance through the use of the right metrics and tools
Perform day-to-day product operations such as provisioning new customers, creating databases and schemas, restoring databases, configuring applications, patch management, and systems administration.
Develop and manage automation that reduces resource-consuming tasks and manual processes
Drive problem management and capacity planning initiatives by monitoring alert trending and resource utilization
Document system architectures, systems configurations, and technical operational processes and policies.
Work within one of our 24x7 schedules (Sunday to Thursday or Tuesday to Saturday) and shifts (morning, mid, or night).
Participate in maintenance activities and on-call rotations as required.
Execute disaster recovery plans and report on metrics related to those activities.
Bachelor's degree in a Computer Science field or equivalent.
5+ years supporting enterprise application platforms and systems at scale on public cloud infrastructure (Amazon Web Services is desired).
5+ years of experience with managing and operating enterprise-grade Windows or Linux production environments
3+ years of experience applying an automation-first approach to problem solving, leveraging scripting solutions (e.g., Bash, Python, PowerShell).
Experience with Incident Management and ITIL service operations (ServiceNow experience desired)
Experience with any of the following operations systems: AppDynamics, Splunk, PRTG, SolarWinds DPA, Nagios, New Relic, PagerDuty.
Experience with basic database management tasks in Oracle or Microsoft SQL Server.
Passionate and curious about ways to leverage technology with self-led learning
Must be detail oriented, results driven, and have excellent English communication skills.
Ability to work effectively within a team environment, both inside and outside the organization, to accomplish goals and objectives and to identify and resolve problems.
Bonus Skills and Experience:
Advanced understanding of high availability and disaster recovery strategies.
Experience with configuration management and orchestration (e.g., Terraform, CloudFormation, Ansible).
Experience with continuous integration tools (e.g., GitHub, Azure DevOps)
Experience with the AWS CLI.
Hands-on experience using infrastructure-as-code, self-healing, and security automation patterns.
Understanding of software development lifecycle (SDLC) and agile development.
Any experience with Deltek applications.