The Global Cloud Site Reliability Engineering (SRE) team protects Deltek Cloud SaaS Products and its customers 24 hours a day, 7 days a week.
Consistently a leader in the Gartner Magic Quadrant for ERP solutions, our products have a rapidly growing customer base and, with that, an ever-evolving cloud environment.
Deltek is seeking a Senior Cloud Site Reliability Engineer to join our Product Operations Team focused on monitoring cloud service availability, incident management, and supporting day-to-day product operations of our Cloud SaaS offerings.
We are looking for customer-focused team members passionate about solving complex technical challenges and delivering a best-in-class service experience.
Troubleshoot complex problems, provide software fault diagnosis, and resolve operational issues and performance bottlenecks.
Collaborate with Global SRE, Product Delivery, Product Engineering, and Customer Care teams in delivering a true Cloud SaaS experience to our customers 24x7.
Ensure consistent service availability by monitoring our environments’ stability and performance using the right metrics and tooling.
Perform day-to-day product operations such as provisioning new customers, creating databases and schemas, restoring databases, configuring applications, managing patches, and administering systems.
Incident and Problem Management: Execute incident response plays, lead major incident bridges, and participate in the post-incident review process for incident prevention.
Develop and manage automation that reduces manual processes and tasks to realize operational efficiencies.
Drive capacity planning by monitoring system resource utilization, errors, and alert trends.
Document system architectures, systems configurations, and technical operational processes and policies.
Work within one of our 24x7 schedules (Sunday to Thursday or Tuesday to Saturday) and shifts (morning, mid, or night).
Participate in maintenance activities and on-call rotations as required.
Execute disaster recovery plans and report on metrics related to those activities.
Bachelor's degree in a Computer Science field or equivalent. Master’s degree preferred.
5+ years supporting enterprise application platforms and systems at scale on public cloud infrastructure (Amazon Web Services is desired).
5+ years of experience managing and operating enterprise-grade Windows or Linux production environments.
3+ years of experience applying an automation-first approach to problem solving, leveraging configuration management tools and scripting (e.g., Bash, Python, PowerShell).
Experience with Incident Management and ITIL service operations (ServiceNow experience desired).
Experience with any of the following operations systems: AppDynamics, Splunk, PRTG, SolarWinds DPA, Nagios, New Relic, PagerDuty.
Experience with basic database management tasks in Oracle or Microsoft SQL Server.
Passionate and curious about ways to leverage technology, with a habit of self-directed learning.
Must be detail oriented, results driven, and have excellent English communication skills.
Ability to work effectively within a team environment, inside and outside the organization, to accomplish goals and objectives and to identify and resolve problems.
Bonus Skills and Experience:
Advanced understanding of high availability and disaster recovery strategies.
Experience with configuration management and orchestration (e.g., Terraform, CloudFormation, Ansible).
Experience with continuous integration tools (e.g., GitHub, Azure DevOps).
Experience with the AWS CLI.
Hands-on experience using infrastructure-as-code, self-healing, and security automation patterns.
Understanding of the software development lifecycle (SDLC) and agile development.
Any experience with Deltek applications.