The Senior Site Reliability Engineer will provide technical leadership to the Service Platform Operations Team as we configure, integrate, deploy, validate, monitor, and support services and applications on the PlayStation Network. Responsibilities include:
Provide hands-on application management and support for AWS cloud and on-prem production environments, including full-stack diagnosis, fault resolution, and root cause analysis.
Proactively monitor production systems and identify issues before they impact service.
Drive and implement monitoring tools, metrics, and reports for tracking application and service performance.
Collaborate with engineering and system teams to drive changes and ensure optimal application performance and resiliency.
Lead service and system performance analysis, service capacity planning, and service continuity validation for multiple applications.
Implement scripts and tools to automate operational tasks.
Review and influence design, architecture, standards, and methods for deploying, monitoring, and operating services and applications.
Actively participate in, and commit to, the execution of tasks required to meet milestones and deliverables set by the Scrum team throughout the release cycle.
Provide rotational on-call support.
A minimum of 5 years supporting large-scale, multi-tiered web environments running complex Java-based applications is required. 3 years working as a “lead” engineer is highly preferred. Candidates must possess the following:
BS degree in Computer Science, Engineering, or related technical discipline.
5 years of hands-on Linux experience (RHEL or CentOS preferred).
3 years of relevant work experience in a high-volume and/or critical production environment.
2 years of hands-on AWS experience deploying, supporting, and managing applications (SysOps).
Proficient in using the typical Linux toolbox of open source software and management tools.
Experience with log management tools, e.g. Splunk, Logstash, Kibana.
Exceptional scripting skills (Python, shell, Go).
Hands-on experience in troubleshooting and performance tuning of Java applications.
Solid understanding of networking systems and protocols – HTTP, TCP/IP, SSL, DNS.
Experience with automation and configuration management using Jenkins, Ansible, Puppet, Chef, or a similar tool.
Experience with Agile Scrum development methodologies and with Continuous Integration and Continuous Delivery (CI/CD).
Experience in quality control and in validating services in a production environment.