TechOps Support Engineer
4 months ago (10/08/2018 10:45)
Amazon Data Services Ireland Limited
Amazon Web Services’ Technical Operations team (TechOps) is Amazon’s central defense against large-scale incidents and drives operational excellence across all Amazon businesses. Our key offering to Amazon is best-in-class Incident Management. Our engineers are front and center in driving down event duration through experience in operational excellence, best current practices and incident management tools.
We’re looking for engineers who have owned or participated in operational and incident management for at least one large-scale enterprise. You should have a passion for working with new technologies and not be afraid to exercise your creativity in pushing the boundaries of existing technologies. Running incident management for AWS is unique in that AWS supports more than 30% of the internet’s businesses, and the ability to identify and mitigate issues is the most important responsibility of every Amazon employee. Because of our unique role, you will have broad exposure to all things Amazon. TechOps engineers are encouraged to build solutions to problems and to share the benefit of those solutions with other AWS service teams.
This is an excellent opportunity to join one of Amazon’s world-class teams of engineers and work with some of the best and brightest, while developing your skills and career within one of the most dynamic, innovative and progressive technology companies anywhere. In addition to a stimulating and fun working environment, Amazon offers mentoring programs with experienced engineers, regular tech talks with technology Principals, and well-defined career paths for motivated engineers who want to contribute to our culture of operational excellence and customer-focused technical innovation.
Responsibilities
• Provide critical support, incident response, and management to internal customers across all of Amazon, including management of communications and coordination of service owners via conference calls
• Be a technology evangelist and use your deep knowledge to solve business problems
• Reduce mean time to resolution for all incident types
• Update and/or build world-class listening systems
• Participate in Agile sprints to evolve business processes and technologies
• Get there first; be the first to detect and diagnose high-severity service-impacting events
• Identify and troubleshoot recurring platform issues and engage service owners to assist with resolution
• Automate tasks through creation and maintenance of scripts and tools
• Respond to and complete customer requests within SLA via a trouble ticketing system
• Take part in a “follow the sun” rotation split between Seattle, Dublin and Sydney sites, including weekends and holidays
• Create and review documentation, design new standard operating procedures
• Mentor peers in your areas of technical and operational strength
• Participate in the interviewing process
If this sounds like the right challenge for you, then please apply today!
• 3 years’ experience in a large-scale software development environment
• Proficiency in Java, C/C++/C# or another high-level programming language
• Experience with distributed operational health and performance monitoring systems
• Manage directly assigned tasks and on-call duties gracefully
• Ability to work in a diverse team environment
• Experience specifying, designing, and/or implementing system health and performance monitoring tools
• Experience designing and/or implementing automated software testing, deployment and performance analysis systems
• Experience conducting failure mode analysis in complex distributed systems
• Experience conducting efficiency and duplication analysis across large organizations
• Experience reviewing and refining design and architecture documents presented by partner teams for operational readiness, fault tolerance and scalability
Required:
• A degree in Computer Science or at least two years relevant experience in a large-scale online technical operations environment
• Excellent English language written and verbal communication skills to facilitate efficient and effective interaction with peers and customers
• Confidence to initiate, drive, and manage company-wide conference calls
• Effective organizational skills to maintain a consistently high standard of operations in a busy environment
• Knowledge of the Linux operating system and good understanding of networking concepts
• Excellent troubleshooting skills and a commitment to document findings
Highly desirable:
• Knowledge of best current practice frameworks (ITIL, COBIT), particularly incident, problem and change management
• Development/scripting skills in at least one interpreted language (e.g. Perl/Python/Ruby) as well as shell. Working knowledge of a compiled language is a plus
• Understanding of routing protocols to help facilitate troubleshooting and remediation of networking issues
• Experience in Agile/Scrum or related collaborative workflow