Sr Site Reliability Engineer (Remote)

Dallas, TX 75219 Work Remotely
The Resiliency Lead Architect will be responsible for partnering with the various Consumer Technology Platform (CTP), Chief Technology Information Office (CTIO), Operations/Infrastructure, and network teams to implement a comprehensive resiliency engineering framework. The architect will be responsible for planning, designing, and rolling out proactive resiliency practices that protect customer journeys from disruption and avoid re-engineering costs through the early detection of existing and emerging resiliency threats. The successful candidate will be a strong technologist who is flexible, resilient, and an innovative thinker, as well as a natural collaborator with solution architects, software engineers, developers, and senior management from across the organization. The Resiliency Architect is expected to lead through influence and to communicate effectively through clarity of thought and a demonstrated understanding of business and technical requirements. In addition, the candidate must possess strong technical leadership skills and demonstrated success in working with teams, particularly in a matrix organization.


**_Key Responsibilities:_**


+ Design and roll out a robust impact assessment framework that validates the impact of changes on the performance of individual applications as well as the consumer technology ecosystem

+ Design, develop and implement chaos engineering practices for the consumer technology ecosystem

+ Work with performance architects to design performance tests based on customer journeys that will be used to validate performance and resiliency of the consumer technology ecosystem

+ Collaborate with operations and application engineering teams to design and execute production game day scenarios that will help enhance emergency response processes

+ Provide key SME leadership within Consumer Quality Engineering (CQE) team on resiliency programs and initiatives

+ Work closely with LOB Security architects and GTI infrastructure technologists to develop remediation solutions, where appropriate

+ Ensure all implemented resiliency solutions have validation plans in place including continuous improvement plans

+ Define and implement post-mortem / root-cause analysis processes, and develop improved testing scenarios based on that analysis

+ Develop requirements to enhance observability of performance visuals, implement telemetry controls, and consult on self-healing capabilities for identified/prioritized failure scenarios

+ Design self-healing and resiliency patterns


+ Experience with a development technology stack and programming tools such as Docker, Python, Django, Celery, and Postgres is a must


+ 10+ years of strong hands-on experience and technical depth in one or more technology areas, including software engineering, solution architecture, production operations, distributed technologies, performance engineering, resiliency/chaos engineering, or cloud-based ecosystems.

+ Experience with microservice architecture and containerization technologies like Docker and Kubernetes.

+ Working knowledge of infrastructure components (e.g. routers, load balancers, cloud products, container systems, compute, storage, and networks).

+ Knowledge of application architecture concepts, including topology, protocols, components, and principles, would be an advantage

+ Some programming experience in one or more languages (scripting/functional/imperative -- C/C++, Java, Python, Scala, R, SQL, etc.) would be an advantage

+ Proven leader with successful track record architecting and rolling out technology transformation initiatives

+ Strength in both business and technical requirements analysis

+ Strong written and verbal communication skills

+ Ability to think strategically about how to create firm-wide solutions to business requirements, and to communicate effectively with both business and technical audiences

+ Ability to orchestrate and drive complex strategies and solutions

+ Proven ability to build strong, cohesive partnerships with the business, operations, technology & other key stakeholders, including external vendor partners, and work effectively in a matrix organization.

+ Superior analytical and problem-solving skills

+ Working knowledge of the following technologies: Kubernetes containers, CI/CD, Jenkins, and chaos testing

+ Fault domain analysis experience for both core infrastructure services and modern micro-segmented application designs

+ Subject matter expert in business/service continuity, availability, disaster recovery and/or similar topics

We expect employees to be honest, trustworthy, and operate with integrity. Discrimination and all unlawful harassment (including sexual harassment) in employment are not tolerated. We encourage success based on our individual merits and abilities without regard to race, color, religion, national origin, gender, sexual orientation, gender identity, age, disability, marital status, citizenship status, military status, protected veteran status or employment status.

Posted: 2021-01-25 Expires: 2021-04-22

AT&T