
Site Reliability Engineer (SRE)

New York, NY 10007


As a Site Reliability Engineer (SRE) on the Big Data Operations (BDO) Team, you will be responsible for building, operating and supporting our heterogeneous Data Systems Platform in the Technical Operations group. The Data Systems Platform consists of large Hadoop, HBase and Kafka installations, several messaging platforms, and real-time data platforms. The platform currently ingests 200TB of new data and performs 20,000 ETL jobs every day across 5 Hadoop, 4 HBase and 6 Vertica clusters.

About the Team:
The Technical Operations (TechOps) Team is distributed across the globe and handles a wide variety of responsibilities, from providing tech support to architecting long-range build-out and day-to-day operations at our six global data centers. We have well over 7,000 servers, which process over 1 million Ad Serving Requests per second (billions per day). We are in search of troubleshooters and those who love to tinker and innovate with technology.

About the Job:
* Monitor, maintain and provision components of the Data Systems Platform
* Perform software upgrades on the components of the Data Systems Platform
* Work with the Data Engineering team to help design and implement the next iteration of scaling, and evaluate open source and commercial software and hardware solutions
* Work closely with the systems performance, systems operations, and network engineering teams as needed to ensure high performance and availability
* Develop and/or implement tools to automate aspects of supporting, maintaining and building the Data Systems Platform, including upgrades where appropriate
* Participate in prototyping and proof-of-concept system development and bench-marking
* Support, maintain and build storage restructuring
* Participate in on-call rotation responding to alerts and systems issues
* Manage user access and resource allocations for the Data Systems Platform


Required skills and experience:
* 5+ years of relevant experience in implementing, troubleshooting, and supporting the Unix/Linux operating system with concrete knowledge of system administration/internals
* 5+ years of relevant experience in scripting/writing/modifying code for monitoring/deployment/automation in one of the following (or comparable): Python, Shell, Go, Perl, Java, C
* 3+ years of relevant experience with all of the following technologies: Hadoop-HDFS, YARN-MapReduce, HBase, Kafka
* 3+ years of relevant experience with Puppet, Chef, Ansible or equivalent configuration management tool
* 2+ years of relevant experience with TCP/IP networking (DNS, DHCP, HTTP, etc.)
* Strong written and oral communication skills with the ability to interface with technical and non-technical stakeholders at various levels of the organization

Beneficial skills and experience (if you don’t have all of them, you can learn them at Xandr):
* Experience with JVM and GC tuning is a plus
* Regular expression fluency
* Experience with Nagios or similar monitoring tools
* Experience with data collection/graphing tools like Cacti, Ganglia, Graphite and Grafana
* Experience with tcpdump, ethereal, tshark and other packet capture and analysis tools

More About You:
* You are passionate about a culture of learning and teaching. You love challenging yourself to constantly improve, and sharing your knowledge to empower others
* You like to take risks when looking for novel solutions to complex problems. If faced with roadblocks, you continue to reach higher to make greatness happen
* You care about solving big, systemic problems. You look beyond the surface to understand root causes so that you can build long-term solutions for the whole ecosystem
* You believe in not only serving customers, but also empowering them by providing knowledge and tools


Posted: 2020-03-30 Expires: 2020-06-24
