Site Reliability Engineer (SRE) - Big Data

New York, New York

Description

As a Site Reliability Engineer (SRE) on the Big Data Operations (BDO) Team, you will be responsible for building, operating and supporting our heterogeneous Data Systems Platform in the Technical Operations group. The Data Systems Platform consists of large Hadoop, HBase and Kafka installations, several messaging platforms, and real-time data platforms. The platform currently ingests 200 TB of new data and runs 20,000 ETL jobs every day across 5 Hadoop, 4 HBase and 6 Vertica clusters.


About the Team:


The Technical Operations (TechOps) Team is distributed across the globe and handles a wide variety of responsibilities, from providing tech support to architecting long-range build-out and day-to-day operations at our six global data centers. We have well over 7,000 servers, which process over 1 million Ad Serving Requests per second (billions per day). We are in search of troubleshooters and those who love to tinker and innovate with technology.


About the Job:


• Monitor, maintain and provision components of the Data Systems Platform

• Perform software upgrades on the components of the Data Systems Platform

• Work with the Data Engineering team to help design and implement the next iteration of scaling, and evaluate open-source and commercial software and hardware solutions

• Work closely with the systems performance, systems operations, and network engineering teams as needed to ensure high performance and availability

• Develop and/or implement tools to automate aspects of supporting, maintaining and building the Data Systems Platform, including upgrades where appropriate

• Participate in prototyping and proof-of-concept system development and benchmarking

• Support, maintain and build storage restructuring

• Participate in the on-call rotation, responding to alerts and systems issues

• Manage user access and resource allocations for the Data Systems Platform


Qualifications

• 5+ years of relevant experience in implementing, troubleshooting, and supporting the Unix/Linux operating system with concrete knowledge of system administration/internals

• 5+ years of relevant experience in scripting/writing/modifying code for monitoring/deployment/automation in one of the following (or comparable): Python, Shell, Go, Perl, Java, C

• 3+ years of relevant experience with all of the following technologies: Hadoop-HDFS, YARN-MapReduce, HBase, Kafka

• 3+ years of relevant experience with Puppet, Chef, Ansible or equivalent configuration management tool

• 2+ years of relevant experience with TCP/IP networking (DNS, DHCP, HTTP, etc.)


Beneficial skills and experience (if you don’t have all of them, you can learn them at Xandr):


• Experience with JVM and GC tuning is a plus

• Regular expression fluency

• Experience with Nagios or similar monitoring tools

• Experience with data collection/graphing tools like Cacti, Ganglia, Graphite and Grafana

• Experience with tcpdump, ethereal, tshark and other packet capture and analysis tools


More About You:


• You are passionate about a culture of learning and teaching. You love challenging yourself to constantly improve, and sharing your knowledge to empower others

• You like to take risks when looking for novel solutions to complex problems. If faced with roadblocks, you continue to reach higher to make greatness happen

• You care about solving big, systemic problems. You look beyond the surface to understand root causes so that you can build long-term solutions for the whole ecosystem

• You believe in not only serving customers, but also empowering them by providing knowledge and tools

Job ID: 1931316
Date posted: 06/18/2019

#XandrLife

#XandrLife means we’re creating an incredible experience for our people, too. Let our employees show you what it’s really like to work here.
