Dick’s Sporting Goods is looking for Reliability Engineers (RE) with a passion for system reliability to join our Reliability Engineering organization. As part of this engineering team, you will build reliability into our systems, infrastructure, and applications.
Our goals are ambitious and results-focused. They include user-facing applications, observability, production excellence, reliability, error elimination, efficiency, and automation of manual and repetitive tasks. The RE role at Dick’s Sporting Goods (DSG) provides an opportunity to blend system design and software engineering skills with a passion for troubleshooting and defect elimination to address ever-changing applications and environments with scalability and reliability challenges. This is an opportunity for you to join us on this journey and have a real impact on how we support our customers and build software.
The RE will work with other Reliability Engineers, Product Managers, and Developers to deliver and maintain the highest levels of availability and reliability for all our customer-facing websites, third-party interfaces, and legacy application services. The RE is expected to work with management, peers, and customers to define and implement the technical vision and to improve monitoring tools, error detection, and defect elimination, while improving Mean Time to Detection/Resolution, overall service availability, and customer satisfaction.
Troubleshoot high-severity performance and availability issues in e-commerce, infrastructure, and legacy business applications/websites, and manage the incident lifecycle through to resolution.
Lead root cause analyses and investigations by identifying, analyzing, and remediating service performance and availability issues to ensure maximum service uptime. Conduct blameless post-incident reviews.
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health. You're expected to be on call, have strong written communication skills, and be able to develop working relationships with coworkers.
Experience in balancing service reliability, metrics, sustainability, technical debt, and operational toil for live services running at scale.
Work across multiple project teams simultaneously to support rapid development efforts.
Solve complex, business-critical issues that impact bottom-line financial numbers and customer loyalty/experience.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Contribute positively to open source projects developed by DSG and join existing communities. Navigate this broader ecosystem and structure projects with upstream/downstream opportunities in mind.
Identify and integrate with third-party solutions where it makes the most sense.
Use data to understand the availability, reliability, and sustainability of our software.
Bring experience, pragmatism, empathy, and composure to interactions with teams outside of the RE organization.
Work frequently with Product teams on shared goals and cross-team projects.
Balance planned and reactive work using basic project planning techniques and technical roadmaps.
Work and collaborate across teams such as Application Services, Capacity Planning, Hardware, Network, and Datacenter Operations.
Participate in building advanced tooling for testing, monitoring, administration, and operations of multiple clusters across multiple environments.
Experience negotiating SLIs, SLOs, and SLAs with product owners.
Our teammates know that there is an athlete behind every in-store and eCommerce transaction. We go beyond the expected to build technology that makes the DICK’S Sporting Goods’ experience innovative and hassle-free.
HAVE A PASSION FOR SPORTS.
We believe that sports make people better and we’re determined to be the best sports company in the world. Whether you’re an athlete or sports enthusiast, we bring our passion for the game into everything we do.
GET BETTER EVERY DAY.
The journey is never over. We know that to be the best, we must get a little better each day. We focus on delivering 1% more in everything we do.
What we’re looking for
3-5+ years of applying reliability engineering principles to distributed services.
Understanding of and comfort with the GNU/Linux operating system.
Proficiency in high-level languages such as Ruby, Python, and Bash.
Exposure to system-level languages such as Go, C/C++.
Familiarity with configuration management software such as Puppet, Chef, Ansible, or Salt.
Networking basics: TCP vs UDP, basic troubleshooting, HTTP – load balancing, firewall, private networks, multi-tier design, scale-out, persistent data
Databases – at a minimum, an understanding of the basics (select/insert)
Familiarity with standard infrastructure concepts like load balancers, firewalls, object storage and where/when they might be used.
Service Management – Incident Response, Change, and Problem Management.
Experience with Kubernetes and Docker.
Cloud computing concepts (not necessarily provider specific) – VMs vs Docker Containers, block storage vs object storage, infra automation vs install automation.
Experience operating a platform, software as a service, or shipping software.
Experience as an open-source contributor.
Intellectual curiosity, problem solving, and openness are key to success in this role. Has a mindset for solving production system issues, doing the “detective work” to understand root causes, and automating away toil; doesn't like boring, repetitive tasks. Enjoys digging into new problems.
Knows when to ask for help and when to dig more on their own
Can work on different tasks in different systems week to week
Capable of driving toward results even when given an ill-defined problem, such as "this is slow", by developing metrics and making measurable improvements