The Bullish Group has built an ecosystem focused on developing financial services for the digital assets sector through technology and investment businesses. These include:
Bullish Exchange - digital asset trading services that utilize central limit order matching and proprietary market making technology to deliver deep liquidity and tight spreads within a compliant framework. The business is licensed by the Hong Kong Securities and Futures Commission, German Federal Financial Supervisory Authority, and the Gibraltar Financial Services Commission. Since its launch in November 2021, Bullish Exchange has surpassed US$1.3 trillion in total trading volume, with 2H 2024 average daily volume exceeding US$2 billion.
Bullish Capital - an investment company which offers strategic capital, industry expertise and an extensive network of resources to support initiatives that connect conventional finance with the revolutionary possibilities of the digital economy.
CoinDesk - an award-winning media, events, indices and data business servicing the global crypto economy.
Reports to:
Vice President, Platform and Operations
Join our dynamic team as a Lead Site Reliability Engineer (SRE) and play a crucial role in elevating the reliability, scalability, and efficiency of our essential services.
What You'll Do:
As a Lead SRE, you'll be instrumental in shaping our systems' future. Your responsibilities will include:
System Reliability Leadership: Develop and execute strategies to achieve unparalleled service reliability and availability. You'll implement cutting-edge best practices, design resilient monitoring solutions, and conduct comprehensive failure injection and failover testing.
Advanced Automation: Spearhead automation initiatives to streamline complex operational tasks, enhancing efficiency and reducing manual interventions. You'll advocate for treating "operations as a software problem" throughout the organization.
Comprehensive Monitoring & Performance: Design and maintain advanced monitoring and alerting systems to assess system health, performance, and user experience. You'll conduct in-depth analysis of metrics and logs to proactively identify and resolve complex issues.
Incident Management & Prevention: Lead during critical incidents, ensuring rapid resolution and clear communication. You'll conduct thorough post-mortem analyses, implement sustainable solutions, and share insights to prevent recurrence. Expect to participate in on-call rotations as a primary escalation point.
Strategic Collaboration: Work closely with development and operations teams to embed reliability principles throughout the software development lifecycle. You'll provide expert guidance, promote SRE best practices, and foster a culture of shared ownership for system reliability.
Capacity Planning & Optimization: Monitor and analyze system capacity and performance data, forecast future demands, and lead efforts to scale infrastructure efficiently to meet growth.
Continuous Improvement & Innovation: Identify areas for systemic improvement in systems, tools, and processes. You'll lead the design and implementation of innovative solutions to enhance reliability, performance, and operational efficiency.
Mentorship & Leadership: Provide technical leadership and mentorship to SREs and other team members, fostering growth and skill development. You'll also contribute to hiring and onboarding processes for new team members.
What You'll Bring:
We're looking for a highly experienced and passionate SRE leader with:
12+ years of experience in Site Reliability Engineering, DevOps, or a related critical operations role, with a proven track record of leading significant reliability initiatives.
A Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent extensive practical experience.
Exceptional proficiency in scripting and programming languages (e.g., Python, Go, Java, Ruby, Bash) for developing advanced automation, tooling, and system integrations.
Extensive hands-on experience with major cloud platforms (e.g., AWS, Google Cloud Platform, Azure) and deep expertise in containerization technologies (Docker, Kubernetes).
Profound understanding of Linux/Unix systems internals, networking protocols, and distributed system architectures.
Expertise in designing and managing CI/CD pipelines and robust version control systems (e.g., Git), advocating for GitOps principles.
Mastery of monitoring, logging, and alerting tools (e.g., Datadog, Prometheus, Grafana, ELK stack, OpenTelemetry).
Superior problem-solving skills, critical thinking, and meticulous attention to detail, especially under pressure.
Outstanding communication, interpersonal, and collaboration skills, with the ability to influence and lead cross-functional teams.
Proven ability to thrive and lead in a fast-paced, highly dynamic, and complex technical environment.
Expert-level debugging and root cause analysis capabilities across complex distributed systems.
Bonus Points For:
Extensive experience with infrastructure as code (IaC) tools (e.g., Terraform, Ansible, Pulumi).
Deep knowledge of various database systems (relational and NoSQL) and advanced data management strategies.
Significant experience designing, implementing, and operating microservices architectures.
Contributions to open-source projects related to SRE, operations, or cloud-native technologies.
This role offers a unique opportunity to make a significant impact on our core services and directly influence our engineering culture around reliability.
Bullish is proud to be an equal opportunity employer. We are evolving fast and striving towards being a globally diverse community. With integrity at our core, our success is driven by a talented team of individuals and the different perspectives they are encouraged to bring to work every day.