Site Reliability Engineer - Monitoring
We believe that bringing people together from different backgrounds, experiences and perspectives makes for a healthy workplace, a more successful business and a better world. We value diversity and encourage everyone to come and soundtrack the world with us.
At Epidemic Sound, we are reinventing the music industry. Our carefully curated catalog, with over 40 000 tracks and 90 000 sound effects, is tailored for storytellers, streaming services, and in-store soundtracks. Countless clients around the world, from broadcasters and production companies to DSPs and YouTubers, rely on our tracks to help them tell their stories. Epidemic Sound’s music is heard in hundreds of millions of online videos daily, across millions of playlist streams, and in thousands of in-store locations. Headquartered in Stockholm, we’re spread across offices in New York City, Los Angeles, Seoul, Hamburg, and Amsterdam. We’re growing fast, have lots of fun, and are taking the music industry with us.
We are now looking for a Site Reliability Engineer with a strong focus on monitoring to join our dynamic SRE team. In this role, you will help drive best practices in monitoring and observability, help implement SLIs, SLOs, and error budgets, and help product teams measure reliability.
How you will make an impact
Enhance our monitoring capabilities using tools like Thanos, Prometheus and OpenTelemetry.
Implement SLIs, SLOs, and tracing to optimize system performance and reliability.
Collaborate closely with product development teams to ensure observability, resilience, and performance needs are met when building new features and services.
Coach engineering teams in improving their monitoring strategy and best practices.
Embrace teamwork through practices like code reviews, pair programming, and mob programming.
Engage in continuous learning through hack-days, courses, conferences, and tech-talks, and share your knowledge with your colleagues.
We believe that to succeed in this role, you will need:
Strong understanding of SRE as an engineering practice.
Experience with monitoring tools such as Prometheus (and a Prometheus HA layer) and with tracing, plus a deep understanding of monitoring best practices.
Solid understanding of modern web architectures, system design, and software engineering principles, with the ability to apply them in designing scalable and robust solutions.
Proficiency in implementing SLIs and SLOs in a production environment.
Strong programming skills in at least one language (we use Go and Python).
Experience mentoring and supporting colleagues and engineering teams.
Demonstrated ability to troubleshoot distributed systems and drive operational excellence, including producing architectural diagrams and writing best practices, standards, and operating procedures.
Experience working with Kubernetes.
It would also be music to our ears if you have experience with:
Google Cloud Platform.
Service mesh, ideally eBPF-based Cilium.
Curious to learn more about who we are and what we do? Check out our brand new "About us" page → https://www.epidemicsound.com/about-us/
Do you want to be a part of our fantastic team? Please apply, in English, by clicking the link below.
- Location: HQ (Stockholm)
- Remote status: Hybrid Remote
About Epidemic Sound
Epidemic Sound, the market-leading platform for restriction-free music, is headquartered in Stockholm, heard around the globe and on a mission to soundtrack the world.
The company has democratized access to music for storytellers. Its innovative digital rights model paves the way for creators (everyone from YouTubers to small businesses to the world’s largest brands) to use restriction-free music to take their content to the next level, whilst simultaneously supporting the musicians it works with both financially and creatively.
The company was co-founded in 2009, operates across the globe, and has offices in five major cities: Stockholm, New York, Los Angeles, Seoul and Amsterdam. Epidemic Sound is backed by EQT, Blackstone, Creandum, Atwater Capital, Alecta, AMF, and TIN Fonder, and its Chairperson is Andrew Fisher, former CEO and Chairman of Shazam.