We talk with Jay Ashe from CAVA about their current and past Elixir projects and how they are deployed.
Jay Ashe - Cava
0:40 - Give us a quick overview of the Elixir projects you have in production.
CAVA is a fast-casual Mediterranean restaurant chain with 75 stores across the US. Elixir and Phoenix power CAVA’s online ordering platform (order.cava.com and the CAVA app). We’ve got a REST (and websockets) API sitting behind React and our mobile apps, and we use Phoenix templates for some of our back-of-house systems.
1:11 - Why are you using Elixir in production?
We have from the start! The application was originally implemented by Chris Bell and his team at madebymany. Chris, by the way, has a fantastic talk from ElixirConf 2016 that goes into our architecture and how we use Elixir and OTP constructs to model our business logic. Chris also occasionally talks about the CAVA project on his Elixir podcast, ElixirTalk.
Chris’ Talk - https://www.youtube.com/watch?v=fkDhU-2NWJ8
1:58 - What are some of the high level advantages / disadvantages of Elixir, from your perspective?
- Advantages: Elixir and Phoenix give you Rails-esque productivity and developer experience that scales. I think Phoenix channels are a great example of this. Build a channel with complex real-time functionality and let it scale effortlessly.
- Disadvantages: Hiring and onboarding, depending on your mindset, can be difficult. If you’re used to hiring for experience in your stack, it’s just going to be more difficult. Lately we’ve started doing one-hour weekly knowledge shares that cover Elixir basics and are closely tied to how we use them. So, here’s a test case, and here are all of the test helpers we have set up that will help you write that test. We also just sent a new Elixir dev to Lonestar Elixir.
3:59 - What do you use to host your Elixir app?
4:44 - Are you able to get zero downtime deploys?
- As close as possible! We get that out of the box with Heroku. When we deploy, Heroku won’t point traffic to the new dyno until the app is healthy. We make extensive use of Phoenix channels over websockets, and our clients reconnect automatically and transparently.
5:10 - Do you cluster the application?
5:52 - How does your Elixir App perform compared to others in your environment?
- I can’t really talk about numbers here, but Elixir is not at all our bottleneck. We don’t have other production applications to compare it against.
6:25 - How are you solving background task processing?
- Quantum for cron jobs, GenServers for everything else. We’re running a single Elixir application that handles all synchronous and asynchronous processing.
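As a rough illustration of the "GenServers for everything else" approach, here is a minimal sketch of a worker process that accepts jobs with a fire-and-forget cast, so the caller (for example, a web request) returns immediately. The `AsyncWorker` module, its function names, and the job shape are hypothetical and not taken from CAVA’s codebase.

```elixir
defmodule AsyncWorker do
  # Hypothetical worker: names and job shape are illustrative assumptions.
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, Keyword.put_new(opts, :name, __MODULE__))
  end

  # Asynchronous: returns :ok right away; the job runs in the worker process.
  def enqueue(server \\ __MODULE__, job) when is_function(job, 0) do
    GenServer.cast(server, {:run, job})
  end

  # Synchronous: how many jobs this worker has finished so far.
  def completed(server \\ __MODULE__) do
    GenServer.call(server, :completed)
  end

  @impl true
  def init(:ok), do: {:ok, 0}

  @impl true
  def handle_cast({:run, job}, count) do
    job.()
    {:noreply, count + 1}
  end

  @impl true
  def handle_call(:completed, _from, count), do: {:reply, count, count}
end
```

A caller would use something like `AsyncWorker.enqueue(fn -> send_receipt(order) end)`, where `send_receipt/1` is again a made-up name. Note that a job that raises will crash this worker, which is where supervision comes in.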
7:07 - What libraries are you using?
8:59 - 3rd Party Services (i.e. Email, Payment Processing, etc)
- SendGrid for email, Google for geocoding, Slack for some internal alerting on application health, and LevelUp for payments. https://www.thelevelup.com/
10:07 - Do you have a story where Elixir saved the day in production?
- Yes and no. So I could tell this story by explaining the issue we saw and the underlying cause at the same time, but I think it would be more fun to tell it like our team experienced it.
- One day at lunch our application started going down. Lots of 500 errors. Red lights flashing. Panic ensuing. Lunch is our busiest time of day, so 1) we thought it was load related and 2) we really needed to fix it.
- None of our traditional resources (database, CPU, memory) were constrained, and our synchronous integrations were fine.
- Our logs were littered with errors from an analytics integration that ran asynchronously on GenServers, but it didn’t seem related because we could see the same error logs at times when our application was otherwise healthy. The team that used the analytics didn’t have a pressing need for them, and we deprioritized fixing the issue because the bug we were working on was so much more important (that’s foreshadowing).
- I spent a little time looking at websockets, but I was easily able to match the load of the websocket portion of our application on my local machine with no degradation in performance (thanks, Phoenix), so that was out.
- At this point the issue was happening every day at lunch and I was getting annoyed at seeing the logs from the analytics integration when debugging, so I spent like 15 minutes finding and fixing the issue (a bad API key, basically).
- Voila, issue gone. Time to grab some lunch.
- We spent a while coming up with an explanation for this. Eventually we learned about max_restarts on a supervisor. By default, if a supervisor’s children crash more than 3 times within 5 seconds, the supervisor gives up and shuts down, so the process won’t be restarted again. So if another process (like the one handling a web request) tries to call that process that wasn’t restarted, the caller crashes, and we’d start to get 500 errors, customers couldn’t log in, mass confusion.
- So there are a few takeaways from this story: For a while, Elixir saved the day in production.
- A supervision tree prevented failures in the analytics process from affecting customers, until the rate of our failures exceeded the max_restarts limit.
- Our supervision tree needed some love though, clearly.
- Monitor your resources. CPU is a resource, but calls to another API are also a resource and can get unhealthy too.
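To make the max_restarts behavior in the story concrete, here is a minimal sketch of a supervisor with the restart limits spelled out, plus a caller-side wrapper that turns "the analytics process is gone" into an error tuple instead of a crash in the web request. The `AnalyticsWorker`, `AnalyticsSupervisor`, and `Analytics` modules are hypothetical and not CAVA’s actual code; `max_restarts: 3` and `max_seconds: 5` are simply the Supervisor defaults written explicitly.

```elixir
defmodule AnalyticsWorker do
  # Hypothetical stand-in for the asynchronous analytics integration.
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, []}

  @impl true
  def handle_call({:track, event}, _from, events), do: {:reply, :ok, [event | events]}
end

defmodule AnalyticsSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    # More than 3 restarts within 5 seconds makes this supervisor give up
    # and shut down. These are the defaults, written out for visibility.
    Supervisor.init([AnalyticsWorker],
      strategy: :one_for_one,
      max_restarts: 3,
      max_seconds: 5
    )
  end
end

defmodule Analytics do
  # Caller-side guard: if the worker is not running, GenServer.call/2 exits;
  # catching the exit keeps the calling process (e.g. a request) alive.
  def track(event) do
    GenServer.call(AnalyticsWorker, {:track, event})
  catch
    :exit, _reason -> {:error, :analytics_unavailable}
  end
end
```

Giving a flaky integration its own supervisor like this also keeps its restart budget separate from the rest of the tree, so hitting the limit takes down only the analytics branch. Whether to raise the limits or catch exits at the call site is a judgment call that depends on how critical the integration is.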
15:00 - Are you using any cool OTP features?
- GenServers, definitely. There’s a lot we can do asynchronously, especially in terms of our integrations. One process per store is a cool model that scales well and keeps issues isolated to a single store.
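One common way to get a "one process per store" layout is a `Registry` for name lookup plus a `DynamicSupervisor` that starts a GenServer per store. The sketch below assumes that approach; the episode does not say how CAVA actually implements it, and `StoreServer`, `Stores`, `StoreRegistry`, `StoreSupervisor`, and the order shape are all hypothetical names.

```elixir
defmodule StoreServer do
  # Hypothetical per-store process holding that store's orders.
  use GenServer

  def start_link(store_id) do
    GenServer.start_link(__MODULE__, store_id, name: via(store_id))
  end

  def place_order(store_id, order) do
    GenServer.call(via(store_id), {:place_order, order})
  end

  # Look up the process for a store by its id through the Registry.
  defp via(store_id), do: {:via, Registry, {StoreRegistry, store_id}}

  @impl true
  def init(store_id), do: {:ok, %{store_id: store_id, orders: []}}

  @impl true
  def handle_call({:place_order, order}, _from, state) do
    orders = [order | state.orders]
    {:reply, {:ok, length(orders)}, %{state | orders: orders}}
  end
end

defmodule Stores do
  def start_link do
    children = [
      {Registry, keys: :unique, name: StoreRegistry},
      {DynamicSupervisor, name: StoreSupervisor, strategy: :one_for_one}
    ]

    Supervisor.start_link(children, strategy: :one_for_all)
  end

  # Start a supervised process for one store.
  def open(store_id) do
    DynamicSupervisor.start_child(StoreSupervisor, {StoreServer, store_id})
  end
end
```

With this shape, a crash in one store’s process is restarted on its own and does not touch the state of any other store.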
15:50 - If you could give one tip to developers out there who are or may soon be running Elixir in production, what would it be?
- If you’re on a small team, Heroku or a similar provider might give you a lot of value in terms of infrastructure you can set up and forget.
Learn more about how SmartLogic uses Phoenix and Elixir.
Special Guest: Jay Ashe.