I was reading some tweets about scaling and how Laravel was a bit behind other frameworks and I remembered I have a cool scaling story to tell.
Jack Ellis already wrote a very interesting blog on how Fathom Analytics scaled Laravel, but my story is a bit different.
You're about to read how we scaled to over a hundred million jobs and peaks of 30,000 requests/minute in a timespan of only twelve hours, using nothing but Laravel, MySQL and Redis.
First, I must give some context: in 2019 I joined a pretty cool product: a SaaS that allowed companies to send SMS marketing campaigns.
The app would only handle the actual sending and charge a markup for each SMS --- the companies would be responsible for updating lists and signing with an SMS provider we supported.
The stack was Laravel, Vue and MongoDB.
At the time I joined, the app was handling maybe 1 to 2 million messages a day.
It is important to note something: sending a message wasn't so trivial. For each SMS that was sent, we'd expect at least one webhook --- the "delivery report". Any replies to an SMS (any message sent to a number handled by us, in fact) would also come as a webhook, and we'd have to internally tie that to a sent message.
The platform had two places with a lot of traffic/processing: list uploading and message sending.
This is what the DB schema looked like:
Don't ask me why there were 3 messages collections. This is going to become a problem real soon.
Uploading a list wasn't so trivial: they were really big CSVs (think 50M+ contacts) and handling them was tough. The file would be chunked and a job dispatched to upload records --- now, uploading these records was also tricky.
We needed to fetch geographical data based on the number --- to do that, we'd divide the number into 3 pieces and use that to fetch geodata from some tables in our database (or rather, collections). This was a relatively expensive process, especially because it'd happen to each uploaded contact, so for a 50M list you could expect at least 50M queries.
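To make that concrete, here's a minimal sketch of what that per-contact lookup boils down to --- the table name, column names and number format are my assumptions, not the original code:

```php
use Illuminate\Support\Facades\DB;

// Hypothetical sketch: split the number into three pieces and match each
// piece against a geodata table.
function resolveGeoData(string $phoneNumber): ?object
{
    // e.g. "14155551234" -> country "1", area "415", prefix "555"
    $countryCode = substr($phoneNumber, 0, 1);
    $areaCode    = substr($phoneNumber, 1, 3);
    $prefix      = substr($phoneNumber, 4, 3);

    // One query like this per uploaded contact --- for a 50M-contact list,
    // that's at least 50M queries.
    return DB::table('geo_data')
        ->where('country_code', $countryCode)
        ->where('area_code', $areaCode)
        ->where('prefix', $prefix)
        ->first();
}
```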
Handling campaigns was also tricky: when a campaign was paused, records would be moved from pending messages to paused messages.
When a message was sent, it'd be deleted from pending messages and moved to sent messages.
It's also important to note that when someone replied to a message, that'd, maybe, trigger a new set of messages to that number (and webhooks). So a single message could generate up to 4+ records.
You don't have to think a lot to figure out this wouldn't scale very well. We had lots of problems very quickly:
- Uploading lists would never work correctly. They were huge, and more often than not the jobs would timeout consecutively until the queue dropped them.
- Creating a contact was complex and intensive: fetching geographical data was fairly expensive and there were some other queries involved. Creating a single contact involved hitting multiple places of the app.
- When lots of campaigns started running, the system would go down because we'd get too many requests. Since it was synchronous, our server would take a while to respond, requests would pile up and then everything would explode.
- Mongo worked really well until we had a couple million records in each collection. Copying data from one collection to the other was incredibly expensive --- each one of them had unique properties and refactoring wasn't viable.
- Pushing features, fixes and improvements was very hard. There were no tests until I joined, and even then we didn't have a robust suite. Getting the queue down was the number 1 worry.
- The sending queue was actually processed by a script written in Go. It basically kept reading from pending outbounds and sending messages, but it was fairly basic --- there was no UI we could check and adding new sending providers was very problematic since that script had to be changed as well.
The app was, clearly, very poorly designed. I'm not sure why they chose to use 3 collections for sending messages, but that was a huge problem.
The company tried hiring a Mongo specialist to do some magic --- I'm no DB specialist, but I remember there was a lot of sharding and the monthly bill was almost hitting 5 digits.
That allowed us to hit ~4M sends a day, but it was still very problematic and data had to be cleaned up frequently.
At around that time it was decided I'd go on a *black ops mission* and rebuild this thing from scratch as an MVP. We didn't need many features for that (I haven't mentioned 1/3 of them --- there were a lot) --- the goal was just to validate that we'd be able to send those messages comfortably.
I didn't have much experience with microservices and DevOps, so I just decided to use what I knew and ignore the new, shiny things.
I decided to use Laravel, MySQL and Redis. That's it. No Kubernetes, no Terraform, no microservices, no autoscaling, nada.
The new DB schema looked kinda like this:
Some other business rules I didn't mention:
- During all sends, we needed to verify whether that number had received a message from that same company within 24 hours. That meant an extra query.
- We needed to check if the contact should receive the message at a given time --- SMS marketing laws only allowed contacts to receive messages within a certain timeframe, so checking the timezone was extra important (there's a small sketch of that check after this list).
- On every inbound reply, we needed to check if it had any stop words --- those were customized, so that also meant an extra query. If it did, we needed to block that number for that company --- again, one more query.
- We also needed to check for reply keywords --- those were words that'd trigger a new outbound message. Again, extra query and maybe an entire sending process.
- Every campaign was tied to an Account that had many Numbers. There was a calculation, at runtime, to determine how many messages that account could send, in a single minute, without burning the numbers or being throttled.
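As promised, here's a tiny sketch of that quiet-hours check --- the actual window and helper name are assumptions; the important part is converting "now" to the contact's timezone before deciding:

```php
use Carbon\Carbon;

// Illustrative only: the real window came from SMS marketing regulations.
function canReceiveNow(string $contactTimezone): bool
{
    $localHour = Carbon::now($contactTimezone)->hour;

    // Example window: 8am to 9pm in the contact's local time.
    return $localHour >= 8 && $localHour < 21;
}
```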
To solve those problems, I relied on two of my best friends: Queues and Redis. To handle the queues, I went with the battle-tested, easy-to-use Laravel Horizon.
The new application looked like this:
- Every message was stored in the messages table, with a foreign key pointing to the campaign it belonged to and a sent_at timestamp field that was nullable. That way it was easy (and fast, with indexes) to check pending and sent messages (there's a rough migration sketch after this list).
- Each campaign had a status column that determined what should happen: pending, canceled, paused, running and completed. Pending meant the messages were still being added into the table.
- Nothing was processed synchronously --- everything went into a queue. From webhooks to list imports to contact creation to message sending.
- When a list was imported, it was processed in batches of 10,000 --- that allowed the jobs to be processed rather quickly without us having to worry about timeouts.
- When a campaign was created, the messages were generated in batches of 10,000 --- when the last batch was generated, the campaign status would change to paused.
- Remember the geographical data stuff? That was super intensive. Imagine hundreds of millions of contacts being imported by different companies on a daily basis. That was deferred to Redis --- more often than not some numbers would share some of the records we'd use to fetch geographical data, and having those cached made things much faster.
- Message processing remained complex, but easier to handle: the entire process was based on accounts instead of campaigns since we needed to respect the max throughput of each account, and there could be several campaigns using the same one.
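Here's that rough migration sketch for the messages table --- everything beyond campaign_id and the nullable sent_at is an assumption on my part:

```php
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::create('messages', function (Blueprint $table) {
            $table->id();
            $table->foreignId('campaign_id')->constrained();
            $table->string('to_number');              // illustrative payload columns
            $table->text('body');                     // illustrative
            $table->timestamp('sent_at')->nullable(); // null = pending, set = sent
            $table->timestamps();

            // Checking pending/sent messages per campaign stays an indexed lookup.
            $table->index(['campaign_id', 'sent_at']);
        });
    }

    public function down(): void
    {
        Schema::dropIfExists('messages');
    }
};
```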
There was a job scheduled to run every minute that calculated how many messages an account could send, fetched the correct number of pending messages in a random order, and then dispatched a single job for each pending message.
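In code, that per-minute dispatcher looked roughly like this --- the class, model and method names (Account, Message, SendMessage, maxMessagesPerMinute) are illustrative, not the originals:

```php
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;

// Scheduled every minute for each account (e.g. $schedule->job(...)->everyMinute()).
class DispatchAccountMessages implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function __construct(private int $accountId) {}

    public function handle(): void
    {
        $account = Account::findOrFail($this->accountId);

        // Max throughput depends on how many numbers the account owns.
        $limit = $account->maxMessagesPerMinute();

        $pending = Message::query()
            ->whereNull('sent_at')
            ->whereRelation('campaign', 'account_id', $account->id)
            ->whereRelation('campaign', 'status', 'running')
            ->inRandomOrder()
            ->limit($limit)
            ->get();

        foreach ($pending as $message) {
            // One small job per message, carrying plain values instead of models.
            SendMessage::dispatch($message->id, $message->to_number, $message->body);
        }
    }
}
```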
Remember stop and reply keywords? That went into cache as well.
Determining whether an outbound was sent recently? Also cache.
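Roughly, those cache lookups could look like this --- keys, TTLs and the StopKeyword model are assumptions; the point is answering these questions from Redis instead of MySQL:

```php
use Illuminate\Support\Facades\Cache;

// Stop/reply keywords were customized per company, so cache them per company.
function stopKeywordsFor(int $companyId): array
{
    return Cache::remember(
        "company:{$companyId}:stop-keywords",
        now()->addHours(6), // generous TTL; invalidate when keywords change
        fn () => StopKeyword::where('company_id', $companyId)->pluck('keyword')->all()
    );
}

// "Has this number received a message from this company in the last 24 hours?"
// A key with a 24-hour TTL doubles as the answer: if it exists, it was messaged.
function markRecentlySent(int $companyId, string $number): void
{
    Cache::put("company:{$companyId}:sent:{$number}", true, now()->addDay());
}

function wasRecentlySent(int $companyId, string $number): bool
{
    return Cache::has("company:{$companyId}:sent:{$number}");
}
```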
Horizon was orchestrating a couple of queues --- one to import CSVs, another to import each contact, one to dispatch account jobs, one to send the messages, one to handle webhooks, etc.
The infrastructure piece looked like this:
I can't remember the size of each server off the top of my head, but besides MySQL and Redis, they were all pretty weak.
With that stack the app managed to send over 10 million messages and over 100 million queued jobs in a 12-hour timespan with ease.
It went over 1B records pretty quickly, and it was still smooth as butter. I don't remember the financials, but the monthly bill went from 5 digits (+ the DB consultant) to under a thousand dollars.
No autoscaling, no K8s, nothing --- just the basics and what I knew relatively well.
A couple thoughts on the overall infrastructure:
Handling indexes properly on MySQL paid off greatly --- we didn't add any unless we actually needed them.
Redis was extensively used and with generous TTLs --- it's cheaper to throw more RAM into the stack than to have the system go down. Overall, it worked pretty great, but handling cache invalidation was tricky at times.
Rewriting how messages were sent made things so much easier, since I could encapsulate each driver's unique behavior into their own class and have them follow a contract.
That meant that adding a new sending driver was just creating a new class, implementing an interface and adding a couple methods --- that'd make it show in the UI, handle webhooks and send messages.
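Something in the spirit of this contract --- the interface name, methods and DeliveryReport value object are made up for illustration:

```php
interface SmsDriver
{
    /** Human-readable name, used to show the driver in the UI. */
    public function name(): string;

    /** Send a single message through the provider's API. */
    public function send(string $from, string $to, string $body): void;

    /** Translate the provider's webhook payload into something the app understands. */
    public function parseWebhook(array $payload): DeliveryReport;
}
```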
Regarding Laravel Horizon, one important thing: the jobs needed to be dumb. Real dumb.
I'm used to passing models as arguments to jobs --- Laravel handles that extraordinarily well by serializing the model when the job is dispatched and deserializing it on the queue worker. When that happens, the record is fetched from the database again: a query is executed.
This is definitely something I did not want, so the jobs had to be as dumb as possible in what relates to the database --- all the necessary arguments were passed directly from the account handler job, so by the time it was executed it already knew all it had to.
No need to pass a List instance to a job if all it needs is the list_id --- just pass int $listId instead. 😉
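For instance, a hypothetical import job would take nothing but scalars --- the class and parameter names here are illustrative:

```php
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;

class ImportListChunk implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function __construct(
        private int $listId,    // just the id --- no model instance to re-fetch
        private int $offset,
        private int $limit = 10_000,
    ) {}

    public function handle(): void
    {
        // Only touch the database here if the job genuinely needs more data.
    }
}
```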
To wrap it up, tests made a huge difference. The old application didn't have many besides the ones I wrote and it was fairly unstable. Knowing that everything worked as I intended gave me some peace of mind.
If I were to do this today, I'd probably pick some other tools: Swoole and Laravel Octane, for sure. Maybe SingleStore for the database. But overall, I'm happy with what I picked then and it worked super well, while still leaving room for improvement and maybe switching a couple things.