So, you want to be able to deliver 100 million notifications a month, a completely reasonable goal for many SaaS and app developers. To put that number in perspective: averaged over a 30-day month, that works out to roughly 139,000 notifications dispatched every single hour, around the clock, across 3 or more communication methods. This does not scale very easily by default - but it can scale.

We faced a challenge last year: being able to send hundreds of thousands of messages a day. We send messages via SMS, email, in-browser messages, and even fax - which is a headache on its own. But what if you had no option other than to make your existing infrastructure scale? That was the challenge in front of me. Here's how I overcame it.

Goals

We want to be able to:

  1. Send notifications via SMS, fax, email, and in-browser, in near real-time
  2. Quickly let the caller know whether the message was sent or failed, and retry on failure

Designing the system

First, our existing design (Python, with a mixture of PHP to validate requests) just isn't going to cut it anymore; it doesn't scale. It sat on top of a PostgreSQL database, with a lot of bad practices in the code. We quickly determined we needed to scrap the project and rewrite it from scratch to meet the new demand.

Evaluating Language Options

We love PHP, we love Python - they work beautifully. I didn't want to throw away a good-enough language just for the sake of it, but I reviewed what was available to us all the same.

GoLang

Off the bat, Go seemed like a perfect language to use. It's compiled, it's got great syntax, it's easy to maintain, and we could automate deploying more nodes very fast if all we had to do was download the binary and execute it in a container. Let's Go with Go.
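To make that deploy step concrete, it can be as small as a two-stage container build. This Dockerfile is a generic sketch (the binary name and build path are placeholders), not our actual pipeline:

# Stage 1: compile a static binary (Go 1.12 was current as of this writing)
FROM golang:1.12 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /notifyd .

# Stage 2: ship only the binary
FROM scratch
COPY --from=build /notifyd /notifyd
ENTRYPOINT ["/notifyd"]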

Python

I love you, Python, I really do. It's not personal, but after long consideration, Python will be used only to monitor the infrastructure and push retry attempts.

PHP

I decided to keep PHP in: it powers the administrative frontend, where we see all messages queued, failed, and incoming. It runs the Laravel framework with a ton of middleware, and connects right into the queue.

Go, Kafka!

We knew this had to scale, so I wanted to pick something that seemed (from our testing) to scale well enough. Apache Kafka was the queue/broker we picked. Distributed nodes, spinning up more queues on demand, and reasonable oversight are great selling points.

Designing how the system will interact

First of all, when our SaaS sends the notification server a request, it needs to be very explicit about what it wants us to do. For example, if we have a Chat Message for a user in the browser, our payload would look similar to this:

{
  "type": "CM_BROWSER",
  "user": {
      "uid": 76143021,
      "mq-uuid-btag": "A36701-BCDEF-779821-C31024-BA",
      "name": {
          "first": "Bob",
          "middle: "Lee",
          "last": "Smith",
          "expr": "he/him"
      }
  },
  "requester": {
      "type": "WEB_APP_BACKEND",
      "service": "CM_DA_MESSAGE_SERVICE",
      "asap": "yes"
  },
  "details": {
      "title": "New Message",
      "content": "{message|sub(0,15)|'...'}"
  }
}

Our system will pick this up and see its type is CM_BROWSER (Chat Message, Browser Origin). The user field defines who this message should go to; the "mq-uuid-btag" field is the unique ID generated for this request on behalf of the user. Next we have the user's name, which we can swap into the details.content block if it's referenced there.

The requester lets us track what caused this message to trigger - in this case, an automated alert coming from the Chat Message Service. We flag it as ASAP so it's bumped up in queue priority.

Lastly, we have the details. This field changes depending on the type of request: for DA (websocket) messages, it's a simple title and content field; for more complex things like SMS, we accept a from number, a bill-to account, and other fields.
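For reference, here's a minimal sketch of how this schema could map onto Go structs for decoding. The struct, field, and package names are my own illustration; only the JSON tags come from the payload above.

package notify

import "encoding/json"

// NotificationRequest mirrors the JSON payload shown above.
type NotificationRequest struct {
	Type      string    `json:"type"` // e.g. "CM_BROWSER"
	User      User      `json:"user"`
	Requester Requester `json:"requester"`
	Details   Details   `json:"details"`
}

type User struct {
	UID        int64  `json:"uid"`
	MQUUIDBTag string `json:"mq-uuid-btag"` // unique ID generated per request
	Name       Name   `json:"name"`
}

type Name struct {
	First  string `json:"first"`
	Middle string `json:"middle"`
	Last   string `json:"last"`
	Expr   string `json:"expr"`
}

type Requester struct {
	Type    string `json:"type"`    // e.g. "WEB_APP_BACKEND"
	Service string `json:"service"` // e.g. "CM_DA_MESSAGE_SERVICE"
	ASAP    string `json:"asap"`    // "yes" bumps queue priority
}

type Details struct {
	Title   string `json:"title"`
	Content string `json:"content"` // may hold template expressions like the sub() one above
}

// Decode parses an incoming request body into the struct above.
func Decode(body []byte) (NotificationRequest, error) {
	var req NotificationRequest
	err := json.Unmarshal(body, &req)
	return req, err
}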

Where do we send the message?

You've crafted your request, and you need to dispatch it - great news, the new Go-powered backend is ready to serve you. Every processing node is capable of accepting a user message and pumping it into the queue. They are set up behind a round-robin system, so POSTing this data to the backend URL (eg. https://kf-queue.services.kuby.ca) will add it to the queue (*authentication required*).
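As a caller, the dispatch looks something like the sketch below. The Bearer token header is a placeholder, since I'm not covering the real authentication scheme here:

package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"
)

func main() {
	payload := []byte(`{"type": "CM_BROWSER"}`) // trimmed; see the full payload above

	// The round-robin frontend decides which processing node takes the request.
	req, err := http.NewRequest(http.MethodPost, "https://kf-queue.services.kuby.ca", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer <token>") // placeholder: authentication required

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body)) // expect a FULFILLMENT_REQUEST reply
}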

From here, our Go daemon sends back an internal ID we use to track the request. We push this back to the client like so:

{
    "type": "FULFILLMENT_REQUEST",
    "details": {
        "tag" "{uuid}",
        "pos": 94012
    }
}

We send them their UUID and their position in the queue. From here, we have "Handler" nodes which will fulfill this request (more on that in part two).
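To give a feel for what a processing node does with that, here's a hedged sketch: write the raw request to a Kafka topic keyed by a fresh UUID, then answer in the FULFILLMENT_REQUEST shape. I'm assuming the segmentio/kafka-go and google/uuid libraries, a topic named notifications, and placeholder broker addresses; none of these are confirmed details of our production setup.

package notify

import (
	"encoding/json"
	"io/ioutil"
	"net/http"

	"github.com/google/uuid"
	"github.com/segmentio/kafka-go"
)

// writer pushes accepted requests onto the broker. Addresses and topic are illustrative.
var writer = &kafka.Writer{
	Addr:  kafka.TCP("kafka-1:9092", "kafka-2:9092"),
	Topic: "notifications",
}

// Enqueue accepts a validated request, queues it, and replies with a tracking tag.
func Enqueue(w http.ResponseWriter, r *http.Request) {
	body, err := ioutil.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}

	tag := uuid.New().String() // the internal ID we hand back to the caller

	// Keying by tag lets handler nodes correlate fulfillment later.
	if err := writer.WriteMessages(r.Context(), kafka.Message{
		Key:   []byte(tag),
		Value: body,
	}); err != nil {
		http.Error(w, "queue unavailable", http.StatusServiceUnavailable)
		return
	}

	// "pos" is hard-coded here; a real queue position needs broker offset bookkeeping.
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]interface{}{
		"type": "FULFILLMENT_REQUEST",
		"details": map[string]interface{}{
			"tag": tag,
			"pos": 0,
		},
	})
}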

Handler, handle me!

Our handler is a mixture of Python and Go. Go tracks the request; once it's marked "FULFILLED", it triggers an HTTP POST to our Python node, which accepts the fulfillment details and pushes counts into our database. Go then connects over the websocket (or whichever platform is in use), pushes the message using the UUID from the original request, and deletes it from our queued stack.
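Here's a rough sketch of the Go half of a handler node, again assuming segmentio/kafka-go. The deliver stub and the Python node's URL are placeholders for the real delivery logic and our internal endpoint:

package notify

import (
	"bytes"
	"context"
	"log"
	"net/http"

	"github.com/segmentio/kafka-go"
)

// deliver pushes the message out over the right channel (websocket, SMS, email, fax).
// The real implementation lives elsewhere; this stub stands in for it.
func deliver(m kafka.Message) error { return nil }

// RunHandler consumes queued requests and reports fulfillment to the Python node.
func RunHandler(ctx context.Context) {
	// A consumer group spreads messages across however many handler nodes we run.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"kafka-1:9092"},
		GroupID: "handlers",
		Topic:   "notifications",
	})
	defer r.Close()

	for {
		m, err := r.FetchMessage(ctx)
		if err != nil {
			return // context cancelled or reader closed
		}

		if err := deliver(m); err != nil {
			log.Printf("delivery failed for %s: %v", m.Key, err) // the Python monitor retries these
			continue
		}

		// Report FULFILLED so the Python node can push counts into the database.
		// The URL is a placeholder for our internal stats endpoint.
		if resp, err := http.Post("http://stats-node.internal/fulfillment", "application/json", bytes.NewReader(m.Key)); err == nil {
			resp.Body.Close()
		}

		// Committing the offset is our "delete from the queued stack".
		r.CommitMessages(ctx, m)
	}
}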

Now we have a primitive system with a message broker, round-robin entry points to servers (on DigitalOcean and AWS), and a large message queue. Using this, we're able to push out 100 million notifications a month.

Side note: we use numerous cloud providers so that if one of our regions goes down, we can automatically spin up elsewhere and keep delivering without failure.

At what cost?

As of April 2019, our average monthly infrastructure and development time costs are:

Developer Maintenance: $3,450

AWS Infrastructure: $695

DigitalOcean Infrastructure: $315

Total: $4,460 per month.

Stay tuned for part 2!

In part two, I'll go into a more in-depth technical analysis of how this all works, and why message queues/brokers are vital when you need to handle large amounts of real-time data.

