Growing up: from zero to 4 million requests/second

1010: Building from scratch

27. 10. 2022

The story of how we started a completely new product from scratch 8 years ago, how we scaled the tech behind it, and what you can learn from our experience.

Talk recap

This talk was automagically recapped with the help of Wordsmith AI.
Subject to paraphrasing and minor errors.
  • Zemanta, a small startup of just 25 people, decided to build something completely new in 2014. This new thing was a native DSP, a very advanced advertising product which marketers can use to reach millions of highly relevant users with their ads.
  • This system processes over 4 million bid requests per second, generates and processes over 150 terabytes of data per day, and does over 1 billion predictions per second. It runs on over 1000 very powerful machines spread across five data centres on three continents.
  • Outbrain, which is our parent company, did over $1 billion in revenue last year, and this system played a very important role in achieving that. The engineering team learned a lot through this process, but I want to focus on one very specific insight. The first thing we did when we began building this product was to organise a hackathon. We evaluated different technologies and decided which ones we were going to use.
  • We wanted to have a good mix of stuff that we are familiar with, and is not risky to us, but also try some new things which can bring additional value. So we chose Python, but also added AngularJS, Go, and some other new things.
  • Choosing the right tools is extremely important, and you should invest enough time in that. If you're successful, you will be stuck with these tools for a long time.
  • Eight engineers formed a team, and each had a specific superpower, like back end, front end, machine learning, or infrastructure. At the same time, they were all full stack and could work on any part of the code base.
  • We decided to go with the simplest possible option for architecture and to start with just enough structure to keep things under control.
  • We decided to go with the cloud, and chose Amazon web services specifically. We didn't care too much about the costs. We optimised the development process for fast iteration by keeping code reviews lightweight, unit testing only critical parts, and enabling all engineers to work on any part of the code base.
  • When the first version of the product was built, we onboarded the first users within less than one month, and then we started to gain traction and scale more significantly. We were acquired by Outbrain on July 25, 2017, and now we are growing rapidly.
  • There were some pretty aggressive goals set for us, so we had to make a few decisions on how to work with Outbrain. In most cases, we decided to stay independent and keep rolling on our own, because we felt that we needed as much autonomy as possible.
  • Outbrain had three major data centres and a dedicated department of 40 engineers that managed this infrastructure. However, our team decided to stay independent and manage its own infrastructure, because we felt that moving fast was important to deal with hypergrowth.
  • At every point in time, at least one part of the system was struggling to keep up with all this load, because it was simply built with very different assumptions in mind. In some cases, the problem was easily solved by adding more machines, but in other cases, the problem was more difficult to fix.
  • During this phase, we introduced a bunch of new tools to better deal with the scale we were facing, but we always tried to scale the existing tooling first, and then we started thinking about adding something new. The team also grew, and we added more experts in specific areas.
  • As we grew, we realized that the cloud was simply not going to work for our use case, so we replaced the serving infrastructure of our bidder with on-prem servers that we managed ourselves. When you're smaller, downtime is bad, but you can get up and running decently fast. As you scale, downtime starts costing you more and more, so you have to invest more in testing and monitoring, and practice blameless post mortems.
  • In 2021, we went public on NASDAQ, and this marked the transition into the next phase of our journey, which I call the mature phase. Because in the growth phase, you're doing everything in a hurry, and some of the solutions you introduce are not built in the most optimal way.
  • As the growth slows down, the speed of execution isn't the most important thing anymore, and instead the company is focusing on stability, robustness, compliance and efficiency, or in other words, on building the thing to last in the long run.
  • When you're building something from scratch, you need to be able to move very quickly, to iterate fast, to learn fast and to have fast feedback loops. At the same time, you don't really need to worry too much about building things perfectly.
  • At every phase, you need to optimise for different things, and that impacts how you work, what tools you use, and who you work with. Take these principles to heart, consider what phase you're in, and adjust your approach accordingly, and you will find success.
Florjan Bartol, VP Engineering @ Outbrain. Leading a 40+ person engineering department responsible for the Real-Time Bidding infrastructure: a distributed system processing 1.7M+ requests per second with 80% average YoY growth.

This talk was powered by

BTC City Ljubljana

Audience questions for Florjan

These questions were asked by the attending audience of about 180 participants on 27. 10. 2022. Maximize your next knowledge experience by attending our event in person.

How did you manage to achieve that great scalability?
I didn't dive into the technical details too much in this talk, but it's a good question. Essentially, the answer is horizontal scalability. We build our systems in layers, and the layer that serves traffic is built in a way that each node is as independent as possible, which means that, in theory, you can just keep adding more machines. That's how you get to the number I mentioned, 1,000 machines, each of which can independently serve traffic, so you can actually scale pretty well. We achieve that by making these machines very independent in a lot of ways: they're not doing any querying of the database; they have all the information they need loaded straight into memory, which happens periodically. And similar solutions like that.
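The node independence described above can be sketched as a small in-memory store that is refreshed periodically and swapped atomically, so request handling never touches a database. This is an illustrative sketch, not the actual bidder code; `SnapshotStore` and the loader callback are hypothetical names.

```python
import threading
import time


class SnapshotStore:
    """Illustrative sketch of a per-node in-memory dataset: instead of
    querying a database per request, the node periodically reloads a
    full snapshot (e.g. a file fetched from central storage) and swaps
    it in atomically."""

    def __init__(self, load_snapshot, refresh_seconds=300):
        self._load_snapshot = load_snapshot  # stand-in for "fetch the big file"
        self._refresh_seconds = refresh_seconds
        self._lock = threading.Lock()
        self._data = load_snapshot()         # initial load at startup

    def refresh(self):
        new_data = self._load_snapshot()     # build the new snapshot off to the side...
        with self._lock:
            self._data = new_data            # ...then swap it in atomically

    def get(self, key, default=None):
        with self._lock:                     # requests read memory only
            return self._data.get(key, default)

    def start_background_refresh(self):
        def loop():
            while True:
                time.sleep(self._refresh_seconds)
                self.refresh()
        threading.Thread(target=loop, daemon=True).start()
```

Because every node holds its own copy of the data, nodes share no runtime dependency, which is what makes "just add more machines" work.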
What system did you have in place to predict the bottlenecks?
We weren't planning for it. In the earliest stages, we didn't think we could ever get to this level of success. But when we started getting to 200, 300, 400 thousand users, we started thinking, "Yes, I think we can get to a million. The bottlenecks are real, and we're going to have to focus on them." We have a strong culture of monitoring across all systems, so we were simply observing these metrics and seeing where things were starting to break down. For example, if a process used to take 5 minutes and now it's taking 15 minutes, that's not going to work for very long! So we started focusing on that issue. Nothing more sophisticated than that, I'm afraid.
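The "5 minutes became 15 minutes" check above amounts to comparing a job's latest runtime against its historical baseline. A toy version of that kind of check, purely illustrative (real monitoring stacks do this with time-series alerts), might look like:

```python
def duration_alerts(history, factor=2.0):
    """Flag jobs whose latest runtime grew by `factor` or more relative
    to their earliest recorded run.

    `history` maps job name -> list of runtimes in minutes, oldest first.
    A hypothetical stand-in for a real monitoring/alerting system.
    """
    alerts = []
    for job, runtimes in history.items():
        if len(runtimes) < 2:
            continue  # need a baseline and at least one later run
        baseline, latest = runtimes[0], runtimes[-1]
        if baseline > 0 and latest / baseline >= factor:
            alerts.append(job)
    return alerts
```

A job that went from 5 to 15 minutes (a 3x slowdown) would trip the default 2x threshold and get flagged for attention.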
Why did you merge with Outbrain and what were the benefits?
We are still merging; this is still ongoing, and actually only starting. Up until about last year or so, we were still working mostly independently. But as I was explaining in the talk, as we scaled, the value of our own autonomy became less important, and we saw that there are certain benefits in merging with Outbrain. A concrete example, again, is around infrastructure. Managing our infrastructure independently gave us a lot of flexibility. To be concrete: when we were running out of servers, getting new servers through Outbrain was a very bureaucracy-heavy process that we had to go through. If we were managing our own infrastructure, which we were in the case of some data centres, we could just do it ourselves; we had three people on the team who handled this, and it got done. That's what I mean by autonomy that enables speed of execution.
How did you handle the database loads?
The simplest explanation is that the most intensive parts, so the serving part that actually handles 4 million bid requests and does the billion predictions per second, don't actually do a lot of communication with databases, because if they did, databases would definitely become a big bottleneck. Instead, we try to put all the information that these bidders need to properly respond to these requests into what is basically a large file, which is then loaded onto each machine and put into memory. So each machine is not querying a database; it's querying its own memory to get the information it needs. Writing is a bit different. What actually works in our use case is that each machine just writes a log of all the events that happened and pushes that to some central point, which currently is still S3. This is where the 150 terabytes of data generated per day that I mentioned come from.
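The write path described above, each machine appending events locally and periodically shipping a batch to a central store, can be sketched as follows. This is a minimal illustration, assuming a pluggable `upload` callback as a stand-in for a real S3 client call; it is not the actual system's code.

```python
import json


class EventLog:
    """Sketch of a per-machine event log: events are buffered locally
    and flushed as one batch to a central store (S3 in the talk).
    `upload` is a hypothetical stand-in for e.g. an S3 put call."""

    def __init__(self, upload, flush_size=1000):
        self._upload = upload
        self._flush_size = flush_size
        self._buffer = []

    def record(self, event):
        self._buffer.append(event)
        if len(self._buffer) >= self._flush_size:
            self.flush()                     # ship a full batch

    def flush(self):
        if not self._buffer:
            return
        # Serialize the batch as newline-delimited JSON and push it centrally.
        payload = "\n".join(json.dumps(e) for e in self._buffer)
        self._upload(payload)
        self._buffer = []
```

Batching like this keeps writes off the request path: bidders only append to local memory, and the expensive network upload happens once per batch rather than once per event.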
Have you looked into hybrid clouds to combine the affordability of local on-premises hardware with the quick growth the cloud enables?
A couple of years ago, we only had four data centres. We decided that we needed a new one, and we needed it soon. So the first version was built on AWS, and then, as it started to grow, we built an on-prem data centre there in the meantime. But we don't really run a hybrid setup, simply because it's more complex. Now it's much simpler: we have this data centre and a bunch of machines, and we need to predict far enough in advance, usually three to six months, how many machines we're going to need. As long as we're doing that, it's much simpler than actually trying to have this hybrid approach. So, I don't know, maybe sometime in the future this will make sense.
What is the usual process when you encounter an incident? Is the support on call?
So we do have on-call, and it's mostly based on teams. This means that whenever something goes wrong in one of the services or one part of our system, the relevant team will be paged, and that team will then try to fix the issue. Usually, I'd say in 90%-plus of cases, they're able to handle things independently and very quickly. But of course, every so often we have broader issues that take our system down for a lot longer. In that case, usually everybody starts jumping in, from a bunch of engineers to the execs, the CTO and so on, and it's basically a large number of people in a single Slack channel coordinating to fix the issue by thinking of as many possible solutions as we can. One example of that: a couple of years ago we had an incident, actually caused by a bug in the cloud provider's code, which took our whole system down for eight hours. That was a very fun time, and a costly one.
Have you managed to introduce continuous deployment?
We've had something like continuous deployment from the very beginning, and it has evolved over time. Our goal was to increase autonomy for engineers, so we wanted to enable them to deploy as quickly as possible. We have kept this system up until today, which means that any engineer can go into Slack and deploy a change in minutes. The current system has 1,000 servers, and any engineer can deploy a new version from our repository. Currently, the deploys take around 15 or 20 minutes, which is more than we would like, but still pretty impressive given the scale of our system.
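Rolling a new version across a fleet that size is typically done in batches with health checks between them. The sketch below is a hypothetical illustration of that pattern, not the team's actual tooling; `deploy_one` and `healthy` stand in for real deploy and health-check calls.

```python
def rolling_deploy(hosts, deploy_one, healthy, batch_size=50):
    """Hypothetical batched rollout: deploy to one batch at a time,
    verify the batch is healthy, and halt early if it is not,
    leaving the remaining hosts on the old version.

    Returns (hosts_deployed_to, rollout_succeeded).
    """
    deployed = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            deploy_one(host)                 # push the new version
            deployed.append(host)
        if not all(healthy(h) for h in batch):
            return deployed, False           # stop the rollout on a bad batch
    return deployed, True
```

With 1,000 servers and a batch size of 50, that is 20 sequential batches; if each batch takes under a minute to deploy and verify, the 15-20 minute total deploy time mentioned above is roughly what you would expect.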
How did the hackathon help you to pick the right tools?
The hackathon wasn't about building whatever you want. It was about deliberately testing and evaluating specific technologies. For example, I tested three different front-end frameworks, AngularJS, Ember, and React, and built simple apps to evaluate each one. Then I presented my findings to the rest of the team for discussion, and we decided which framework we wanted to pick.
Are you deploying services on bare metal, on virtual machines, or with Kubernetes?
We have many different services. When I said we moved away from the cloud, that's only for the serving layer, so a lot of our services are still on AWS, although we are trying to move away from that as well. On AWS, we do use Docker and Kubernetes, or rather some AWS version of it. When it comes to the bidder, it's actually written in Go, which means it compiles to just a single binary. And because each machine in the cluster is running one instance of the bidder, there is simply no need for Kubernetes orchestration or anything like that.
What's something that you really miss from the previous setup? And what's something that you really don't?
What we really miss is that when you're small, things just happen so quickly. All eight engineers on that team were literally in the same room all the time; that was before COVID, before the hybrid working mode, so we were actually there five days a week. An incredible amount of ideas, collaboration and learning happens this way. When your team grows to five or six times that size, these things become much more difficult. What we don't miss so much is, again, the lack of resources I was mentioning. As a startup, you are always constrained; you just have less of everything, less money, I would say. As part of a bigger company, even though things are starting to take longer, some things are really nice: they're just handled for you, or you can simply afford them.
How do you manage to mitigate downtime or maximise uptime when the breaking changes occur?
There are a lot of ways to design a change, and we try to design all our changes so that we can avoid breaking changes. By far, that's the best option; right now, I can't really think of a case where we would need to shut down our system to handle a big change like that. Mostly it's just running things in parallel. Say we have an old data pipeline that we're now trying to improve: we build the new data pipeline in parallel and run both of them at the same time, and when we are sure that they produce the same results and everything works 100%, then we just shut down the old one and move on with our lives! That translates to many different cases too: even if you're making some change in your code, put in the code for the new way, run it in parallel, then delete the old flow.
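The parallel-run approach described above is often called shadow or dual-run testing: both implementations see the same inputs, only the old one's output is actually used, and mismatches are collected until there are none. A minimal sketch, with illustrative names:

```python
def shadow_compare(inputs, old_pipeline, new_pipeline):
    """Sketch of the parallel-run pattern: feed identical inputs to the
    old and new implementations and collect any disagreements. Only
    once mismatches stay empty for long enough would the old pipeline
    be shut down. `old_pipeline`/`new_pipeline` are hypothetical callables.
    """
    mismatches = []
    for item in inputs:
        old_out = old_pipeline(item)       # this result is the one actually served
        new_out = new_pipeline(item)       # this result is only observed
        if old_out != new_out:
            mismatches.append((item, old_out, new_out))
    return mismatches
```

An empty result over a representative stretch of traffic is the signal that it is safe to delete the old flow.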
Why did you decide to actually hire more engineers to handle the on-premise approach?
When we were on the cloud, I think we actually had something like two engineers working on that. Then we moved to on-prem, again for the serving layer, and we expanded that team to three or four engineers. So we only had to expand this team slightly. People would expect that when you go from cloud to on-prem, you need to dramatically increase your team; for us, that was not the case. Because, while I said we didn't work with Outbrain at all, we mostly didn't, the truth is that we used their data centres, and they still handled the very lowest-level things for us.
If you were to start the journey from scratch again, would you stick to some of the original approaches?
We didn't actually throw all the standards out the window in the beginning and just hack things together. As I mentioned, we still did code reviews, even though we kept them very lightweight. We still wrote unit tests, even though we didn't write them for all of our code. So it's a spectrum: you don't have to either skip these things entirely or do them 100%. In the early stages, the slider was just moved more towards less following of code practices, unit tests, and so on. As we grew, we moved more to the other side of the spectrum and put more emphasis on those things.