Airflow at Adyen: Adoption as ETL/ML Orchestrator
Adyen for Platforms is our solution for peer-to-peer marketplaces, on-demand services, crowdfunding platforms, and any other platform business model. It enables merchants to onboard sellers, service providers, and contractors as sub-merchants, and to accept payments on their behalf.
When starting our development of Adyen for Platforms, we focused on a key pillar of the Adyen formula: launch fast and iterate. We made sure that we were able to build and ship the product fast, and improve and extend it quickly based on merchant feedback.
During this initial phase, performance, system load, and number of requests were not real issues yet, because the number of requests was expected to be low in the beginning. However, we took the scalability of the platform into account from the start, so that we would not make any decision that would make scaling up impossible.
Therefore, after the launch (or initial phase), it was time to focus on the future: how can we make sure that we are able to handle it if the number of requests doubles? What if it becomes 10 times greater? 20 times?
A clear candidate for improvement was the “API layer.” After all, all requests coming from merchants end up on an external system. In the initial phase, to launch fast, we decided that those external systems would act as proxies: all they would do was forward the request to an internal application that would synchronously process the request. Once the processing was done, a response was returned to the merchant containing either a success or validation errors.
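This approach has the advantage that it is simple to reason about and explain to merchants: every request receives its final outcome, either a success or validation errors, in the response itself.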
However, this approach comes with critical drawbacks, the main one being that any database maintenance we perform has a direct impact on our merchants. During such maintenance, requests cannot be executed and will therefore fail. The merchant then has to resend each request after the maintenance is finished, which means queuing up requests on their side and reprocessing them later. Handling this scenario would complicate merchants' integration with our platform.
Additionally, we want to be able to process requests at the pace we choose ourselves and have more control over when and how requests are executed.
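To overcome these challenges, we decided to go with an asynchronous API processing approach and split request processing into two steps: first, the external system validates the request and queues it locally; second, an asynchronous process forwards the queued requests to the internal system, which executes them.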
Asynchronous API processing
Even though the first step is rather simple, we need to be careful and take all edge cases into account. If we receive an incorrect address or phone number, we can return a validation error to the merchant, who will need to send a new request with corrected information. But what should we do if we receive an update request for an account that doesn’t seem to exist yet? We cannot reject it right away, since it’s possible that the account creation is queued but not executed yet.
In such cases, we also queue the request and let the internal system make the final decision on whether it is valid. In the meantime, we return an HTTP 202 (Accepted) response to the merchant, letting them know that the request has been queued. This is an important change that merchants need to take into account: with this approach, a request may be accepted even though it will eventually lead to a validation error. Merchants should no longer rely solely on our synchronous response, but should also listen to our notifications.
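The first step can be sketched roughly as follows. This is a minimal illustration, not our production code: the `ExternalApi` class, the request fields, the phone check, and the 422 status code used for validation errors are all assumptions made for the example. Only the 202 response for queued requests comes from the description above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class ExternalApi:
    """Hypothetical merchant-facing system that owns a local queue."""

    queue: Deque[Dict] = field(default_factory=deque)

    def handle(self, request: Dict) -> Tuple[int, List[str]]:
        errors = self.validate(request)
        if errors:
            # Errors that can be detected without internal state are
            # returned synchronously (422 is an assumed status code).
            return 422, errors
        # Deliberately no "does the account exist?" check here: the
        # create request may still be queued, so the internal system
        # makes that decision later and reports it via a notification.
        self.queue.append(request)
        return 202, []

    @staticmethod
    def validate(request: Dict) -> List[str]:
        errors = []
        if request.get("type") not in {"create", "update"}:
            errors.append("unknown request type")
        if not request.get("account"):
            errors.append("missing account")
        phone = request.get("phone")
        if phone is not None and not phone.lstrip("+").isdigit():
            errors.append("invalid phone number")
        return errors
```

An update for an account the external system has never seen is still answered with 202; only malformed input is rejected immediately.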
If the merchant receives a successful response code, they have the guarantee that their request has been queued and will be processed by our internal system. That’s where the second step kicks in: an asynchronous process reads from the local queue and forwards all requests to the internal system. This process can be stopped while the internal system is under maintenance, for example during database maintenance. That way, we can still accept and queue requests; we simply don’t forward them until the maintenance is over. The only impact on the merchant experience is a slight delay in receiving webhook notifications.
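A pausable forwarder along these lines might look like the sketch below. The `Forwarder` class and its `send` callback are hypothetical names for this example, and the queue is an in-memory `deque` rather than durable storage.

```python
import threading
from collections import deque
from typing import Callable, Deque, Dict


class Forwarder:
    """Hypothetical process that drains the local queue of an external system."""

    def __init__(self, queue: Deque[Dict], send: Callable[[Dict], None]):
        self.queue = queue
        self.send = send  # assumed to deliver one request to the internal system
        self._running = threading.Event()
        self._running.set()

    def pause(self) -> None:
        """Stop forwarding, e.g. at the start of database maintenance."""
        self._running.clear()

    def resume(self) -> None:
        self._running.set()

    def drain_once(self) -> int:
        """Forward queued requests in FIFO order; return how many were sent."""
        sent = 0
        while self.queue and self._running.is_set():
            self.send(self.queue[0])  # deliver first ...
            self.queue.popleft()      # ... and dequeue only after success
            sent += 1
        return sent
```

While the forwarder is paused, the external system keeps appending to the queue, so nothing is lost; the backlog is forwarded in order once `resume()` is called.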
When the internal system receives a request, it doesn’t execute it immediately, but queues it locally first. The reason for this is that we want to execute requests in the order we received them. If we received a “create” request on external machine A and an “update” request on external machine B for the same account, we need to be sure that we don’t execute the update before the account is created. By having a queue in the internal system, coupled with the knowledge of the point in time up to which each external system is “up-to-date” (that is, has forwarded everything it received), we have all the information we need to achieve this.
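One way to combine the internal queue with this "up-to-date" knowledge is a per-machine watermark, sketched below. This is an interpretation for illustration only: the `InternalQueue` class, integer timestamps, and the assumption that clocks are comparable across external machines are not taken from our actual implementation.

```python
import heapq
from typing import Dict, List, Tuple


class InternalQueue:
    """Hypothetical internal queue that releases requests in receive order."""

    def __init__(self, machines: List[str]):
        self._heap: List[Tuple[int, int, Dict]] = []
        # Per external machine: the time up to which it has forwarded everything.
        self._up_to_date = {machine: 0 for machine in machines}
        self._counter = 0  # tie-breaker so request dicts are never compared

    def enqueue(self, received_at: int, request: Dict) -> None:
        heapq.heappush(self._heap, (received_at, self._counter, request))
        self._counter += 1

    def report_up_to_date(self, machine: str, until: int) -> None:
        """The machine states it has forwarded all requests received up to `until`."""
        self._up_to_date[machine] = max(self._up_to_date[machine], until)

    def pop_ready(self) -> List[Dict]:
        """Return requests no machine can still precede, oldest first."""
        safe_until = min(self._up_to_date.values())
        ready = []
        while self._heap and self._heap[0][0] <= safe_until:
            ready.append(heapq.heappop(self._heap)[2])
        return ready
```

With this scheme, an update forwarded early by machine B stays queued until machine A has confirmed that it holds no older request, so a create received earlier on A is always released first.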
The last step is for the internal system to pick up items from the queue and execute them. The simplest approach would be to read and execute requests in a single thread, but that wouldn’t scale. The solution we chose is to have one thread read from the database and then distribute the requests among multiple threads, which execute them in parallel and send a notification to the merchant once the processing is done.
The main thread (i.e., the one reading from the database) also acts as an orchestrator, as it can decide to use the same thread for multiple requests. To use the example given above again, two requests impacting the same account should be executed in order, and therefore cannot be executed by two different threads. Thus, the main thread groups the requests based on some key (for example, the account), and makes sure that all requests in the same group are executed sequentially by the same thread.
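The grouping idea can be sketched as follows. Note the simplification: instead of a main thread that decides dynamically which worker gets which group, this example pins each account to a worker by hashing the key, which gives the same per-account ordering guarantee. The `Dispatcher` class, the `execute` callback, and the request fields are hypothetical.

```python
import queue
import threading
import zlib
from typing import Callable, Dict, List


class Dispatcher:
    """Hypothetical orchestrator: same account, same worker, in order."""

    def __init__(self, workers: int, execute: Callable[[Dict], None]):
        self._execute = execute  # assumed to process a request and notify the merchant
        self._queues: List[queue.Queue] = [queue.Queue() for _ in range(workers)]
        self._threads = [
            threading.Thread(target=self._run, args=(q,), daemon=True)
            for q in self._queues
        ]
        for thread in self._threads:
            thread.start()

    def dispatch(self, request: Dict) -> None:
        # A stable hash of the grouping key selects the worker, so all
        # requests for one account land on the same FIFO queue.
        index = zlib.crc32(request["account"].encode()) % len(self._queues)
        self._queues[index].put(request)

    def _run(self, q: queue.Queue) -> None:
        while True:
            request = q.get()
            if request is None:  # shutdown signal
                break
            self._execute(request)

    def shutdown(self) -> None:
        """Let workers finish their backlog, then stop them."""
        for q in self._queues:
            q.put(None)
        for thread in self._threads:
            thread.join()
```

Requests for different accounts run in parallel on different workers, while a create and a later update for the same account are always executed sequentially by one worker.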
By moving from synchronous to asynchronous API processing, we were able to decouple the receipt of requests from their actual execution. This gives us better control over how and when requests are executed. It also improves the merchant’s experience, as they are no longer impacted when we need to perform internal maintenance. This solution has now been live for a while and has helped us scale up our application.