Send a message from one microservice to another in Azure Service Fabric (APIs)
What is the best architecture, using Service Fabric, to guarantee that the message I need to send from Service 1 (mostly API) to Service 2 (mostly API) never gets lost (black arrow)?
1.a. Make service 1 and 2 stateful services. Is it a bad call to have a stateful Web API?
1.b. Use Reliable Collections to send the message from API code to Service 2.
2.a. Make Service 1 and 2 stateless services
2.b. Add a third service
2.c. Send the message over a queuing system (e.g. Service Bus) from service 1
2.d. To be picked up by the third service. Notice: this third service would also have access to the DB that service 2 (API) has access to. Not an ideal solution for a microservice architecture, right?
3.a. Any other ideas?
Keep in mind that the goal is to never lose the message, not even when service 2 is completely down or temporarily removed… so no direct calls.
I’d introduce a third (Stateful) service that holds a queue, ‘service 3’.
Service 1 would enqueue the message. Service 3 would run an infinite loop, trying to deliver the message to service 2.
You could use the pub/sub package for this. Service 1 is the publisher, Service 2 is the subscriber.
(If you rely on an external queue system like Service Bus, you’ll lower the overall availability of the system. Service Bus downtime would lead to messages being undeliverable.)
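The store-and-forward idea in this answer can be sketched in a few lines. This is only an in-memory illustration in Python, not Service Fabric code: `ForwardingQueue`, `pump_once`, and the `deliver` callback are hypothetical names, and in a real stateful service the pending messages would have to live in replicated storage rather than in a plain `deque`.

```python
from collections import deque
from typing import Callable, Deque


class ForwardingQueue:
    """Hypothetical stand-in for 'service 3': holds messages until delivered."""

    def __init__(self, deliver: Callable[[str], bool]) -> None:
        self._pending: Deque[str] = deque()
        # Assumption: deliver() returns True only once service 2 accepted the message.
        self._deliver = deliver

    def enqueue(self, message: str) -> None:
        """Called by service 1; the message stays queued until delivery succeeds."""
        self._pending.append(message)

    def pump_once(self) -> int:
        """Deliver pending messages in order, stopping at the first failure."""
        delivered = 0
        while self._pending:
            message = self._pending[0]   # peek, do not remove yet
            if not self._deliver(message):
                break                    # service 2 is unavailable; keep the message
            self._pending.popleft()      # remove only after confirmed delivery
            delivered += 1
        return delivered

    def pending(self) -> int:
        return len(self._pending)
```

The "infinite loop" would simply call `pump_once()` repeatedly with a delay between passes. Because a message is removed only after `deliver` reports success, a crash between delivery and removal leads to a duplicate delivery rather than a lost message, so service 2 should tolerate duplicates.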
I don’t think any solution is 100% guaranteed to never lose a message between two parties. Even with a service bus between the two services, there is always a chance (possibly very small, but never zero) that the service bus goes down, or that the connection to the service bus goes down. That said, some models are of course much less likely to lose a message than others, but you can’t get around the fact that you still have to handle errors in the client.
In fact, Service Fabric fault handling is mainly designed around clients retrying communication, rather than having the service or an intermediary do that. There are many reasons for this (I guess), but one is the nature of distributed, replicated, reliable services. If a service primary goes down, a replica picks up the responsibility, but it won’t know what the primary was doing at the moment it died (unless the primary replicated its state, but it might have died even before that). The only one that really knows what it wants to do in this scenario is the client. The client knows what it is doing and can react to different fault scenarios in the service. In Fabric Transport, most known exceptions that could “naturally” occur, such as the service dying or the network cable being cut by the janitor, are actually retried automatically. This includes re-resolving the address in case the service primary was replaced with a secondary.
The same goes for a scenario where you introduce a third service or a service bus. What if the network goes down before the message has completely reached the service? In this case only the client knows that something went wrong and what it intended to send. What if it goes down after the message reached the service but before the response was sent? In this case the client has to assume the message never arrived and try to resend it. This is also why service methods are recommended to be idempotent: the same client may make the same call several times, and the outcome should be the same as if it had been made once.
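To make the retry-plus-idempotency argument concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `IdempotentReceiver` plays the role of service 2, `send_with_retry` plays the client in service 1, and deduplication by a client-generated message ID is one common way to get idempotency, not something the answer prescribes.

```python
import uuid
from typing import Callable, Dict


class IdempotentReceiver:
    """Hypothetical service 2: remembers message IDs it has already applied."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.apply_count = 0  # how many times the side effect actually ran

    def handle(self, message_id: str, payload: str) -> str:
        if message_id in self._results:
            # Duplicate caused by a client retry: return the earlier result.
            return self._results[message_id]
        self.apply_count += 1
        result = f"processed:{payload}"
        self._results[message_id] = result
        return result


def send_with_retry(send: Callable[[str, str], str], payload: str,
                    max_attempts: int = 3) -> str:
    """Hypothetical client: retries on ConnectionError with the SAME message ID."""
    message_id = str(uuid.uuid4())  # generated once, reused on every attempt
    last_error = None
    for _ in range(max_attempts):
        try:
            return send(message_id, payload)
        except ConnectionError as error:
            last_error = error  # the call may or may not have reached the service
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

If the response to the first attempt is lost after the receiver has already processed the message, the retry reaches `handle` with the same ID and the side effect is not applied a second time.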
Even if you were to introduce an intermediary, like the service bus, there is still the same risk that the service bus goes down or, more likely, that the network connection to the service bus goes down. So the client needs to retry, and once it has retried a number of times, all it can do is put the message in a queue of failed messages, simply log it, or throw an exception back to the original caller (in your scenario, the browser).
OK, that was me being pessimistic. All of the things above could happen; it’s just that some of them are not very likely.
On to your questions:
1) The problem with making a stateless service stateful is that you now have to handle partitions in your caller. You can put up HTTP listeners for stateful services, but you have to include the partition and replica information in the URI, and that won’t work with the load balancer, so in this case the browser has to select the partition when calling the API. Not an ideal solution.
2) Yes, you could do this, i.e. introduce something in between that queues messages for you. Nothing says that a Service Bus or a database is more reliable than a Stateful service with a reliable queue; it’s up to you to go with what you are most comfortable with. I would go for a Stateful service, just so I can easily keep everything within my SF application. But again, this is not 100% protection from a disgruntled janitor with scissors; for that you still need clients that can handle faults.
3) Make sure you have a way of handling the errors (retry) and of logging or storing the messages that fail (after retries) in the client (Service 1).
3.a) One way would be to have it store the failed messages locally on the node it is running on and periodically (in RunAsync, for instance) try to re-send them. This is dangerous if that node is completely nuked and loses its data, though, since that data won’t be replicated.
3.b) Another would be to use semantic logging with ETW and include enough data in the events to be able to re-create the message from the log, and to build some feature, a manual UI perhaps, where you can re-run it from the logged information. Much like you would retry a failed message on an error queue in a service bus.
3.c) Store the failed messages in anything else (database, service bus, queue) that doesn’t fail for the same reasons as your communication with Service 2.
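The options under 3) share one shape: retry a few times, park what still fails, and replay it later. A rough Python sketch of that shape follows, with hypothetical names throughout (`FailedMessageStore`, `send_or_park`, `replay_failed`). The store here is an in-memory list purely for illustration; as noted in 3.a and 3.c, a real one needs to be durable and should not share a failure mode with the call to Service 2.

```python
from typing import Callable, List


class FailedMessageStore:
    """Hypothetical fallback store; a real one would be durable (database, queue)."""

    def __init__(self) -> None:
        self._messages: List[str] = []

    def add(self, message: str) -> None:
        self._messages.append(message)

    def drain(self) -> List[str]:
        """Return all parked messages and clear the store."""
        messages, self._messages = self._messages, []
        return messages


def send_or_park(send: Callable[[str], None], message: str,
                 store: FailedMessageStore, max_attempts: int = 3) -> bool:
    """Try to send; after max_attempts failures, park the message in the store."""
    for _ in range(max_attempts):
        try:
            send(message)
            return True
        except ConnectionError:
            continue  # transient fault assumed; try again
    store.add(message)
    return False


def replay_failed(send: Callable[[str], None], store: FailedMessageStore) -> int:
    """Periodic job: re-send parked messages; failures are parked again."""
    replayed = 0
    for message in store.drain():
        if send_or_park(send, message, store, max_attempts=1):
            replayed += 1
    return replayed
```

A background loop in Service 1 could call `replay_failed` on a timer. Since a message may be re-sent after an ambiguous failure, this only stays correct if Service 2 handles duplicates safely.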
My main point here (and maybe I should have started with it) is that there are plenty of scenarios where only the client knows enough to handle the situation. So, make sure you have a strategy for handling faults in your clients.