
Connecting Kubernetes services with linkerd


April 10, 2017

This article was contributed by Tom Yates


CloudNativeCon+KubeCon
When a monolithic application is divided up into microservices, one new problem that must be solved is how to connect all those microservices to provide the old application's functionality. Kubernetes provides service discovery, but the results are presented to the pods via DNS, which can be a bit of a blunt instrument; DNS also doesn't provide much beyond round-robin access to the discovered services. Linkerd, which is now officially a Cloud Native Computing Foundation project, is a transparent proxy that solves this problem by sitting between those microservices and routing their requests. Two separate CloudNativeCon+KubeCon events — a talk by Oliver Gould, briefly joined by Oliver Beattie, and a salon hosted by Gould — provided a view of linkerd and what it can offer.

Gould, one of the original authors of linkerd, used to work for Twitter in production operations during its crazy growth phase, when the site was down a lot. During the 2010 World Cup, every time a goal was scored, Twitter went down. He was a Twitter user, and after finding himself rooting for 0-0 draws because they would keep the site up, realized that Twitter had operations problems, and he could probably help. So he went to work for them.

In those days, Twitter's main application was a single, monolithic program, written in Ruby on Rails, known internally as the monorail. This architecture was already known to be undesirable; attempts were being made to split the application up, but to keep stability everything had a slow release cycle — with new code often taking weeks to get into production — except the monorail, which was released daily. So anything that anyone wanted to see in production in any reasonable timescale got shoehorned into the monorail, which didn't help the move to microservices. It also didn't help that the people who were trying to deploy microservices had to reinvent their own infrastructure — load-balancing, handling retries and timeouts, and the like — and these are not easy problems, so some of them were not doing it very well.

So Gould wrote a tool called Finagle, which is a fault-tolerant, protocol-agnostic remote procedure call system that provides all these services. It helped, so Twitter ended up fixing a lot of extant problems inside Finagle, and finally everything at Twitter ended up running on top of it. There are a number of consequent benefits to this; Finagle sees nearly everything, so you have a natural instrumentation point for metrics and tracing. However, Finagle is written in Scala, which Gould concedes is "not for everyone".

He left Twitter convinced that well-instrumented glue that is built to be easily usable can be helpful; turning his attention to the growing use of Docker and Kubernetes, he wrote linkerd to provide Finagle-like functionality for HTTP requests by acting as an intelligent web proxy. The fundamental idea is that applications shouldn't have to know who they need to talk to; they should ask linkerd for a service, and linkerd should take care of tracking who is currently offering that service, selecting the best provider, transporting the request to that provider, and returning the answer.
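
As a rough illustration of that idea, the sketch below (an assumption-laden example, not anything from linkerd's documentation) has an application ask for a logical "users" service through a local linkerd instance acting as an HTTP proxy; the proxy address, the port 4140, and the service name are all placeholders to adjust for a real deployment.

    # Minimal sketch: the application names the service it wants and lets a
    # local linkerd (assumed here to listen as an HTTP proxy on localhost:4140)
    # discover and pick a concrete instance. Port and service name are
    # placeholders, not guaranteed defaults.
    import urllib.request

    proxy = urllib.request.ProxyHandler({"http": "http://localhost:4140"})
    opener = urllib.request.build_opener(proxy)

    # Address the logical service name, not a pod IP.
    with opener.open("http://users/profiles/42", timeout=2.0) as resp:
        print(resp.status, resp.read()[:200])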

Facilities that linkerd provides to assist with this include service discovery, load balancing, encryption, tracing and logging, handling retries, expiration and timeouts, back-offs, dynamic routing, and metrics. One of the more elegant wrinkles Gould mentioned was that it can do per-request routing; for example, an application can send an HTTP header informing linkerd that this particular request should go via some alternative path, possibly a staging or testing path. Many statistics are exported; projects like linkerd-viz give a dashboard-style view of request volumes, latencies, and success rates.
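
To make the per-request routing idea concrete, here is a hedged sketch of what such a header might look like; the l5d-dtab header and the delegation-rule syntax follow linkerd 1.x conventions, and the service names are invented for illustration, so check the documentation for the version actually deployed.

    # Sketch of per-request routing: the same call as before, but carrying a
    # routing-override header asking linkerd to send this one request to a
    # staging instance instead. Header name and rule syntax follow linkerd 1.x
    # conventions; the service names are made up.
    import urllib.request

    req = urllib.request.Request(
        "http://users/profiles/42",
        headers={"l5d-dtab": "/svc/users => /svc/users-staging"},
    )
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": "http://localhost:4140"})
    )
    with opener.open(req, timeout=2.0) as resp:
        print(resp.status)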

Deadlines are something a microservice connector needs to care about. The simplistic approach of having each individual service have its own timeouts and retry budgets doesn't really work when multiple services contribute to the provision of a business feature. If the top service's timeout triggers, the fact that a subordinate service is merrily retrying the database for the third time according to its own timeout and retry rules is completely lost; the top service times out and the end-user is disappointed, while the subordinate transactions may still be needlessly trying to complete. Linkerd, because it is mediating all these transactions, allows the setting of per-feature timeouts, so that each service contributing toward that feature has its execution time deducted from the feature timeout, and the whole chain can be timed out when this expires. Services that are used in providing more than one feature can take advantage of more generous timeouts when they are invoked to provide important features, without having to permit such a long wait when they're doing something quick and dirty.
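
The following toy sketch, which is not linkerd code, shows the underlying bookkeeping: one per-feature budget is handed down the chain, each hop deducts the time it spent, and the whole chain gives up as soon as the budget is exhausted.

    # Illustrative sketch (not linkerd's implementation): a single per-feature
    # deadline is passed down the call chain; each hop deducts its elapsed time
    # and the remainder becomes the budget for the next hop.
    import time

    FEATURE_DEADLINE = 0.500  # 500ms for the whole feature, top to bottom

    def call_downstream(name, budget_left, work_time):
        """Pretend to call one service, honouring the remaining budget."""
        if budget_left <= 0:
            raise TimeoutError(f"{name}: feature deadline already spent")
        start = time.monotonic()
        time.sleep(min(work_time, budget_left))   # stand-in for real work
        spent = time.monotonic() - start
        if spent >= budget_left:
            raise TimeoutError(f"{name}: exceeded remaining budget of {budget_left:.3f}s")
        return budget_left - spent                # what is left for the next hop

    budget = FEATURE_DEADLINE
    for service, work in [("frontend", 0.1), ("profiles", 0.2), ("database", 0.3)]:
        try:
            budget = call_downstream(service, budget, work)
            print(f"{service}: ok, {budget*1000:.0f}ms left for the rest of the chain")
        except TimeoutError as error:
            print(error)
            break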

Retries are also of concern. The simplistic approach of telling a service to retry after failure a finite number of times (say three) fails when things go bad, because each retry decision is taken in isolation. Just as the system is being stressed, the under-responsive service will be hit with four times the quantity of requests it normally gets, as everyone retries it. Linkerd, seeing all these requests as it does, can set a retry budget, allowing up to (say) 20% of requests to retry, thus capping the load on that service at 1.2 times normal. It makes no sense to set a traditional retry limit at a non-integer value like 1.2; this can only meaningfully be done by an overlord which sees and mediates everything.
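
A minimal sketch of the budget idea, again not linkerd's actual code, might track original requests and retries together and only permit a retry while retries remain under 20% of traffic:

    # Illustrative sketch of a retry budget: a retry is allowed only while total
    # retries stay under 20% of the original request count, capping the extra
    # load sent downstream at roughly 1.2 times normal.
    class RetryBudget:
        def __init__(self, ratio=0.20):
            self.ratio = ratio
            self.requests = 0
            self.retries = 0

        def record_request(self):
            self.requests += 1

        def can_retry(self):
            return self.retries < self.ratio * self.requests

        def record_retry(self):
            self.retries += 1

    budget = RetryBudget()
    failed_fast = 0
    for i in range(100):
        budget.record_request()
        ok = (i % 3 != 0)              # pretend every third request fails
        if not ok:
            if budget.can_retry():
                budget.record_retry()  # still within the 20% budget: retry
            else:
                failed_fast += 1       # budget spent: fail fast, don't pile on

    print(f"retries used: {budget.retries}, failed fast: {failed_fast}")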

This high-level view also allows linkerd to propagate backpressure. Consider a feature provided by several stacked microservices, each of which invokes the next one down the stack. When a service somewhere down in the stack has reached capacity, applying backpressure allows that service to propagate the problem as far up the stack as possible. This allows users whose requests will exceed system capacity to quickly see a response informing them that their request will not be serviced, and thus add no further (pointless) load to the feature stack, instead of sitting there waiting for a positive response that will never come, and overloading the feature while they do so. At this point in the talk, an incredulous question from the audience prompted Gould to confirm that all this functionality is in the shipping linkerd; it's not vaporware intended for some putative future version.
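
The toy example below gives the flavor of backpressure propagation (again a sketch, not anything from linkerd): when a downstream stage is at capacity, its caller returns an immediate "overloaded" answer rather than queueing work that cannot complete in time, and that refusal travels all the way back to the user.

    # Toy sketch of backpressure: a bounded queue stands in for a service at
    # capacity; callers get an immediate refusal instead of waiting forever.
    import queue

    downstream = queue.Queue(maxsize=3)   # this "service" can hold only 3 requests

    def call_downstream(payload):
        try:
            downstream.put_nowait(payload)
            return "accepted"
        except queue.Full:
            # Backpressure: refuse at once and let the refusal travel upstream.
            return "503 overloaded, try again later"

    def call_frontend(payload):
        # The frontend neither retries nor buffers; it passes the pressure on.
        return call_downstream(payload)

    for n in range(5):
        print(f"request {n}: {call_frontend(n)}")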

Gould's personal pick for most important feature in linkerd is request-aware load balancing. Because linkerd mediates each request, it knows how long each takes to complete, and it uses this information to load-balance services on an exponentially-weighted moving average (EWMA) basis, developed at Twitter. New nodes are drip-fed an increasing amount of traffic until responsiveness suffers, at which point traffic is backed off sharply. He presented data from a test evaluating latencies for three different load-balancing algorithms: round-robin, queue depth, and EWMA, in an application where large numbers of requests were distributed between many nodes, one of which was forced to deliver slow responses. Each algorithm failed to deliver prompt responses for a certain percentage of requests, but the percentage in question varied notably between algorithms.
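
A stripped-down sketch of the idea (not Finagle's or linkerd's actual algorithm, which also weights by in-flight requests and drip-feeds new nodes) is to keep a moving average of each node's observed latency and send the next request to the node with the lowest average:

    # Sketch of latency-aware balancing in the spirit of EWMA: each node's
    # observed latency is folded into a moving average, and the next request
    # goes to the node with the lowest average. Real peak-EWMA balancers also
    # account for in-flight requests; this is just the core idea.
    import random

    class Node:
        def __init__(self, name, typical_latency):
            self.name = name
            self.typical_latency = typical_latency
            self.ewma = 0.010     # optimistic initial estimate (10ms)

        def handle(self):
            # Simulate one request and fold its latency into the average.
            latency = random.gauss(self.typical_latency, self.typical_latency / 10)
            alpha = 0.3           # smoothing factor: how fast old samples decay
            self.ewma = alpha * latency + (1 - alpha) * self.ewma

    nodes = [Node("a", 0.020), Node("b", 0.020), Node("slow", 0.500)]
    counts = {node.name: 0 for node in nodes}

    for _ in range(1000):
        node = min(nodes, key=lambda n: n.ewma)   # lowest average latency wins
        counts[node.name] += 1
        node.handle()

    print(counts)   # the slow node ends up with very few requests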

The round-robin approach only succeeded for 95% of requests; Gould noted that: "Everywhere I've been on-call, 95% is a wake-me-up success rate, and I really, really don't like being woken up." Queue-depth balancing, where new requests are sent to the node that is currently servicing the fewest requests, improved things: 99% of clients got typically fast responses. EWMA did better still, with more than 99.9% of clients seeing no sharp increase in latency.

Linkerd is relatively lightweight, using about 100MB of memory in normal use. It can be deployed in a number of ways, including either a centralized resilient cluster of linkerds, or one linkerd per node. Gould noted that the best deployment depends on what you're trying to do with linkerd, but that many people prefer one linkerd per node because TLS is one of the many infrastructural services that linkerd provides, so one-per-node lets you encrypt all traffic between nodes without applications having to worry about it.

One limitation of linkerd is that it only supports HTTP (and HTTPS) requests; it functions as a web proxy, and not every service is provided that way. Gould was very happy to announce the availability of linkerd-tcp, a more-generic proxy which tries to extend much of linkerd's functionality into general TCP-based services. It's still in beta, but attendees were encouraged to play with it.

Gould was open about the costs of a distributed architecture: "Once you're in a microservice environment, you have applications talking to each other over the network. Once you have a network, you have many, many, many, many more failures than you did when you just linked to a library. So if you don't have to do it, you really shouldn't... Microservices are something you have to do to keep your organization fast when managing builds gets too hard."

He was equally open about linkerd having costs of its own, not least in complexity. In response to being asked at what scale point the pain of not having linkerd is likely to outweigh the pain of having it, he replied that it was when your application is complex enough that it can't all fit in one person's head. At that point, incident responses become blame games, and you need something that does the job of intermediating between different bits of the application in a well-instrumented way, or you won't be able to find out what's wrong. While it was nice to hear another speaker being open about containerization not being some panacea, if I had a large, complex ecosystem of microservices to keep an eye on, I'd be very interested in linkerd.

[Thanks to the Linux Foundation, LWN's travel sponsor, for assistance in getting to Berlin for CNC and KubeCon.]

Index entries for this article
GuestArticles: Yates, Tom
Conference: CloudNativeCon+KubeCon/2017



Connecting Kubernetes services with linkerd

Posted Apr 10, 2017 18:13 UTC (Mon) by ikm (subscriber, #493) [Link]

This got me excited, then I checked the source and found out it's still in Scala. Which is... "not for everyone", right. At least the part where one needs a JVM to run it. I was really hoping for something lighter-weight.

Connecting Kubernetes services with linkerd

Posted Apr 10, 2017 19:13 UTC (Mon) by flussence (subscriber, #85566) [Link]

If you can't justify spending a paltry few gigabytes of RAM to middle-manage your infrastructure, maybe you don't yet have the kind of colossal scaling problem that *needs* microservices?

Connecting Kubernetes services with linkerd

Posted Apr 10, 2017 19:33 UTC (Mon) by fratti (guest, #105722) [Link]

To be fair, the less you have to scale vertically, the less your horizontal scaling will cost. Though I don't think a 100 MB RAM JVM is going to be a big problem unless you plan on something like Toasters as a Service. GCs usually get efficient if the heap is about 6 times as big as it needs to be, and at that point you're still only at 600 MB RAM. At any rate, I doubt the bottleneck will be in linkerd.

Connecting Kubernetes services with linkerd

Posted Apr 10, 2017 20:03 UTC (Mon) by eternaleye (guest, #67051) [Link]

Well, linkerd-tcp is in Rust. Just recently got posted to /r/rust, in fact: https://www.reddit.com/r/rust/comments/62avcc/introducing...

HTTP and microservices

Posted Apr 10, 2017 18:56 UTC (Mon) by fratti (guest, #105722) [Link]

> One limitation of linkerd is that it only supports HTTP (and HTTPS) requests; it functions as a web proxy, and not every service is provided that way.

perrynfowler once stated in a talk that "If your 'microservices' make HTTP calls to each other you actually just have a distributed monolith."

I think the core takeaway from that statement is that you're very tempted to just split off arbitrary chunks of a service and make them communicate through synchronous calls over a somewhat general-purpose protocol. In that regard, only using HTTP is quite limiting. I'd even argue that only using TCP is still quite limiting, since some services which could benefit from being more distributed may already use UDP-based protocols for better efficiency; one example is torrent trackers. Especially in those cases, the "if you fail, retry" concept fits very well, since that's already the core idea behind UDP.

HTTP and microservices

Posted Apr 10, 2017 20:05 UTC (Mon) by eternaleye (guest, #67051) [Link]

Alternatively, one can use something that allows you to model your API very neatly, while offering more usable semantics, such as Cap'n Proto RPC. Note that Cap'n Proto RPC is asynchronous, though - all operations return promises.

HTTP and microservices

Posted Apr 10, 2017 21:39 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

But you have to get the result of a promise/future eventually. And sometimes you have to do it multiple times while your request traverses the stack, so you're back to requiring the centralized orchestration system to achieve everything that linkerd does.

HTTP and microservices

Posted Apr 12, 2017 1:16 UTC (Wed) by eternaleye (guest, #67051) [Link]

Oh, I'm not suggesting that Cap'n Proto makes a centralized orchestration service unnecessary, or anything of the sort.

Instead, I'm suggesting that using a richer protocol, one that handles asynchrony in a more cohesive manner, can make orchestration systems _more effective_.

Also, your response doesn't really match how Cap'n Proto promises/futures work; think of it more like "sending a request also creates a place to store the response once it is ready, and a way to realize when that happens".

"Getting the result multiple times" doesn't hit the network multiple times at all. Instead, you'd perform a request, the orchestration system instruments and forwards it appropriately, possibly interposing the reply capability, and it Just Works.

Worth noting is that interposition is an _explicitly_ supported behavior in Cap'n Proto, requires vastly less parsing/marshalling than HTTP, and avoids the fate-sharing issues of interposing TCP.

HTTP and microservices

Posted Apr 12, 2017 9:25 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

> Instead, I'm suggesting that using a richer protocol, one that handles asynchrony in a more cohesive manner, can make orchestration systems _more effective_.
Ok. I'll bite.

Cap'n Proto is not a good idea for distributed services, ever. It promotes stateful systems that can't be made robust in the face of network/service degradation. It doesn't deal well with intermittent disconnections and retries. It's only good when you need to implement a (possibly bidirectional) communication channel between two reliable endpoints.

> Also, your response doesn't really match how Cap'n Proto promises/futures work; think of it more like "sending a request also creates a place to store the response once it is ready, and a way to realize when that happens".
I implemented a part of the Cap'n Proto wire protocol... Cap'n Proto's capability system does not allow one to send a promise to a third party. It's possible in theory, but in practice it'll lead to pain, suffering and CORBA.

Services have to be made as stateless as possible. So you pretty much have to use classic request/reply protocols and HTTP is as good as any in this case.

> "Getting the result multiple times" doesn't hit the network multiple times at all.
It most definitely can happen. Imagine that service A calls service B that calls service C. Service C browns out and service B starts doing retries. That causes the call from A to B to error out and start doing retries. And pretty soon the whole system is locked up doing work that will never get used.

In case of Cap'n Proto errors will simply look like disconnection exceptions.

HTTP and microservices

Posted Apr 12, 2017 21:58 UTC (Wed) by kentonv (✭ supporter ✭, #92073) [Link]

> It promotes stateful systems that can't be made robust in the face of network/service degradation. It doesn't deal well with intermittent disconnections and retries.

No, Cap'n Proto lets you choose your trade-offs to fit the problem. If your problem is fundamentally stateless then go ahead and do request/response like you would with HTTP (but benefit from faster serialization and the fact that you can multiplex on a connection without head-of-line blocking).

If, on the other hand, you have a stateful problem, then in a stateless model you will end up building something that sucks. The typical naive approach is to have every request push back into a database, which means a sequence of state changes will be very slow as you're waiting for an fsync on every one. More involved approaches tend to involve some sort of caching, timeouts, etc. that make everything much more complicated and buggy.

Cap'n Proto lets you express: "Let's set up some state, do a few things to it, then push it back -- and if we have a network failure somewhere in the middle then the server can easily discard that state while the client starts over fresh."

It turns out that while, yes, networks are unstable, they're not nearly as unstable as we are designing for today. We're wasting a whole lot of IOPS and latency designing for networks that are one-9 reliable when what we have is more like five-9's.

Of course, "stateless" HTTP services don't magically mean you don't have to worry about network errors. You need to design all your network interactions to be idempotent, or you need to think about what to do if the connection drops between the request and response. Cap'n Proto is really no different, except that you can more easily batch multiple operations into one interaction.

> Cap'n Proto's capability system does not allow one to send a promise to a third party.

Actually, it does. The level 3 RPC protocol specifies how to forward capabilities (which may be promises) and also how to forward call results directly to a third party.

> It most definitely can happen. Imagine that service A calls service B that calls service C. Service C browns out and service B starts doing retries. That causes the call from A to B to error out and start doing retries. And pretty soon the whole system is locked up doing work that will never get used.

Generally you'll want to let the disconnect exception flow through to the initiator, retrying only at the "top level" rather than at every hop, to avoid storms.

HTTP and microservices

Posted Apr 13, 2017 1:09 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> No, Cap'n Proto lets you choose your trade-offs to fit the problem. If your problem is fundamentally stateless then go ahead and do request/response like you would with HTTP
Here we're discussing enterprise-scale orchestration systems and in this case stateful systems are pretty much a BadIdea(tm).

> but benefit from faster serialization and the fact that you can multiplex on a connection without head-of-line blocking
HTTP doesn't specify serialization and multiplexing is an anti-pattern for services (it makes sense for applications like browsers where connections are short-lived and every WAN roundtrip counts).

> If, on the other hand, you have a stateful problem, then in a stateless model you will end up building something that sucks. The typical naive approach is to have every request push back into a database, which means a sequence of state changes will be very slow as you're waiting for an fsync on every one.
A real application will get an idempotency token, check if the request has already been processed (for requests with side effects), write the idempotency token (possibly using it as a lock), proceed with changes, log them, unlock the idempotency token, and return the result to the client.

That's for a start, disregarding downstream calls and so on. Hence the first rule of distributed programming: "Don't".

> Cap'n Proto lets you express: "Let's set up some state, do a few things to it, then push it back -- and if we have a network failure somewhere in the middle then the server can easily discard that state while the client starts over fresh."
Cap'n Proto has no support in its protocol for retries and idempotency checks. Its only approach is to throw exceptions and hope that the application does everything else. It's no better than HTTP.

> It turns out that while, yes, networks are unstable, they're not nearly as unstable as we are designing for today. We're wasting a whole lot of IOPS and latency designing for networks that are one-9 reliable when what we have is more like five-9's.
Networks are reliable if you have two servers connected to the same switch. But with Twitter-scale systems "the network" is NOT reliable. You can not make any assumptions that your downstream service will be available without hiccups caused by deployments, route flaps, bugs, brown-outs, throttles and so on.

You HAVE to design with the assumption that your downstream services will randomly fail.

This is absolutely fundamental in large-scale systems. There's no way around it.

> Actually, it does. The level 3 RPC protocol specifies how to forward capabilities (which may be promises) and also how to forward call results directly to a third party.
There's no level 3 RPC for Cap'n Proto. In the current protocol promises are simple 32 bit references valid only within the context of a stream. So it won't scale as-is to multiple systems, as the entire state would have to be transferred.

The alternative is introduction of URL-like constructs that encode the endpoint and the context within it. But at this point you'll just reinvent HTTP and REST.

> Generally you'll want to let the disconnect exception flow through to the initiator, retrying only at the "top level" rather than at every hop, to avoid storms.
Then you have another source of vicious loops: service A does 200 calls to service B. Call 199 fails. Then the top-level system retries the whole request to service A again.

HTTP and microservices

Posted Apr 13, 2017 15:04 UTC (Thu) by kentonv (✭ supporter ✭, #92073) [Link]

> Here we're discussing enterprise-scale orchestration systems and in this case stateful systems are pretty much a BadIdea(tm).

No, you can't just generalize like that. Some use cases are stateful. For example, you can't implement real-time collaboration with stateless services in front of a standard database. You need a coordinator service for operational transforms.

I'm not sure we're talking about the same thing when you say "enterprise-scale orchestration system", but having written an orchestration system from scratch I'd say it's a pretty stateful problem. You can't start up a new container for every request, after all.

> HTTP doesn't specify serialization

It does for the headers. And transfer-encoding: chunked is pretty ugly, too.

> and multiplexing is an anti-pattern for services (it makes sense for applications like browsers where connections are short-lived and every WAN roundtrip counts).

The ability to do multiple independent requests in parallel is an anti-pattern?

> A real application will get an idempotency token, check if the request has already been processed (for requests with side-effects), write the idempotency token (possibly using it as a lock), proceed with changes, log them, unlock the idempotency token, return the request to the client.

That sounds over-engineered. For most apps you don't need two-phase commit on every operation.

But if you do, this goes back to my point: All those steps make the operation take a long time, and if you have to do it again for every subsequent operation, it's going to be very slow.

> Cap'n Proto has no support in its protocol for retries and idempotency checks. Its only approach is to throw exceptions and hope that the application does everything else.

Yes. That is the correct thing to do.

> You HAVE to design with the assumption that your downstream services will randomly fail.

Of course you do. I never said otherwise.

But do they fail 1/10 of the time or 1/10000 of the time? These call for different optimization trade-offs.

> There's no level 3 RPC for Cap'n Proto.

The protocol is defined but it's true that 3-party handoff hasn't been implemented yet. (Though it's based on CapTP, which has been implemented.)

> In the current protocol promises are simple 32 bit references valid only within the context of a stream. So it won't scale as-is to multiple systems, as the entire state would have to be transferred.

You mean the current implementation. I don't know what you mean by "the entire state would have to be transferred", but currently in three-party interactions there tends to be proxying.

> Then you have another source of vicious loops: service A does 200 calls to service B. Call 199 fails. Then the top-level system retries the whole request to service A again.

Sure, you should use good judgment in deciding where to retry. This is why the infrastructure can't do it automatically -- it's almost never the right place to retry. Retrying in your network library is just another version of trying to hide network unreliability from apps.

HTTP and microservices

Posted Apr 13, 2017 21:09 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

> No, you can't just generalize like that. Some use cases are stateful. For example, you can't implement real-time collaboration with stateless services in front of a standard database. You need a coordinator service for operational transforms.
Most modern enterprise systems are mostly-stateless - a server typically retrieves state required for a user request from some kind of a storage/cache subsystem for every request. And even storage subsystems themselves are usually "stateless" - they don't keep long-lived sessions with clients.

> The ability to do multiple independent requests in parallel is an anti-pattern?
Correct. Multiple unrelated requests inside one TCP stream is a bad idea in general - it defeats the OS-level flow-control logic, may have problems with head-of-line blocking, and has other issues. It makes sense when you want to avoid the overhead of additional round trips for TCP's three-way handshake.

> That sounds over-engineered. For most apps you don't need two-phase commit on every operation.
Nope. It's pretty much a required workflow if you need to involve multiple services.

> But if you do, this goes back to my point: All those steps make the operation take a long time, and if you have to do it again for every subsequent operation, it's going to be very slow.
Only for mutating operations, though.

> Of course you do. I never said otherwise.
> But do they fail 1/10 of the time or 1/10000 of the time? These call for different optimization trade-offs.
You have to design for 1/10 failure rate (at least!) if you want your service to be resilient.

> You mean the current implementation. I don't know what you mean by "the entire state would have to be transferred", but currently in three-party interactions there tends to be proxying.
This means that you're reimplementing the highly stateful ORB from CORBA. History never teaches people...

And no, linkerd is not stateful. It does not have to track the content of passed data, only the overall streams.

> Sure, you should use good judgment in deciding where to retry. This is why the infrastructure can't do it automatically -- it's almost never the right place to retry.
And how do you decide that you should stop doing retries because the overall global call rate is spiking?

And these issues are not theoretical. For a real-world example of a retry-driven vicious loop you can read this: https://aws.amazon.com/message/5467D2/

HTTP and microservices

Posted Apr 13, 2017 21:33 UTC (Thu) by kentonv (✭ supporter ✭, #92073) [Link]

Seems our debate has been reduced to "nuh-uh" vs. "uh-huh", with both of us presuming ourselves to be more knowledgeable/authoritative than the other.

HTTP and microservices

Posted Apr 14, 2017 17:06 UTC (Fri) by lulf (subscriber, #96369) [Link]

Alternatively use a standards based messaging protocol like AMQP:

http://www.amqp.org/
http://qpid.apache.org/

HTTP and microservices

Posted Apr 12, 2017 3:22 UTC (Wed) by ssmith32 (subscriber, #72404) [Link]

I watched a little bit of the video - since the link started in the middle.
And how does he justify this statement? Particularly with respect to the web?

And given that he basically pitches the Enterprise Service Bus as the alternative model, I'm not sure why anyone should care? We should run the Web as a distributed app on top of some global queue. The world wide Rabbit?

Every place I've been at that went for tightly coupled services joined over something like Thrift, etc.... let's just say there are areas of regret. It is needed in some places, but HTTP and trying to be RESTful (I haven't seen anyone succeed, but trying helps) is a vastly better place to start than tightly coupled protocols and IDLs.

It is by no means perfect, but the Web is an amazingly successful distributed system, and you need a little more than a diagram with little poop emojis to present a credible argument against following in its footsteps.

HTTP and microservices

Posted Apr 12, 2017 9:29 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

His analogy is a bit odd, but he's pinpointed a real problem - a distributed system can very well exhibit vicious loops caused by its own resiliency algorithms. Some sort of external, overarching overseer system that can suppress them is really needed.

It has no real bearing on RESTfulness vs. classic RPC. I've seen retry-driven meltdowns in very REST-ful share-nothing microservice-based systems.

HTTP and microservices

Posted Apr 13, 2017 6:46 UTC (Thu) by ssmith32 (subscriber, #72404) [Link]

Ah ok, I watched the video from the point linked to, and until my next stop on the train. I guess if you're worried about some vicious loops... maybe?

I've never really seen *vicious* loops in prod systems. And the loops I've seen have certainly never been worth tying your architecture to something like a service bus. Just dumb mistakes. I have seen a vicious loop in a subsystem of a generally restfulish system: a daemon pulled a corrupted message off Rabbit, failed to route it (because it was corrupted), then some lame Spring lib used in the component just put the message back, to be "transactional", and pulled the same message back off again... Yeah.

To top it off, this is in a system where we can lose a message here and there and it's fine.

The reason given for the logic was "well then the Rabbit queue backs up until it triggers an alert, and someone has to manually delete the message, so it's a *good* design, because it alerts us to corrupted messages!". Instead of, I dunno, just alerting on corrupted messages??

Little poop emojis, and vicious cycles, that I've seen, are not solved by some grand design. They're solved by limiting the poop in your implementation of a design.

Thanks for the clarification, though, at least I get the logic, even if I disagree.

Connecting Kubernetes services with linkerd

Posted Apr 10, 2017 19:12 UTC (Mon) by federico3 (guest, #101963) [Link]

> service discovery, load balancing, encryption, tracing and logging, handling retries, expiration and timeouts, back-offs, dynamic routing, and metrics

Replacing a big monolithic application with microservices and a big, complex, monolithic load balancer?

Perhaps those functions should be implemented by independent, modular libraries and daemons.

Connecting Kubernetes services with linkerd

Posted Apr 10, 2017 21:50 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

How exactly? My $BIGCORP employer has exactly the same problem - a simple user-initiated request has to wind its way through many services (that I won't really call 'micro' in many cases) and the overall emergent behavior is often surprising.

Some of it can be mitigated in a decentralized way by over-provisioning and by making sure that retry policies are not provoking vicious loops (when most services are stuck retrying requests that have long since timed out on an upper level). But that can only be done to a certain degree.

My only personal objection, coming from high-traffic service experience, is that this service should be designed as a metadata side-channel and leave the actual HTTP (or whatever) requests to individual services.

Connecting Kubernetes services with linkerd

Posted Apr 10, 2017 22:15 UTC (Mon) by nix (subscriber, #2304) [Link]

That's exactly what this is for -- allowing ways for lots of daemons to talk to each other while imposing policies on all of them that are inapplicable to single daemons alone (like maximum retry counts and global timeouts), and not requiring them all to replicate all the communications infrastructure.

So they're called 'microservices', but that's just a wizzy name for a small daemon that talks over the network, really.

Connecting Kubernetes services with linkerd

Posted Apr 11, 2017 4:04 UTC (Tue) by drag (guest, #31333) [Link]

> So they're called 'microservices', but that's just a wizzy name for a small daemon that talks over the network, really.

That's definitely all it really is.

It's the classic 'thread' vs 'fork' approach. A single process with lots of threads and shared memory and all that jazz versus a bunch of relatively small processes trying to do the same job. Microservices is just forking processes over a bunch of systems and using TCP/IP for IPC. Same limitations and problems apply as before as well as the challenge of coordinating this stuff over multiple Linux instances.

Connecting Kubernetes services with linkerd

Posted Apr 11, 2017 5:33 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

Well, if all your services are using the same framework then you can do global timeouts by propagating "request deadline timestamp" and/or time budget to your downstream services and checking that there's still some time left in the upstream services when they get back the results.

More advanced stuff like intelligent request distribution and global throttling is not really possible.

