Jekyll2021-08-05T18:57:16+00:00http://robharrop.github.io/feed.xmlRob HarropRob Harrop, CTO @ Skipjaq. Writing about coding, maths, distributed systems, performance optimisation and machine learning
Inside Queue Models: Markov Chains2016-04-27T15:00:00+00:002016-04-27T15:00:00+00:00http://robharrop.github.io/maths/performance/2016/04/27/queue-models-markov-chains<p>So far in this series on queueing theory, we’ve seen <a href="/maths/performance/2016/02/20/service-latency-and-utilisation.html">single server queues</a>,
<a href="/maths/performance/2016/02/27/finite-queue-latency.html">bounded queues</a>, <a href="/maths/performance/2016/03/07/multi-server-queues.html">multi-server queues</a>, and most recently <a href="/maths/performance/2016/03/15/queue-networks.html">queue
networks</a>. A fascinating result from queueing theory is that wait time
degrades significantly as utilisation tends towards 100%. We saw that \(M/M/c\)
queues, which are unbounded, have degenerate behaviour under heavy load when
utilisation hits dangerous levels.</p>
<p>Perhaps more interestingly, we saw that bounded \(M/M/c/k\) queues limit
customer ingress to prevent this undesirable behaviour; in essence we learned
that we can reject some customers to ensure good service for those who <em>do</em> make
it into the system. Those customers who get rejected might not be entirely
happy with this arrangement, so the question is: can we do better?</p>
<p>Over the course of the next two entries, I want to dig deeper into the internals
of queueing models so that we can explore sophisticated ways to better capture
how modern systems behave. The ultimate goal of this series is to learn how we
can build models of <a href="http://www.reactivemanifesto.org/">reactive systems</a>, paying particular attention to how we
model <strong>back pressure</strong> in those systems.</p>
<p>A reactive system, using back pressure, signals to its source of traffic when
it’s ready for more customers. At first glance, this might sound incredibly
impractical, but in practice back pressure works for any system that is
completely in control of both the traffic source and the processing service.</p>
<p>For a typical service-based architecture, back pressure is the perfect mechanism
for regulating traffic flow between services. One can imagine that as services
<em>pull</em> more traffic in from their upstream provider, this <em>pull</em> propagates
towards the outside of the system until finally it hits the boundary where
traffic is coming from the outside world. Even here, back pressure can be
applied to some level. Both TCP and HTTP traffic can be limited to a certain
degree with back-pressure. Eventually though, back pressure will no longer
suffice to limit resource usage and we’ll need to start dropping customers.</p>
<p>In practice then, a reactive system bounds all in-flight processing, but uses
back pressure to regulate the amount of in-flight work, and thus, reduce the
number of cases where work must be rejected. We can model queues with
back-pressure by replacing the Poisson arrival process used by all \(M/*/*\)
queues with something more sophisticated.</p>
<p>Before we look at other arrival processes though, we should first ensure that we
really understand how a simple queue, like an \(M/M/1\) queue, functions.
In particular, we are interested in analysing our queue as a particular type of
<a href="https://en.wikipedia.org/wiki/Continuous-time_Markov_chain">continuous time Markov chain</a> called a <a href="https://en.wikipedia.org/wiki/Birth%E2%80%93death_process">birth-death process</a>.</p>
<h2 id="a-quick-recap">A Quick Recap</h2>
<p>Before we proceed, let’s remind ourselves of the basics of \(M/M/c\) queue
models. Arrivals into the queue are modelled as a Poisson process where the
arrival rate is designated \(\lambda\). Service times have rate \(\mu\) and
are exponentially-distributed with mean service time of \(1 / \mu\).</p>
<p>The ratio of arrival to service completion \(\lambda / \mu\) is denoted
\(\rho\). For unbounded \(M/M/c\) queues, \(\rho < c\) ensures that the queue is
stable; if \(\rho \geq c\), then both queue size and latency tend towards
infinity.</p>
<h2 id="markov-chains-in-two-minutes">Markov Chains in Two Minutes</h2>
<p>A Markov chain is a random process described by states and the transitions
between those states. Transitions between states are probabilistic and exhibit a
property called <em>memorylessness</em>. The memorylessness property ensures that the
probability distribution for the next state depends only on the current state.
Put another way, the history of a Markov process is unimportant when considering
what the next transition will be.</p>
<p><img src="/assets/markov-chains/dtmc.png" alt="Simple Markov chain example" /></p>
<p>The diagram above shows a simple Markov chain with three states: <em>in bed</em>, <em>at
the gym</em> and <em>at work</em>. The transitions between each state to the next state are
labelled with the respective probabilities. For example, the probability of
going from <em>in bed</em> to <em>at work</em> is 30%. Note also, that the probability of
remaining <em>in bed</em> is 20%; there’s no requirement that we actually leave the
current state.</p>
<p>We can represent these transition probabilities using a transition probability
matrix \(P\):</p>
\[\begin{bmatrix}
0.2 & 0.5 & 0.3 \\
0.1 & 0.2 & 0.7 \\
0.4 & 0.1 & 0.5 \\
\end{bmatrix}\]
<p>The probability of moving from state \(i\) to state \(j\) is given by
\(P_{ij}\). Each row in the matrix must sum to \(1\), indicating that the
probability of doing <em>something</em> when in a given state is always \(100\%\).</p>
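<p>To make this concrete, here is a minimal Python sketch (not from the original post; the state ordering and starting state are illustrative assumptions) that repeatedly applies the transition matrix above to find the chain’s long-run distribution:</p>

```python
# Sketch: iterate the transition probability matrix P from the example
# above to find the long-run distribution of the discrete-time chain.
# States: 0 = in bed, 1 = at the gym, 2 = at work (assumed ordering).

P = [
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
    [0.4, 0.1, 0.5],
]

# Every row must sum to 1: we always do *something* from each state.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)

def step(dist, P):
    """One step of the chain: new_j = sum_i dist_i * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

dist = [1.0, 0.0, 0.0]     # start in bed with certainty
for _ in range(1000):      # iterate until the distribution settles
    dist = step(dist, P)

print(dist)  # long-run fraction of time in each state
```

<p>Because this chain can reach every state from every other state, the iteration converges to the same stationary distribution whatever the starting state.</p>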
<p>This kind of Markov chain is called a <em>discrete-time Markov chain</em> (DTMC), where
the time parameter is discrete and the state changes randomly between each
discrete step in the process. The models we’ve seen so far have a continuous
time parameter resulting in <em>continuous-time Markov chains</em> (CTMC).</p>
<p>We can recast our discrete-time process as a continuous-time process. We use a
slightly different representation for our continuous-time chains. Rather than
modelling the transition probabilities, we model the <em>transition rates</em>:</p>
<p><img src="/assets/markov-chains/ctmc.png" alt="Continuous-time Markov chain example" /></p>
<p>Note that we omit rates for staying in the same state: it makes little sense to
talk about the rate at which a process remains stationary. Just as we used a
transition probability matrix for the discrete-time chain, we use a transition
rate matrix \(Q\) for the continuous-time chain:</p>
\[\begin{bmatrix}
-0.8 & 0.5 & 0.3 \\
0.1 & -0.8 & 0.7 \\
0.4 & 0.1 & -0.5 \\
\end{bmatrix}\]
<p>Here, \(Q_{ij}\) is the <em>rate</em> of transition from state \(i\) to state \(j\).
The diagonals (\(Q_{ii}\)) are constructed so that each row sums to \(0\), unlike
the diagonals of the transition probability matrix, which ensure that each row
sums to \(1\). The diagram and the matrix show that our continuous-time chain
moves from the <em>in bed</em> state to the <em>at the gym</em> state (\(Q_{01}\)) with rate
\(0.5\).</p>
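<p>We can sanity-check a continuous-time chain by simulating it: hold in state \(i\) for an exponential time with rate \(-Q_{ii}\), then jump to \(j\) with probability \(Q_{ij} / (-Q_{ii})\). An illustrative Python sketch, using the example’s rates with the diagonals placed so that each row sums to zero:</p>

```python
import random

# Sketch: simulate the continuous-time chain from a transition rate
# matrix. From state i, -Q[i][i] is the total rate of leaving, and the
# next state is j with probability Q[i][j] / -Q[i][i].

Q = [
    [-0.8, 0.5, 0.3],
    [0.1, -0.8, 0.7],
    [0.4, 0.1, -0.5],
]

def simulate(Q, t_end, seed=42):
    rng = random.Random(seed)
    state, t = 0, 0.0
    time_in_state = [0.0] * len(Q)
    while t < t_end:
        out_rate = -Q[state][state]       # total rate of leaving state
        hold = rng.expovariate(out_rate)  # exponential holding time
        time_in_state[state] += hold
        t += hold
        # pick the next state with probability Q[state][j] / out_rate
        targets = [(j, rate) for j, rate in enumerate(Q[state]) if j != state]
        r = rng.random() * out_rate
        state = targets[-1][0]            # fallback for fp edge cases
        for j, rate in targets:
            r -= rate
            if r <= 0:
                state = j
                break
    return [x / t for x in time_in_state]

print(simulate(Q, t_end=100_000))  # long-run fraction of time per state
```

<p>Over a long run, the fraction of time spent in each state settles to the chain’s stationary distribution.</p>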
<h2 id="poisson-processes">Poisson Processes</h2>
<p>Now that we understand how to construct continuous-time Markov chains, we can
explore Markovian queues in more detail. Recall that for an \(M/M/c\) queue, both
the arrivals and service processes are Poisson processes, that is, stochastic
processes with Poisson-distributed counts.</p>
<p>We can model a Poisson process, and thus the arrivals and service processes, as
a CTMC where each state in the chain corresponds to a given population size.
Consider the arrivals process in an \(M/M/1\) queue. We know that arrivals are a
Poisson process with rate \(\lambda\). At the start of the process, there have
been no arrivals. The first arrival occurs with rate \(\lambda\), as does the
second, the third and so on for as long as the process continues. We can model
this as a Markov chain where the states correspond to the arrivals count:</p>
<p><img src="/assets/markov-chains/birth-death.png" alt="Markov chain of birth-death process" /></p>
<p>When we translate this into a transition rate matrix we get:</p>
\[\begin{bmatrix}
-\lambda & \lambda & 0 & 0 \\
0 & -\lambda & \lambda & 0 \\
0 & 0 & -\lambda & \lambda \\
& & & & \ddots \\
\end{bmatrix}\]
<p>This matrix continues unbounded since the number of arrivals is effectively
unbounded.</p>
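<p>A Poisson process is easy to simulate: the gaps between arrivals are independent exponentials with rate \(\lambda\). This illustrative Python sketch (the rate and horizon are arbitrary choices) checks that the empirical arrival rate matches \(\lambda\):</p>

```python
import random

# Sketch: a Poisson arrival process with rate lam is a counter that steps
# up after Exponential(rate=lam) inter-arrival gaps. Counting arrivals in
# a window of length T should give roughly lam * T.

def poisson_arrivals(lam, t_end, seed=1):
    rng = random.Random(seed)
    t, count = 0.0, 0
    while True:
        t += rng.expovariate(lam)   # next inter-arrival gap
        if t >= t_end:
            return count
        count += 1

lam = 3.0
n = poisson_arrivals(lam, t_end=10_000)
print(n / 10_000)   # estimated arrival rate, close to lam
```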
<h2 id="birth-death-processes">Birth-Death Processes</h2>
<p>An \(M/M/c\) queue is composed of two Poisson processes working in tandem: the
arrivals process and the service process. As we saw, each of these processes can
be described by a Markov chain. We can go further and describe the queue as a
whole using a special kind of Markov chain process called a <strong>birth-death
process</strong>. Birth-death processes are processes where the states represent the
population count and transitions correspond to either <strong>births</strong>, which
increment the population count by one, or <strong>deaths</strong> which decrease the
population count by one. Note that Poisson processes are themselves birth-death
processes, just with zero deaths.</p>
<p>The birth-death diagram above, applied to an \(M/M/1\) queue, has arrival rate
\(\lambda\) and service rate \(\mu\). As you can see, the population state
increases as customers arrive at the queue and decreases as customers are
served. We can translate this simple diagram into a transition rate matrix for
the queue:</p>
\[\begin{bmatrix}
-\lambda & \lambda & 0 & 0 \\
\mu & -(\mu + \lambda) & \lambda & 0 \\
0 & \mu & -(\mu + \lambda) & \lambda \\
& & & & \ddots \\
\end{bmatrix}\]
<p>When the process starts, the only possible transition is from zero customers to
one with rate \(\lambda\) (\(Q_{01} = \lambda\)). After this, at each state, the
process can transition to having one more customer, again at rate \(\lambda\) or
to having one fewer customer with rate \(\mu\).</p>
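<p>The same transition rates give a direct simulation of the \(M/M/1\) birth-death chain. This illustrative Python sketch (the rates are arbitrary choices) estimates the time-averaged population, which for a stable queue approaches \(\rho / (1 - \rho)\), the mean occupancy we derive later in this post:</p>

```python
import random

# Sketch: simulate the birth-death chain of an M/M/1 queue directly from
# its transition rates: births at rate lam, deaths at rate mu (only when
# the population is positive).

def simulate_mm1(lam, mu, t_end, seed=7):
    rng = random.Random(seed)
    n, t, area = 0, 0.0, 0.0      # area accumulates n * dt
    while t < t_end:
        rate = lam + (mu if n > 0 else 0.0)
        dt = rng.expovariate(rate)
        area += n * dt
        t += dt
        # birth with probability lam / rate, otherwise a death
        if rng.random() < lam / rate:
            n += 1
        else:
            n -= 1
    return area / t               # time-averaged number in system

lam, mu = 1.0, 2.0                # rho = 0.5, so occupancy should be ~1.0
print(simulate_mm1(lam, mu, t_end=200_000))
```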
<h2 id="steady-state-probabilities">Steady-State Probabilities</h2>
<p>With the transition rate matrix in hand, we can calculate the steady-state
probabilities \(p_k\) for the \(M/M/1\) queue. Recall that the steady-state
probabilities \(p_k\) tell us the probability of the queue being in state \(k\),
that is the probability of having \(k\) customers in the system. More formally:</p>
\[p_k = \lim_{t \to \infty} P_k(t)\]
<p>Where \(P_k(t)\) is the probability of having \(k\) customers in the system at
time \(t\). Note that the steady-state probabilities are time-independent and,
as the name implies, steady. More precisely, we expect that:</p>
\[\lim_{t \to \infty} P'_{k}(t) = 0\]
<p>That is, we expect the rate of change of the probabilities to be zero in the
limit. Let’s think about \(P'_k(t)\) for a while. The transition rate matrix
tells us how the process flows between states. We can see that each state \(k\)
can be entered from states \(k-1\) and state \(k+1\). Entry from state \(k-1\)
corresponds to a customer arriving in the system and has rate \(\lambda\). Entry
from state \(k+1\) corresponds to a customer completing service and leaving the
system with rate \(\mu\).</p>
<p>Each state \(k\) can also exit to states \(k-1\) and \(k+1\) as customers are
served (with rate \(\mu\)) and arrive (with rate \(\lambda\)). This gives us:</p>
\[P'_k(t) = \lambda P_{k-1}(t) + \mu P_{k+1}(t) - \lambda P_k(t) - \mu P_k(t)\]
<p>Using our limit condition \(\lim_{t \to \infty} P'_{k}(t) = 0\) we find these
steady-state flow equations:</p>
\[\begin{align}
0 &= \mu p_1 - \lambda p_0 \\
0 &= \lambda p_{k-1} + \mu p_{k+1} - \lambda p_k - \mu p_k
\end{align}\]
<p>Solving this recurrence relation with dependence on \(p_0\) gives us:</p>
\[p_k = \Big( \frac{\lambda}{\mu} \Big)^k p_0\]
<p>Since we know that all probabilities must sum to \(1\) we can derive \(p_0\):</p>
\[\begin{align}
1 &= p_0 + \sum_{k=1}^{\infty} p_k \\
1 &= p_0 + \sum_{k=1}^{\infty} \Big( \frac{\lambda}{\mu} \Big)^k p_0 \\
1 &= p_0 \Bigg(\sum_{k=0}^{\infty} \Big( \frac{\lambda}{\mu} \Big)^k \Bigg) \\
1 &= p_0 \frac{1}{1 - \frac{\lambda}{\mu}} \\
p_0 &= 1 - \frac{\lambda}{\mu} \\
p_0 &= 1 - \rho \\
\end{align}\]
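<p>We can check numerically that \(p_k = (1 - \rho)\rho^k\) satisfies the steady-state flow equations and sums to \(1\). A small Python sketch (the rates are arbitrary choices):</p>

```python
# Sketch: verify that p_k = (1 - rho) * rho**k satisfies the steady-state
# flow equations for an M/M/1 queue and (nearly) sums to 1 when truncated.

lam, mu = 2.0, 5.0
rho = lam / mu

p = [(1 - rho) * rho**k for k in range(200)]

# Boundary equation: 0 = mu * p_1 - lam * p_0
assert abs(mu * p[1] - lam * p[0]) < 1e-12

# Interior equations: 0 = lam*p_{k-1} + mu*p_{k+1} - (lam + mu)*p_k
for k in range(1, 100):
    assert abs(lam * p[k-1] + mu * p[k+1] - (lam + mu) * p[k]) < 1e-12

print(sum(p))   # ~1: the truncated tail is negligible for rho = 0.4
```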
<h2 id="coming-full-circle">Coming full circle</h2>
<p>You might recall that, in my first post in this series, I mentioned that the
equation for the mean number of customers in an \(M/M/1\) queue follows from
the steady-state probabilities. Let’s see how that works. The mean number
of customers \(L\) for an \(M/M/1\) queue is:</p>
\[L = \frac{\rho}{1 - \rho}\]
<p>To get here from the steady-state probabilities let’s start by simply defining
\(L\) in terms of \(p_k\):</p>
\[L = \sum_{k = 0}^{\infty}k \cdot p_k\]
<p>We’re saying that the mean number of customers is simply the sum of each
possible value weighted by its probability. Let’s expand on this:</p>
\[\begin{align}
L &= \sum_{k = 0}^{\infty}k \cdot \Big( \frac{\lambda}{\mu} \Big)^k p_0 \\
L &= \sum_{k = 0}^{\infty}k \cdot \rho^k p_0 \\
L &= \sum_{k = 0}^{\infty}k \cdot \rho^k \cdot (1 - \rho) \\
L &= (1 - \rho) \cdot \sum_{k = 0}^{\infty}k \rho^k \\
\end{align}\]
<p>We know that \(M/M/1\) queues have divergent behaviour if \(\rho \geq 1\), and
indeed the series \(\sum_{k = 0}^{\infty}k \rho^k\) only converges
for \(\rho < 1\). So, assuming we have \(\rho < 1\)
(otherwise \(L\) is undefined):</p>
\[\begin{align}
L &= (1 - \rho) \frac{\rho}{(\rho - 1)^2} \\
L &= \frac{\rho - \rho^2}{(\rho - 1)^2} \\
L &= \frac{-\rho(\rho - 1)}{(\rho - 1)^2} \\
L &= \frac{-\rho}{\rho - 1} \\
L &= \frac{\rho}{1 - \rho} \\
\end{align}\]
<p>And thus we arrive at the definition for \(L\), the mean customers in the
queue for \(M/M/1\) queues.</p>
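<p>As a quick numerical cross-check of the derivation, this illustrative Python snippet compares the truncated defining sum against the closed form (the value of \(\rho\) is an arbitrary choice):</p>

```python
# Sketch: cross-check the closed form L = rho / (1 - rho) against the
# defining sum L = sum_k k * p_k, truncated where terms become negligible.

rho = 0.8
p0 = 1 - rho

L_sum = sum(k * rho**k * p0 for k in range(10_000))
L_closed = rho / (1 - rho)

print(L_sum, L_closed)   # both ~4.0 for rho = 0.8
```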
<h2 id="whats-next">What’s next?</h2>
<p>With an understanding of how Markov chains are used to construct queue models,
we can start looking at some more complex models. In particular, the next post
in this series will introduce Markov-modulated Arrival Processes (MMAP). An MMAP
composes two or more Markov arrival processes and switches between them. The
switching is itself modelled as a Markov chain. MMAPs are a great way of
creating a rudimentary model of how back-pressure works.</p>Queue Networks2016-03-15T07:00:00+00:002016-03-15T07:00:00+00:00http://robharrop.github.io/maths/performance/2016/03/15/queue-networks<p>In my <a href="/maths/performance/2016/03/07/multi-server-queues.html">previous post</a> I presented the \(M/M/c\) queue as a model for
multi-server architectures. As discussed at the end of that post, the
\(M/M/c\) model has two main drawbacks: each server must have the same
service rate, and there’s no mechanism for modelling the overhead of
routing between servers. Modelling a multi-server system using a single
queue - even a queue with multiple servers - ignores important real-world
system characteristics. In this post, I’ll explain how we can arrange
queues into networks that capture the cost of routing and allow for
servers with different service rates.</p>
<h2 id="open-jackson-networks">Open Jackson Networks</h2>
<p>We’re going to concern ourselves with a particular class of queue network
called <strong>open Jackson networks</strong>. The ‘open’ in the name refers to the
fact that customers arrive from outside the system much like the queues
we’ve seen so far. In a closed Jackson network, there are no arrivals from
the outside and customers never leave the system; in other words the
amount of work in the system is constant.</p>
<p>The most interesting characteristic of Jackson networks is that they have
a product form solution for the steady-state distribution. This is
a rather grand way of saying that we can calculate the steady-state
distribution of the network by treating each queue as if it were operating
independently. For Jackson’s theorem to apply, all routing between queues
in the network must be <em>Markovian</em>, that is routing of customers between
nodes in the network is <em>probabilistic</em>.</p>
<p>At first glance, the requirement to route probabilistically might seem
rather restrictive, but in reality it merely requires a small change
in mindset. If our real world system routes traffic between \(n\) servers
in a round-robin fashion, then our model can route between \(n\) queues
with probability \(1/n\).</p>
<h2 id="modelling-load-balanced-servers">Modelling Load Balanced Servers</h2>
<p>To get a better understanding for Jackson networks, let’s consider
a concrete example of two servers operating behind a load balancer:</p>
<p><img src="/assets/queue-networks/network.png" alt="Simple queue network" /></p>
<p>Here you can see that traffic arrives at the load balancer (\(s_{1}\))
from the outside with rate \(\lambda = 500\) and is then routed between
each of the servers (\(s_{2}\) and \(s_{3}\)) with probability \(1/2\).</p>
<p>The probability that a customer leaves queue \(i\) and enters queue \(j\)
is \(p_{ij}\). We use the index \(0\) to represent the outside world, so
\(p_{0j}\) is the probability that a job enters queue \(j\) from the
outside world and \(p_{j0}\) is the probability that a job leaves queue
\(j\) for the outside world.</p>
<p>We can represent these routing probabilities as a matrix:</p>
\[\begin{bmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1/2 & 1/2 \\
1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 \\
\end{bmatrix}\]
<p>We see from the matrix that \(p_{01}\), the probability that a job enters
queue \(1\) from outside the system is \(1\) and the probability that jobs move
from queue \(1\) to either queue \(2\) or queue \(3\) is \(1/2\).</p>
<p>Jackson’s theorem tells us that, provided we have Markovian routing, and
that each queue has its own well-defined steady state, then the whole
network has a well-defined steady-state distribution. Furthermore, the
product form rule tells us the network’s steady-state distribution:</p>
\[p(\mathbf{n}) = \prod_{i=1}^{J} p_{i}(n_{i})\]
<p>The important point here is that each queue must have a well-defined
steady state. If not, then the product form rule does not apply. So then,
how do we determine if each queue in a network has a well-defined steady
state? For that we need to calculate the flow balance equations.</p>
<h3 id="flow-balance">Flow Balance</h3>
<p>The flow balance equations for a network with \(J\) queues is a set of
\(J\) equations that we can solve to find the effective arrival rate
\(\lambda_{i}\) at each queue \(i\).</p>
<p>Looking back at our matrix of routing probabilities, it should be clear
that any queue can receive customers from the outside world, but also that
customers can flow in cycles through the network. Nothing in the Jackson
model requires that the network is acyclic. Thus the effective arrival
rate for each queue must account for arrivals from outside and for
arrivals from all other queues within the network.</p>
<p>More formally, the flow balance equations for a Jackson network with \(J\)
nodes are:</p>
\[\lambda_j = \lambda_{0j} + \sum_{i=1}^{J} \lambda_i \cdot p_{ij}\]
<p>Working through this we see that the effective arrival rate at each queue
\(j\) is \(\lambda_j\), the sum of all arrivals from other queues,
adjusted by the corresponding routing probability, plus the arrivals from
outside the system.</p>
<p>Our sample network is a special case: a feed forward network. In
a feed forward network, the network must be acyclic and customers cannot
appear in the same queue more than once. Calculating flow balance for such
networks is greatly simplified as we can see by working through the flow
balance for each queue:</p>
\[\begin{align}
\lambda_1 &= \lambda \\
\lambda_2 &= 1/2 \lambda \\
\lambda_3 &= 1/2 \lambda
\end{align}\]
<p>Recall from <a href="/maths/performance/2016/02/20/service-latency-and-utilisation.html">the discussion of \(M/M/1\) queues</a> that the stability
condition for \(M/M/1\) is \(\rho = \lambda / \mu < 1\). If this
condition holds for each of our queues, then we know that our network has
a steady-state distribution given by Jackson’s theorem. Using the service
rates from the diagram above we can calculate \(\rho\) for each of our
queues:</p>
\[\begin{align}
\rho_1 &= \lambda_1 / \mu_1 = 500 / 800 = 0.625 \\
\rho_2 &= \lambda_2 / \mu_2 = 250 / 500 = 0.5 \\
\rho_3 &= \lambda_3 / \mu_3 = 250 / 500 = 0.5
\end{align}\]
<p>Since \(\rho < 1\) for all our queues we know that each queue is stable and
thus the network is stable.</p>
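<p>For feed-forward networks, the flow balance equations can be solved by simply propagating rates through the routing matrix. A Python sketch of the example network (rates and probabilities taken from the diagram; index \(0\) is the outside world):</p>

```python
# Sketch: flow balance for the load-balancer example. Because the network
# is feed-forward we can propagate rates through the routing matrix in
# queue order; index 0 is the outside world, queues are 1..3.

lam = 500.0                       # external arrivals, all into queue 1
mu = [800.0, 500.0, 500.0]        # service rates for s1, s2, s3

# routing probabilities p[i][j] from the matrix above
p = [
    [0, 1, 0, 0],      # outside -> queue 1
    [0, 0, 0.5, 0.5],  # queue 1 -> queues 2 and 3
    [1, 0, 0, 0],      # queue 2 -> outside
    [1, 0, 0, 0],      # queue 3 -> outside
]

# lambda_j = lam * p[0][j] + sum_i lambda_i * p[i][j]
lams = [0.0] * 4
for j in range(1, 4):
    lams[j] = lam * p[0][j] + sum(lams[i] * p[i][j] for i in range(1, 4))

rhos = [lams[j] / mu[j - 1] for j in range(1, 4)]
print(lams[1:], rhos)   # [500, 250, 250] and [0.625, 0.5, 0.5]

assert all(r < 1 for r in rhos)   # every queue stable => network stable
```

<p>Propagating in queue order works here only because the network is acyclic; a cyclic network would require solving the equations as a linear system.</p>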
<h3 id="steady-state-probabilities">Steady-State Probabilities</h3>
<p>With the knowledge that our network has a well-defined steady state, we can
apply Jackson’s theorem to calculate the steady-state probabilities for our
network.</p>
<p>The steady-state probability for an \(M/M/1\) queue is \(p(n) = (1 - \rho)
\rho^n\). Applying the product-form rule for Jackson networks we get:</p>
\[\begin{align}
p(\mathbf{n}) &= (1 - \rho_1) (1 - \rho_2) (1 - \rho_3) \rho_1^{n_1} \rho_2^{n_2} \rho_3^{n_3} \\
&= 0.375 \cdot 0.5 \cdot 0.5 \cdot 0.625^{n_1} \cdot 0.5^{n_2} \cdot 0.5^{n_3} \\
&= 0.09375 \cdot 0.625^{n_1} \cdot 0.5^{n_2} \cdot 0.5^{n_3} \\
\end{align}\]
<p>Let’s now calculate the probability that we have two customers at each of
the queues, that is, let’s calculate \(p(\langle 2, 2, 2 \rangle)\):</p>
\[\begin{align}
p(\langle 2, 2, 2 \rangle) &= 0.09375 \cdot 0.625^{2} \cdot 0.5^{2} \cdot 0.5^{2} \\
&\approx 0.0022888
\end{align}\]
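<p>The product-form calculation is one term per queue. A small Python sketch using the utilisations computed earlier for the example network:</p>

```python
# Sketch: evaluate the Jackson product form for the example network,
# treating each queue as an independent M/M/1 queue.

rhos = [0.625, 0.5, 0.5]   # utilisations from the flow balance above

def p_network(ns, rhos):
    """Product form: the product of M/M/1 terms (1 - rho) * rho**n."""
    prob = 1.0
    for n, rho in zip(ns, rhos):
        prob *= (1 - rho) * rho**n
    return prob

print(p_network([2, 2, 2], rhos))   # ~0.0022888
```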
<h2 id="latency-of-queue-networks">Latency of Queue Networks</h2>
<p>As with the queue models we’ve seen so far, the steady-state probabilities
are not that interesting on their own. Rather, the results that follow
from these probabilities are what interest us. To determine the average
latency for the network \(W_{net}\) recall Little’s Law:</p>
\[L = \lambda W\]
<p>The mean number of customers \(L\) is equal to the arrival rate \(\lambda\)
multiplied by the mean latency \(W\). We know the arrival rate for our network,
so if we can calculate the mean number of customers in the network, the latency
will follow. Since we are able to consider each queue in isolation after
solving the flow balance equations, it’s enough to calculate the mean number
of customers for each queue and then sum them:</p>
\[L_{net} = \sum_{i=1}^{J}L_i = \sum_{i=1}^{J} \frac{\rho_i}{1 - \rho_i}\]
<p>For our network:</p>
\[\begin{align}
L_{net} &= \frac{\rho_1}{1 - \rho_1} + \frac{\rho_2}{1 - \rho_2} + \frac{\rho_3}{1 - \rho_3} \\
&= \frac{0.625}{0.375} + \frac{0.5}{0.5} + \frac{0.5}{0.5} \\
&\approx 3.6667
\end{align}\]
<p>With \(L_{net}\) in hand, we can now calculate the latency \(W_{net}\) for
our network:</p>
\[\begin{align}
W_{net} &= \frac{L_{net}}{\lambda} \\
&\approx \frac{3.6667}{500} \\
&\approx 0.0073334
\end{align}\]
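<p>The whole latency calculation fits in a few lines of Python (a sketch, using the example network’s numbers):</p>

```python
# Sketch: mean occupancy and latency for the example network, treating
# each queue as an independent M/M/1 queue and applying Little's Law.

lam = 500.0
rhos = [0.625, 0.5, 0.5]

L_net = sum(r / (1 - r) for r in rhos)   # 5/3 + 1 + 1 customers
W_net = L_net / lam                      # Little's Law: W = L / lambda

print(L_net, W_net)   # ~3.6667 customers, ~0.00733 time units
```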
<p>Coarse-grained results such as average wait time and average occupancy
give us rough insight into our queue networks. We can gain better insight
using simulation tools such as <a href="http://www.perfdynamics.com/Tools/PDQ.html">PDQ</a> and <a href="http://simjs.com/queuing/index.html">SimJS</a>. SimJS provides
a drag-and-drop interface for designing queue networks and can simulate
many hours of queue activity in a handful of minutes.</p>
<p>I plan to write about network simulation more in a later entry, but for
now I recommend you try out SimJS.</p>
<h3 id="conclusion">Conclusion</h3>
<p>Queue networks are a useful tool for modelling complex distributed
applications. We gain the simplicity of Jackson networks provided we
ensure Markovian routing throughout our model. If our network is free from
cycles, calculating flow balance is simply a case of tracing traffic from
the entrypoints of the network all the way through to the exit points.</p>
<p>When modelling your own systems using queue theory, prefer network models
over \(M/M/c\) models. Networks afford the flexibility to model varying
service rates across the servers in the network, and provide a means to
model the overhead of traffic routing.</p>Modelling Multi-Server Queues2016-03-07T07:00:00+00:002016-03-07T07:00:00+00:00http://robharrop.github.io/maths/performance/2016/03/07/multi-server-queues<p>A few questions seem to come up again and again from the people who’ve
been reading my posts on queue theory. Perhaps the most common question
is: “How do I model multi-server applications using queues?”. This is an
excellent question since most of us will be running production systems
with more than one server, be that multiple collaborating services or just
a simple load-balanced service that has a few servers sharing the same
incoming queue of customers.</p>
<p>In this post, I want to address the simplest model for multiple servers:
the \(M/M/c\) queue. Like the \(M/M/1\) queue I described in an <a href="/maths/performance/2016/02/20/service-latency-and-utilisation.html">earlier
post</a>, the \(M/M/c\) queue has inter-arrival times
exponentially-distributed with rate \(\lambda\), and service rate
exponentially-distributed with rate \(\mu\). The difference, which should
be obvious, is that rather than having just one server, we can have any
positive number.</p>
<p>The measure of traffic intensity for \(M/M/1\) and \(M/M/c\) queues is
\(\rho = \lambda / \mu\). For \(M/M/1\) queues \(\rho\) is also the
measure of utilisation, but for \(M/M/c\) queues we have utilisation \(a
= \rho / c\). The stability condition for \(M/M/c\) queues is \(a = \rho
/ c < 1\).</p>
<h2 id="what-to-model">What to model?</h2>
<p>One of the most important questions we can answer is: what should be
modelled as a multi-server queue? One reader asked whether
a multi-threaded server is best modelled using an \(M/M/c\) queue with
\(c\) equal to the number of threads. This is a tough question, but to
answer it we should consider the requirement that, for an \(M/M/c\) queue,
each of the servers must be independent.</p>
<p>If we are modelling a coarse-grained service like a web server, then
I think there’s enough interference between the threads to model each
server process as an \(M/M/1\) queue rather than as an \(M/M/c\) process.
Indeed, we might even go further and model each distinct <strong>machine</strong> as an
\(M/M/1\) queue, and only use an \(M/M/c\) queue to model multiple
machines serving the same stream of customers.</p>
<p>If we were modelling a low-level component like a thread scheduler, then
we would likely use an \(M/M/c\) queue, with \(c\) equal to the number of
CPUs, but at the coarse granularity of a web server, we can safely ignore
the number of CPUs and threads and use an \(M/M/1\) queue.</p>
<h2 id="steady-state-probabilities">Steady-State Probabilities</h2>
<p>We’ll calculate the average latency of \(M/M/c\) queues from the
steady-state probabilities. As I did in the previous entries, I’m not
going to discuss the derivation of these probabilities (although I promise
to do so in an upcoming post). Remember that the steady-state
probabilities \(p_{n}\) tell us the probability of there being \(n\)
customers in the system. We’ll start with \(p_{0}\):</p>
\[p_{0}
= \Bigg(\sum_{n=0}^{c-1}\frac{\rho^{n}}{n!}+\Big(\frac{\rho^{c}}{c!}\Big)\Big(\frac{1}{1 - a}\Big)\Bigg)^{-1}\]
<p>For \(n \geq 1\), we must account for two scenarios: when the number
of customers is less than the number of servers (\(n < c\)), and when the
number of customers is greater than or equal to the number of servers (\(n
\geq c\)):</p>
\[p_{n} =
\begin{cases}
p_{0} \frac{\rho^{n}}{n!} & n < c \\
p_{0} \frac{a^{n}c^{c}}{c!} & n \geq c \\
\end{cases}\]
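<p>These formulas translate directly into code. An illustrative Python sketch (the arrival and service rates are arbitrary choices):</p>

```python
from math import factorial

# Sketch: steady-state probabilities for an M/M/c queue, straight from
# the formulas above, with rho = lam/mu and per-server utilisation a = rho/c.

def mmc_probs(lam, mu, c, k_max):
    rho = lam / mu
    a = rho / c
    p0 = 1 / (sum(rho**n / factorial(n) for n in range(c))
              + (rho**c / factorial(c)) * (1 / (1 - a)))
    probs = [p0]
    for n in range(1, k_max + 1):
        if n < c:
            probs.append(p0 * rho**n / factorial(n))
        else:
            probs.append(p0 * a**n * c**c / factorial(c))
    return probs

probs = mmc_probs(lam=3.0, mu=2.0, c=2, k_max=200)
print(sum(probs))   # ~1: nearly all the mass sits in the first 200 states
```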
<h2 id="probability-of-waiting">Probability of Waiting</h2>
<p>Since we have Poisson arrivals, <a href="https://en.wikipedia.org/wiki/Arrival_theorem">we can calculate</a> the probability that
a customer has to wait, by summing \(p_{n}\) starting at \(c\) and
proceeding to \(\infty\): \(p_{queue} = \sum_{n=c}^{\infty}p_{n}\). The
expanded form of this is called <a href="https://en.wikipedia.org/wiki/Erlang_(unit)#Erlang_C_formula">Erlang’s C Formula</a>:</p>
\[C(c, \rho) = \frac{\frac{\rho^c}{c!}\frac{c}{c - \rho}}{\sum_{n=0}^{c-1}\frac{\rho^{n}}{n!} + \frac{\rho^{c}c}{c!(c - \rho)}}\]
<p>If we plot this function for different values of \(c\), we can easily
see how adding more servers to our system reduces the likelihood a
customer will have to wait:</p>
<p><img src="/assets/figures/posts/2016-03-07-multi-server-queues/comparing-queue-probabilities-1.png" alt="plot of chunk comparing-queue-probabilities" /></p>
<p>By the time we have four servers, the chance of waiting is
barely noticeable, even when \(\rho = 1\).</p>
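<p>Erlang’s C formula is straightforward to compute. A Python sketch reproducing the trend in the plot (the traffic intensity value is an arbitrary choice):</p>

```python
from math import factorial

# Sketch: Erlang's C formula, the probability that an arriving customer
# has to wait in an M/M/c queue with traffic intensity rho = lam / mu.

def erlang_c(c, rho):
    if rho >= c:
        raise ValueError("unstable: need rho < c")
    top = (rho**c / factorial(c)) * (c / (c - rho))
    bottom = sum(rho**n / factorial(n) for n in range(c)) + \
             (rho**c * c) / (factorial(c) * (c - rho))
    return top / bottom

# more servers at the same traffic intensity -> waiting is less likely
print([round(erlang_c(c, 0.9), 4) for c in (1, 2, 4)])
```

<p>Note that for \(c = 1\) the formula collapses to \(C(1, \rho) = \rho\): an arriving customer waits exactly when the single server is busy.</p>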
<h2 id="multi-server-wait-times">Multi-Server Wait Times</h2>
<p>The average time spent waiting in the queue \(W_q\) is:</p>
\[W_q = \frac{C(c,\rho)}{\mu(c - \rho)}\]
<p>From this we get the average latency \(W\) quite easily:</p>
\[W = \frac{1}{\mu} + W_q\]
<p>If we plot average latency for various values of \(c\), we
see how adding more servers is an effective way of reducing
latency</p>
<p><img src="/assets/figures/posts/2016-03-07-multi-server-queues/comparing-wait-times-1.png" alt="plot of chunk comparing-wait-times" /></p>
<p>Take note of the log scale on the y-axis. At \(\rho = 1\), the \(M/M/1\)
queue is at 100% utilisation and latency is tending towards \(\infty\).
The extra capacity with \(c=2\) and \(c=4\) is directly reflected in the
significantly smaller latencies.</p>
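<p>Putting the two formulas together gives a small latency calculator. An illustrative Python sketch (Erlang C is re-implemented so the snippet stands alone; the parameter values are arbitrary):</p>

```python
from math import factorial

# Sketch: mean latency W = 1/mu + W_q for an M/M/c queue, where
# W_q = C(c, rho) / (mu * (c - rho)).

def erlang_c(c, rho):
    top = (rho**c / factorial(c)) * (c / (c - rho))
    bottom = sum(rho**n / factorial(n) for n in range(c)) + top
    return top / bottom

def latency(lam, mu, c):
    rho = lam / mu
    if rho >= c:
        return float("inf")       # unstable: latency diverges
    wq = erlang_c(c, rho) / (mu * (c - rho))
    return 1 / mu + wq

# same per-server rate, more servers -> far lower latency near saturation
print([latency(0.95, 1.0, c) for c in (1, 2, 4)])
```

<p>For \(c = 1\) this reduces to the familiar \(M/M/1\) result \(W = 1 / (\mu - \lambda)\).</p>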
<h2 id="faster-servers-or-more-servers">Faster Servers or More Servers?</h2>
<p>When deploying an application, it’s interesting to consider whether
a smaller number of faster servers is better than a larger number of
slower servers. Ignoring any discussion of reliability, we can compare the
latency of different \(M/M/c\) queues to help us pick a configuration.</p>
<p>The plot below compares two queue models, one with \(\mu = 5\) and \(c
= 3\) and the other with \(\mu = 10\) and \(c = 2\).</p>
<p><img src="/assets/figures/posts/2016-03-07-multi-server-queues/comparing-models-1.png" alt="plot of chunk comparing-models" /></p>
<p>As you might expect, the queue with the lowest service rate has a higher
baseline latency. However, because there are more servers in that queue,
the latency as \(\rho\) increases remains steady. Recall the stability
condition \(a = \lambda / (c \mu) < 1\), and it should be apparent that
more servers will result in longer periods of latency stability when
\(\lambda > \mu\).</p>
<p>To see more configurations in action, I’ve created a <a href="https://robharrop.shinyapps.io/mmc-latency-simulation/">small simulator</a>
that you can use to compare two different queue models.</p>
<h2 id="limitations-of-the-mmc-model">Limitations of the \(M/M/c\) model</h2>
<p>The \(M/M/c\) model is a reasonable way to model systems with multiple
servers, but it has some limitations. Since the service rate \(\mu\) is
a global parameter, it is not possible to model systems that have
different service rates per server. In a cloud scenario you might have
a set of core servers - all with the same service rate - running all the
time. During periods of heavy load, you might scale up with some
additional resources, but these may well have a different service rate,
particularly if your base servers are especially beefy.</p>
<p>Another limitation with the \(M/M/c\) model is that it doesn’t account for
the overhead of splitting incoming traffic between the servers. In a web
environment, the individual servers receive their load from some
load-balancing infrastructure. The load balancer will also have a service
rate describing how fast it can do its work.</p>
<p>In my next post, I’ll discuss addressing these weaknesses using queue
networks. As the name implies, queue networks describe how individual
queues are composed into collaborating networks. A web application running
on two servers is described as a queue network with three nodes: one for
the load balancer, and one for each of the servers.</p>

Reject Them or Make Them Wait? (2016-02-27) http://robharrop.github.io/maths/performance/2016/02/27/finite-queue-latency

<p>After showing my <a href="/maths/performance/2016/02/20/service-latency-and-utilisation.html">previous post</a> around at <a href="http://www.skipjaq.com">work</a>, a colleague responded
with <a href="http://www.webperformance.com/library/reports/windows_vs_linux_part1/">this article</a> in which the author compares the performance of
a Java EE application running on Windows and on Linux. When running on
Linux, the application exhibits the performance characteristics outlined
in my post: at high utilisation, latency grows uncontrollably. What might
be surprising, however, is that on Windows, latency doesn’t change much at
all, even at high utilisation. Does this mean that the results we saw for
\(M/M/1\) queues are wrong? Not quite! Whereas the Linux results show
increased latency at high utilisation, the Windows results show an
<strong>increased error count</strong>; at high utilisation Windows is simply dropping
connections and kicking waiting customers out of the queue.</p>
<p>Recall from the discussion on \(M/M/1\) queues that we assumed the
queue was of infinite size. No matter the load, there’s always space for
a new customer in an \(M/M/1\) queue. The behaviour exhibited by the
Windows system in the article isn’t that of an infinite queue, but instead
that of a finite queue. At some limit - the precise details of which are
irrelevant to this discussion - the queue gets full and customers are
rejected.</p>
<p>What’s the latency like for a queue that has finite size? If the queue can
reject customers, what’s the probability that a potential customer will be
allowed in to the queue? Let’s answer these questions by looking at the
\(M/M/1/K\) model.</p>
<p>An \(M/M/1/K\) queue behaves much like an \(M/M/1\) queue: arrivals are
a Poisson process, service times are exponentially-distributed, and there
is a single server. However, unlike \(M/M/1\) queues which allow an
unbounded number of customers into the system at any time, \(M/M/1/K\)
queues have an upper bound of \(K\) customers.</p>
<h2 id="steady-state-probabilities">Steady-State Probabilities</h2>
<p>All of the interesting calculations for \(M/M/1/K\) queues depend on the
steady-state probabilities, that is, the probability \(p_{n}\) that there
are \(n\) customers in the queue:</p>
\[p_{n} =
\begin{cases}
\frac{(1 - \rho)\rho^{n}}{1 - \rho^{K + 1}} & (\rho \neq 1)\\
\frac{1}{K + 1} & (\rho = 1) \\
\end{cases}\]
<p>Where \(\rho = \lambda / \mu\), \(\lambda\) is the arrival rate and
\(\mu\) is the service rate. Unlike the steady-state probabilities for
\(M/M/1\) queues, the probabilities for \(M/M/1/K\) queues are defined for
\(\rho \geq 1\) thanks to the limiting factor of \(K\).</p>
<h2 id="average-customers-in-the-system">Average Customers in the System</h2>
<p>Using the steady-state probabilities we can now calculate the average
number of customers in the system \(L\):</p>
\[L = \sum_{n=0}^{K} n \cdot p_{n}\]
<p>We can plot the mean number of customers in the system as \(\rho\)
increases and with \(K = 10\):</p>
<p><img src="/assets/figures/posts/2016-02-27-finite-queue-latency/customers-vs-utilisation-1.png" alt="plot of chunk customers-vs-utilisation" /></p>
<p>As we can see, no matter how large \(\rho\) gets, the number of customers
in the system never exceeds the bound set by \(K\).</p>
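<p>As a quick numeric check of these formulas, here is a short sketch (in Python; the helper names are my own) that computes the steady-state probabilities and \(L\) for an \(M/M/1/K\) queue:</p>

```python
def mm1k_pn(rho, K, n):
    """Steady-state probability of n customers in an M/M/1/K queue."""
    if rho == 1:
        return 1 / (K + 1)
    return (1 - rho) * rho**n / (1 - rho**(K + 1))

def mm1k_L(rho, K):
    """Mean number of customers in the system: L = sum over n of n * p_n."""
    return sum(n * mm1k_pn(rho, K, n) for n in range(K + 1))

# As rho grows, L saturates just below the bound K = 10
for rho in [0.5, 1.0, 2.0, 10.0]:
    print(f"rho={rho}: L={mm1k_L(rho, 10):.2f}")
```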
<h2 id="calculating-average-latency">Calculating Average Latency</h2>
<p>Now that we can calculate the average number of customers in the system
for a given \(\rho\), we can use Little’s Law to calculate the mean
latency. Recall from the discussion of \(M/M/1\) queues that Little’s Law
relates the average number of customers in the system \(L\) to the average
waiting time \(W\) and the arrival rate \(\lambda\):</p>
\[L = \lambda W\]
<p>Which we can re-arrange to:</p>
\[W = \frac{L}{\lambda}\]
<p>We might think that we can proceed from here to calculate the latency for
one of our \(M/M/1/K\) queues, but first we must ask ourselves: is
\(\lambda\) really a good measure of the arrival rate? Sure, \(\lambda\)
measures the rate at which customers <strong>want</strong> to arrive in the
system, but thanks to our limit \(K\), some of those customers are turned
away; \(\lambda\) is not a good measure of the number of arrivals that
actually make it into the system. We’ll call the rate of arrivals that do
make it in the <em>effective arrival rate</em>, \(\lambda_{eff}\).</p>
<p>We calculate \(\lambda_{eff}\) by realising that we will accept customers
into the system if we’re not at the limit \(K\). Using our steady-state
probabilities we have:</p>
\[\lambda_{eff} = \lambda \cdot (1 - p_{K})\]
<p>That is, the effective arrival rate is the arrival rate multiplied by the
probability that we are <strong>not</strong> at maximum capacity. We can now calculate
latency using the effective arrival rate:</p>
\[W = \frac{L}{\lambda_{eff}}\]
<p>Plotting the wait times of \(M/M/1\) and \(M/M/1/K\) queues side-by-side as
\(\lambda\) increases shows us how the limit \(K\) affects latencies at
high \(\rho\). For these plots, \(\mu = 100\) and \(K = 10\):</p>
<p><img src="/assets/figures/posts/2016-02-27-finite-queue-latency/latency-by-arrival-rate-1.png" alt="plot of chunk latency-by-arrival-rate" /></p>
<p>We can see that the latency profile of the \(M/M/1/K\) queue doesn’t have
the same degenerate behaviour as the \(M/M/1\) queue; the finite
queue size prevents customers from seeing unbounded latencies.</p>
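<p>Putting the pieces together, here is a sketch (in Python; the function name is my own) of the latency calculation via \(\lambda_{eff}\) and Little’s Law:</p>

```python
def mm1k_latency(lam, mu, K):
    """Mean time in system W for an M/M/1/K queue via Little's Law,
    using the effective arrival rate lambda_eff = lambda * (1 - p_K)."""
    rho = lam / mu
    if rho == 1:
        p = [1 / (K + 1)] * (K + 1)
    else:
        p = [(1 - rho) * rho**n / (1 - rho**(K + 1)) for n in range(K + 1)]
    L = sum(n * pn for n, pn in enumerate(p))
    lam_eff = lam * (1 - p[K])     # arrivals that actually get in
    return L / lam_eff

# mu = 100, K = 10: latency stays bounded even as lambda passes mu
for lam in [50, 90, 99, 150, 300]:
    print(f"lam={lam}: W={mm1k_latency(lam, 100, 10):.4f}")
```

<p>At low load this agrees closely with the \(M/M/1\) value \(1/(\mu - \lambda)\); under overload it flattens out instead of blowing up.</p>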
<h2 id="probability-of-getting-rejected">Probability of Getting Rejected</h2>
<p>If customers aren’t seeing unbounded latencies, does that make finite
queues some kind of panacea? Obviously not! As we know from the Windows
vs. Linux benchmark that motivated this article, Windows traded
unacceptable latencies for an increased error count. We can quantify this
error count by calculating the probability of a customer getting rejected.
We call this the <em>loss probability</em> \(p_{loss}\). It doesn’t take much
thought to realise that the probability of getting rejected is simply
\(p_{K}\), the probability that the queue is full:</p>
\[p_{loss} =
\begin{cases}
\frac{(1 - \rho)\rho^{K}}{1 - \rho^{K + 1}} & (\rho \neq 1)\\
\frac{1}{K + 1} & (\rho = 1) \\
\end{cases}\]
<p>We can plot \(p_{loss}\) against \(\rho\) to see how the chance of seeing
an error increases as \(\rho\) increases:</p>
<p><img src="/assets/figures/posts/2016-02-27-finite-queue-latency/loss-vs-utilisation-1.png" alt="plot of chunk loss-vs-utilisation" /></p>
<p>Here we can see that, as the arrival rate overtakes the service rate and
\(\rho\) grows, the loss probability tends towards 100%. Note the log scale on
the x-axis: even at \(\rho = 1000\%\)
the probability of loss still isn’t 100%.</p>
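<p>The loss probability is easy to compute directly; a minimal sketch (in Python, function name mine):</p>

```python
def p_loss(rho, K):
    """Probability an arriving customer is rejected: p_K, the chance
    the M/M/1/K queue is already full."""
    if rho == 1:
        return 1 / (K + 1)
    return (1 - rho) * rho**K / (1 - rho**(K + 1))

# Even at rho = 10 (arrivals 10x the service rate), loss is not quite 100%
for rho in [0.5, 1.0, 2.0, 10.0]:
    print(f"rho={rho}: p_loss={p_loss(rho, 10):.4f}")
```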
<h2 id="a-note-on-utilisation">A Note on Utilisation</h2>
<p>In my previous post I referred to the quantity \(\rho\) as the utilisation of
the queue. This was a little imprecise, because \(\rho\) is not a measure of
utilisation for all queues. Let’s dig deeper to really understand what \(\rho\)
and utilisation are measuring.</p>
<p>The ratio \(\rho = \lambda / \mu\) is a measure of <em>traffic intensity</em>. It
tells us how much traffic is in the entire universe of our queue and how much
processing capacity we have. Utilisation is a quantity that tells us how busy
our system is. For \(M/M/1\) queues, which have no bound, \(\rho\) and
utilisation are the same, because we’re never going to turn away a customer.</p>
<p>For \(M/M/1/K\) queues, \(\rho\) represents the intensity of traffic that we
are <em>seeing</em>, but the utilisation tells us how much of that traffic is actually
occupying the system. It should be obvious that we can measure utilisation as
the probability that the system is not empty: \(U = 1 - p_{0}\).</p>
<p>The relationship between \(\rho\) and \(U\) is easy to see with a plot:</p>
<p><img src="/assets/figures/posts/2016-02-27-finite-queue-latency/rho-vs-utilisation-1.png" alt="plot of chunk rho-vs-utilisation" /></p>
<p>Notice how utilisation only approaches 100% <strong>after</strong> \(\rho\) passes 100% - this is
the limiting factor \(K\) in action.</p>
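<p>We can reproduce the shape of this plot numerically; a small sketch (in Python, assumptions as before: \(K = 10\), helper name mine):</p>

```python
def utilisation(rho, K):
    """Utilisation of an M/M/1/K queue: probability the server is busy,
    i.e. U = 1 - p_0."""
    if rho == 1:
        p0 = 1 / (K + 1)
    else:
        p0 = (1 - rho) / (1 - rho**(K + 1))
    return 1 - p0

# Utilisation only closes in on 100% well after rho exceeds 100%
for rho in [0.5, 0.9, 1.0, 1.5, 3.0]:
    print(f"rho={rho}: U={utilisation(rho, 10):.4f}")
```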
<h2 id="conclusion">Conclusion</h2>
<p>The benchmark that motivated this article showed that we can trade
unbounded latency for an increase in rejected connections. While the
Linux system behaved like an \(M/M/1\) queue, letting latencies grow but
trying to serve every request, the Windows system behaved like an
\(M/M/1/K\) queue, guaranteeing an acceptable latency for all requests
that were accepted, but rejecting most requests when the system was
heavily utilised.</p>
<p>The question remains: is it better to let latencies grow, or should we
reject some customers to ensure others get a better service? The answer
largely depends on the circumstances of your system, but it should be
apparent that latencies become unacceptable quickly when your system is
heavily loaded and you’re not limiting the number of customers you serve.
In my opinion, it’s better to preserve good service for a smaller number
of customers than to give bad service to all customers, which is what
will happen as latency degenerates under heavy load if
your queue isn’t bounded.</p>

Relating Service Utilisation to Latency (2016-02-20) http://robharrop.github.io/maths/performance/2016/02/20/service-latency-and-utilisation

<p>At <a href="http://www.skipjaq.com">Skipjaq</a>, we are interested in how applications
perform as they approach the maximum sustainable load. We don’t want to
completely saturate an application so it falls over, but we also don’t
want to under-load the application and miss out on true performance
numbers. In particular, we are interested in finding points in the load
where latencies are on the precipice of moving outside acceptable limits.</p>
<p>In a recent conversation with the team about web application latencies,
I mentioned that, as a general rule, we should expect latencies to degrade
sharply once the service hits around 80% utilisation. More specifically,
we should expect the <em>wait</em> time of the service to degrade, which will
cause the latency to degrade in turn.</p>
<p>John D. Cook wrote <a href="http://www.johndcook.com/blog/2009/01/30/server-utilization-joel-on-queuing/">a great explanation</a> of why this is the case, but
I wanted to write a slightly deeper explanation for those who have no prior
experience with queuing theory.</p>
<h2 id="services-as-queues">Services as Queues</h2>
<p>The argument for why latency degrades so badly at 80% follows directly from
the results of queuing theory. We can start to understand this by first
understanding how a service like a web application can be modelled using
queuing theory.</p>
<p>For the purpose of this discussion, we’ll assume that we are interested in
measuring the latency of a web application - the service - and that we are
running that application on a single server. Requests arrive at the
service and are processed as quickly as possible. If the service is too
busy processing other requests when a new request arrives, then that
request waits in the queue until the service can process it. For
simplicity, we’ll assume that the queue is unbounded and that once
a request is in the queue, the only way it can leave is by getting
processed by the service.</p>
<p>The simplest queue model we can ascribe to our service is the \(M/M/1\)
model. This notation is called <a href="https://en.wikipedia.org/wiki/Kendall%27s_notation">Kendall’s notation</a> and takes the
general form \(A/S/c\), where \(A\) is the arrival process, \(S\) is the
service time distribution and \(c\) is the number of servers.</p>
<p>Our fictional service has only one server hence \(c = 1\). The \(M\) in the
model stands for Markov. The Markovian arrival process describes
a <a href="https://en.wikipedia.org/wiki/Poisson_point_process">Poisson process</a>, that is, a process where the time between each
arrival and the next (the inter-arrival time) is exponentially-distributed
with parameter \(\lambda\). The Markovian service time distribution has
service times exponentially-distributed with parameter \(\mu\).</p>
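<p>If you want a feel for what a Markovian arrival process looks like, here is a small sketch (in Python; not from the original post) that draws exponentially-distributed inter-arrival times and checks their mean against \(1/\lambda\):</p>

```python
import random

random.seed(42)
lam = 5.0  # arrival rate: on average 5 customers per unit time

# For a Poisson arrival process, inter-arrival times are
# exponentially distributed with mean 1 / lambda.
gaps = [random.expovariate(lam) for _ in range(100_000)]
mean_gap = sum(gaps) / len(gaps)
print(f"mean inter-arrival time: {mean_gap:.4f} (expected {1 / lam:.4f})")
```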
<h2 id="queue-utilisation">Queue Utilisation</h2>
<p>We define service utilisation as the percentage of time the service is
busy serving requests. For \(M/M/1\) queues, the utilisation is given by
\(\rho = \lambda / \mu\). The queue is only stable when \(\rho < 1\).
This makes intuitive sense; if there are more arrivals than can be
processed by the server, then the queue will grow indefinitely.</p>
<h2 id="calculating-latency">Calculating Latency</h2>
<p><a href="https://en.wikipedia.org/wiki/Little%27s_law">Little’s Law</a> is one of the most interesting results from queuing
theory. Put simply it states that the average number of customers in
a stable system (\(L\)) is equal to the arrival rate (\(\lambda\))
multiplied by the average time a customer spends in the system (\(W\)):</p>
\[L = \lambda W\]
<p>The average time a customer spends in the system is equivalent to the
average latency a customer will see. This value is a combination of the
average service time and the average time spent waiting in the queue.
Intuitively we should realise that, in a running system, the average
service time is broadly fixed and that latency variations typically stem
from variations in the wait time.</p>
<p>Since we are interested in calculating latency and not the average number
of customers in the system, we can rearrange Little’s Law to put the
latency (\(W\)) on the left-hand side:</p>
\[W = \frac{L}{\lambda}\]
<p>So now, if we know the average number of customers in the system, we can
calculate the wait time. The mean number of customers in an \(M/M/1\) queue
is given by:</p>
\[\frac{\rho}{1 - \rho}\]
<p>Deriving this equation from first principles is beyond the scope of this
blog, but it follows from the steady-state probabilities of the Markov
chain describing the process. You can find a good description of the maths
behind this <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.136.9734&rep=rep1&type=pdf">here</a>.</p>
<p>Reminding ourselves that \(\rho = \lambda / \mu\):</p>
\[W = \frac{\frac{\rho}{1 - \rho}}{\lambda} = \frac{\rho}{\lambda (1 - \rho)} = \frac{1 / \mu}{1 - \rho} = \frac{1}{\mu - \lambda}\]
<p>So now we have a simple formula that relates latency to the arrival rate
and the service rate, but what we really want is a formula relating
<em>utilisation</em> to latency. To do this, recognise that \(\lambda = \rho
\mu\):</p>
\[W = \frac{1}{\mu - \lambda} = \frac{1}{\mu - \rho \mu} = \frac{1}{\mu (1 - \rho)}\]
<p>As discussed, we can assume that \(\mu\) is constant for a running system
and that the main contribution to changes in service utilisation will come
from changes in the arrival rate. Thus the latency is proportional to
\(1/(1 - \rho)\). If we plot this, we can see a sharp uptick in latency
when utilisation hits around 80%, after which the latency tends towards
infinity as the utilisation tends towards 100%.</p>
<p><img src="/assets/latency-utilisation/plot.png" alt="Plotting Latency vs. Utilisation" /></p>
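<p>To see just how sharply this blows up, here is a quick numeric sketch (in Python; function name mine) of \(W = 1/(\mu(1 - \rho))\) for a service with \(\mu = 100\) requests per second, i.e. a 10ms baseline service time:</p>

```python
def mm1_latency(mu, rho):
    """Mean time in system for an M/M/1 queue at utilisation rho."""
    assert rho < 1, "queue is unstable at rho >= 1"
    return 1 / (mu * (1 - rho))

# mu = 100 requests/sec: baseline service time is 10ms
for rho in [0.5, 0.8, 0.9, 0.95, 0.99]:
    print(f"rho={rho:.2f}: W={mm1_latency(100, rho) * 1000:.1f}ms")
```

<p>Latency doubles between 90% and 95% utilisation, and grows five-fold again by 99% - the hockey stick in the plot above.</p>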
<h2 id="conclusion">Conclusion</h2>
<p>Once service utilisation exceeds 80%, latencies suffer dreadfully. To avoid being
surprised by disastrous latencies in production systems, it’s important to monitor
utilisation and take action as it approaches the 80% danger zone.</p>
<p>When testing system performance, loading a system much beyond the 80% utilisation
mark will likely result in latencies that are wildly unacceptable. Load that
system at close to 100% and you should expect to wait quite some time to see
your tests complete!</p>