This is part three of a series about the server scheduler in OpenStack Nova. See parts one and two.
Warning: This document describes code that is (as of December 2015) in development. Some of it is not yet released. Assuming everything goes well, a more accurate scheduler will be available as part of the Mitaka release.
The job of the scheduler is to return a list of hosts on which to emplace the instances described by a RequestSpec. That list of hosts is selected from all the compute nodes that Nova knows about.
This post focuses on how the filter scheduler works. There are other schedulers that start from the same list of hosts but handle it differently. For example, the chance scheduler picks a suitable number of hosts at random.
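To give a rough feel for the difference, the chance approach amounts to little more than picking names out of a hat. This is a minimal sketch of that idea with made-up names, not Nova's actual code:

```python
import random

def choose_random_hosts(available_hosts, num_instances):
    """Pick one host per requested instance, entirely at random."""
    if not available_hosts:
        raise RuntimeError("no hosts available")
    # The same host can legitimately be chosen more than once, so pick per
    # instance rather than sampling without replacement.
    return [random.choice(available_hosts) for _ in range(num_instances)]
```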
Every time select_destinations is called on the filter scheduler it asks a HostManager for a list of HostState objects. Each HostState represents the state (available disk and RAM, CPU architecture, and so on) of a single compute node. The HostManager maintains this list, updating it on each request against the list of active compute nodes: if a host is already in its map it is updated with new state; if it is not in the map an entry is created; and if there are hosts in the map with no corresponding active node, those entries are removed.
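That reconciliation can be sketched roughly like this. The function and attribute names are illustrative (plain dicts stand in for HostState objects), not Nova's exact API:

```python
def refresh_host_states(host_state_map, active_compute_nodes):
    """Keep a dict of host name -> state dict in step with the active nodes."""
    active_names = set()
    for node in active_compute_nodes:
        active_names.add(node["host"])
        state = host_state_map.get(node["host"])
        if state is None:
            # Host not yet in the map: create an entry for it.
            host_state_map[node["host"]] = dict(node)
        else:
            # Host already known: refresh it with the node's latest state.
            state.update(node)
    # Hosts in the map with no corresponding active node are dropped.
    for name in list(host_state_map):
        if name not in active_names:
            del host_state_map[name]
    return list(host_state_map.values())
```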
A subclass of the filter scheduler caches the data from the HostManager instead of refreshing it on every request. This can speed things up (by limiting the need to query the compute nodes for updated state) but increases the risk of scheduler retries: a selected destination may no longer be able to accept a requested instance because its state has changed since it was selected.
The collection of HostState objects is returned to the scheduler, which winnows it by passing it through an ordered list of filters. Each filter returns only those hosts which pass it. The shrinking collection is passed to each subsequent filter and finally handed back to the scheduler. If the resulting collection is smaller than the requested number of instances, the request fails with an error.
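In outline the winnowing looks something like this. The host_passes method mirrors the general shape of Nova's filters, but the filter class and data structures here are simplified, made-up examples:

```python
class EnoughRamFilter(object):
    """Toy filter: pass hosts that have at least the requested RAM free."""
    def host_passes(self, host, request_spec):
        return host["free_ram_mb"] >= request_spec["memory_mb"]

def filter_hosts(hosts, filters, request_spec):
    """Run the hosts through each filter in turn; each pass shrinks the list."""
    for f in filters:
        hosts = [h for h in hosts if f.host_passes(h, request_spec)]
        if not hosts:
            break  # nothing left to filter, stop early
    return hosts
```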
The collection is then ordered by weighers. These sort the hosts that passed the filters by how well they satisfy the request. For example, in the default configuration a host which has more free RAM will be preferred over a host with less. This helps to spread instances amongst hosts. A configuration change can invert the RAM weigher, causing instances to be stacked onto hosts already in use before unused hosts are touched.
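A sketch of the weighing idea using the RAM example: a positive multiplier prefers hosts with more free RAM (spreading), while a negative multiplier inverts the preference and stacks instances onto hosts already in use. This is conceptually similar to Nova's RAM weigher and its multiplier option, but it is not Nova's code:

```python
def weigh_by_free_ram(hosts, multiplier=1.0):
    """Return hosts sorted best-first by weighted free RAM."""
    return sorted(hosts,
                  key=lambda h: multiplier * h["free_ram_mb"],
                  reverse=True)
```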
A host is chosen from the top of the weighed collection and appended to a results list. The HostState associated with the chosen host is updated to consume the resources that will be used by the one instance currently being processed.
If the RequestSpec requires more instances, the already filtered collection of hosts is filtered and weighed again to choose another host (possibly the same one). This looping continues until no more instances are required or the filtering returns no hosts. With each loop resources are consumed, potentially changing the results of both the filtering and the weighing.
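Putting the pieces together, the per-instance loop can be sketched like this, reusing the hypothetical filter and weigher helpers from the sketches above:

```python
def select_hosts(hosts, filters, request_spec, num_instances):
    """Choose a host for each requested instance, consuming resources as we go."""
    selected = []
    for _ in range(num_instances):
        # Re-filter the (already shrinking) collection: resources consumed on
        # a previous pass may now rule a host out.
        hosts = filter_hosts(hosts, filters, request_spec)
        if not hosts:
            break  # filtering returned no hosts; fewer destinations than asked for
        best = weigh_by_free_ram(hosts)[0]
        selected.append(best)
        # Consume the resources the current instance will use, so the next
        # pass of filtering and weighing sees the reduced capacity.
        best["free_ram_mb"] -= request_spec["memory_mb"]
    return selected
```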
The scheduler returns the list of destinations to the conductor. At this stage no instances have yet been built. The conductor pairs the list of destinations with the list of instances it already had and, for each pair, calls build_and_run_instance over RPC to the target destination. If the target destination still has appropriate resources, the instance is built. If something goes wrong or there aren't enough resources, and the originating RequestSpec allows retries, the compute node will ask the conductor to reschedule that one instance.
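The hand-off might be sketched like this. build_and_run_instance is the real name of the RPC call, but the surrounding function, its argument list and the data structures here are simplified assumptions, not the conductor's actual code:

```python
def dispatch_builds(compute_rpcapi, context, instances, destinations,
                    request_spec):
    """Pair each instance with its chosen destination and start the build."""
    for instance, dest in zip(instances, destinations):
        # Fire the build request at the chosen compute host; the compute
        # service does its own final resource check when the request arrives.
        compute_rpcapi.build_and_run_instance(
            context,
            instance=instance,
            host=dest["host"],
            request_spec=request_spec)
```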
That was a narrative overview of what happens in the scheduler. There will be errors and oversimplifications. If you spot something that is incorrect, please leave a comment. There are plenty of places in the scheduler where things can go wrong and plenty of ways in which the code is difficult to manage, extend or maintain. Future posts in this series will investigate these problems and ways to improve the situation.