This is part three of a series about the server scheduler in OpenStack Nova. See parts one and two.
Warning: This document describes code that is (as of December 2015) in development. Some of it is not yet released. Assuming everything goes well, a more accurate scheduler will be available as part of the Mitaka release.
The job of the scheduler is to return a list of hosts on which to emplace the instances described by a RequestSpec. That list of hosts is selected from all the compute nodes that Nova knows about.
This post focuses on how the filter scheduler works. There are other schedulers that start from the same list of hosts but handle it differently. For example, the chance scheduler picks a suitable number of hosts at random.
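To give a rough feel for the difference, the chance approach amounts to little more than picking names out of a hat. This is a minimal sketch of that idea with made-up names, not Nova's actual code:

```python
import random

def choose_random_hosts(available_hosts, num_instances):
    """Pick one host per requested instance, entirely at random."""
    if not available_hosts:
        raise RuntimeError("no hosts available")
    # The same host can legitimately be chosen more than once, so pick per
    # instance rather than sampling without replacement.
    return [random.choice(available_hosts) for _ in range(num_instances)]
```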
Every time select_destinations is called on the filter scheduler it asks a HostManager for a list of HostState objects. Each HostState represents the state (available disk and RAM, CPU architecture, and so on) of a single compute node. The HostManager maintains this list, updating it on each request against the list of active compute nodes: if a host is already in its map it is updated with new state; if it is not in the map an entry is created; and if there are hosts in the map with no corresponding active node, those entries are removed.
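That reconciliation can be sketched roughly like this. The function and attribute names are illustrative (plain dicts stand in for HostState objects), not Nova's exact API:

```python
def refresh_host_states(host_state_map, active_compute_nodes):
    """Keep a dict of host name -> state dict in step with the active nodes."""
    active_names = set()
    for node in active_compute_nodes:
        active_names.add(node["host"])
        state = host_state_map.get(node["host"])
        if state is None:
            # Host not yet in the map: create an entry for it.
            host_state_map[node["host"]] = dict(node)
        else:
            # Host already known: refresh it with the node's latest state.
            state.update(node)
    # Hosts in the map with no corresponding active node are dropped.
    for name in list(host_state_map):
        if name not in active_names:
            del host_state_map[name]
    return list(host_state_map.values())
```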
A subclass of the filter scheduler caches the data from the HostManager instead of refreshing it on every request. This can speed things up (by limiting the need to query the compute nodes for updated state) but increases the risk of scheduler retries: a selected destination may no longer be able to accept a requested instance because its state has changed since it was selected.
The collection of HostState objects is returned to the scheduler, which winnows it by passing it through an ordered list of filters. Each filter returns only those hosts which pass it. The shrinking collection is passed to each subsequent filter and finally handed back to the scheduler. If the resulting collection is smaller than the requested number of instances, the request fails with an error.
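In outline the winnowing looks something like this. The host_passes method mirrors the general shape of Nova's filters, but the filter class and data structures here are simplified, made-up examples:

```python
class EnoughRamFilter(object):
    """Toy filter: pass hosts that have at least the requested RAM free."""
    def host_passes(self, host, request_spec):
        return host["free_ram_mb"] >= request_spec["memory_mb"]

def filter_hosts(hosts, filters, request_spec):
    """Run the hosts through each filter in turn; each pass shrinks the list."""
    for f in filters:
        hosts = [h for h in hosts if f.host_passes(h, request_spec)]
        if not hosts:
            break  # nothing left to filter, stop early
    return hosts
```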
The collection is then ordered by weighers. These sort the hosts that passed the filters by how well they satisfy the request. For example, in the default configuration a host which has more free RAM will be preferred over a host with less. This helps to spread instances amongst hosts. A configuration change can invert the RAM weigher, causing instances to be stacked onto hosts already in use before unused hosts are touched.
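A sketch of the weighing idea using the RAM example: a positive multiplier prefers hosts with more free RAM (spreading), while a negative multiplier inverts the preference and stacks instances onto hosts already in use. This is conceptually similar to Nova's RAM weigher and its multiplier option, but it is not Nova's code:

```python
def weigh_by_free_ram(hosts, multiplier=1.0):
    """Return hosts sorted best-first by weighted free RAM."""
    return sorted(hosts,
                  key=lambda h: multiplier * h["free_ram_mb"],
                  reverse=True)
```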
A host is chosen from the top of the weighed collection and appended to a results list. The HostState associated with the chosen host is updated to consume the resources that will be used by the one instance currently being processed.
If the RequestSpec requires more instances, the already filtered collection of hosts is filtered and weighed again to choose another host (possibly the same one). This looping continues until no more instances are required or the filtering returns no hosts. With each loop resources are consumed, potentially changing the results of both the filtering and the weighing.
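Putting the pieces together, the per-instance loop can be sketched like this, reusing the hypothetical filter and weigher helpers from the sketches above:

```python
def select_hosts(hosts, filters, request_spec, num_instances):
    """Choose a host for each requested instance, consuming resources as we go."""
    selected = []
    for _ in range(num_instances):
        # Re-filter the (already shrinking) collection: resources consumed on
        # a previous pass may now rule a host out.
        hosts = filter_hosts(hosts, filters, request_spec)
        if not hosts:
            break  # filtering returned no hosts; fewer destinations than asked for
        best = weigh_by_free_ram(hosts)[0]
        selected.append(best)
        # Consume the resources the current instance will use, so the next
        # pass of filtering and weighing sees the reduced capacity.
        best["free_ram_mb"] -= request_spec["memory_mb"]
    return selected
```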
The scheduler returns the list of destinations to the conductor. At this stage no instances have yet been built. The conductor pairs the list of destinations with the list of instances it already had and, for each pair, calls build_and_run_instance over RPC to the target destination. If the target destination still has appropriate resources, the instance is built. If something goes wrong or there aren't enough resources, and the originating RequestSpec allows retries, the compute node will ask the conductor to reschedule that one instance.
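The hand-off might be sketched like this. build_and_run_instance is the real name of the RPC call, but the surrounding function, its argument list and the data structures here are simplified assumptions, not the conductor's actual code:

```python
def dispatch_builds(compute_rpcapi, context, instances, destinations,
                    request_spec):
    """Pair each instance with its chosen destination and start the build."""
    for instance, dest in zip(instances, destinations):
        # Fire the build request at the chosen compute host; the compute
        # service does its own final resource check when the request arrives.
        compute_rpcapi.build_and_run_instance(
            context,
            instance=instance,
            host=dest["host"],
            request_spec=request_spec)
```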
That was a narrative overview of what happens in the scheduler. There will be errors and oversimplifications. If you spot something that is incorrect, please leave a comment. There are plenty of places in the scheduler where things can go wrong and plenty of ways in which the code is difficult to manage, extend or maintain. Future posts in this series will investigate these problems and ways to improve the situation.