
Placement Scale Fun

Some notes on exploring how an OpenStack placement service behaves at scale.

The initial challenge is setting up a useful environment. To exercise placement well we need either or both of lots of instances and lots of resource providers (in the form of compute nodes where those instances can land). In the absence of unlimited hardware this needs to be faked in some fashion.

Thankfully, devstack provides ways to make use of the fake virt driver to boot fake instances that don't consume much in the way of resources (but follow the true API during the spawn process), and to create multiple nova-compute processes on the same host to manage those fake instances.

The process of figuring out how to make this go was a combination of grep, talking to people, and trying and failing multiple times. This summary is much tidier than the "omg, I have no idea what I'm doing" process of fail and fail again that led to it.

Also note that I'm not doing formal benchmarking here. Rather I'm doing human observation of where things go wrong, what variables are involved, and how things feel. This is an important precursor to real benchmarking: it gives you a clue how the system works. The setup I'm using would not be ideal for benchmarking, for example, because the VMs are on the same physical host (in this case a dual Xeon E5-2620, 32GB server running ESXi), meaning they impact each other (especially given the way I've configured the VMs) and aren't subject to physical networking.

Another thing to note is that while a lot of this experimentation could be automated, not doing so gives me deeper insight into how things work, exposes bugs that need to be fixed, and has all the usual benefits gained from doing things "the hard way". For formal testing (where repeating things is paramount) all this faffing about by humans would not be good. But for this, it is.

I eventually landed on the following set up with two VMs, one as the control plane (ds1), one as the compute host (cn1).

  • ds1 is a 16 core, 16GB VM. It's hosting control plane services and mysql and rabbitmq. This is where the scheduler and placement run.
  • cn1 is a 10 core, 11GB VM and is running 75 nova-compute processes, the metadata server, and the neutron agent.
  • To limit message bus traffic, notifications are configured to only send unversioned rather than the default of both. There's currently no easy way to disable notifications entirely.
  • The "Noop" quota driver is used because we don't want to care about quotas in this case.
  • The filter scheduler is used, but all filters are turned off.

These last two tricks were learned from some devstack experiments by Matt Riedemann.

Both VMs are Ubuntu Artful and both are using master for all the OpenStack services, except for devstack itself, which needs this fix (to a bug caused by me).

The devstack configurations are relatively straightforward; the important pieces are:

  • Setting the virt driver: VIRT_DRIVER=fake
  • Telling devstack how many fake compute nodes we want: NUMBER_FAKE_NOVA_COMPUTE=75. This will create multiple compute nodes, each of which uses a common config file plus a config file unique to the process that sets the host name of the nova-compute process (required to get unique resource providers). A sketch of that per-process fragment follows this list.
  • Manipulating the nova.conf with a [[post-config|$NOVA_CONF]] section to set a few things.
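
The per-process fragment mentioned above is tiny: its whole job is to give each nova-compute process a distinct host value so that each one registers its own resource provider. As a rough sketch (the filename and host value here are made up for illustration, not what devstack literally writes), fragment number 42 would contain something like:

# e.g. /etc/nova/nova-fake-42.conf (hypothetical path)
[DEFAULT]
host = cn1-fake-42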

The local.conf for cn1 (the compute host) is:

[[local|localrc]]
HOST_IP=192.168.1.149
SERVICE_HOST=192.168.1.76
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
MULTI_HOST=1
MYSQL_HOST=$SERVICE_HOST
RABBIT_HOST=$SERVICE_HOST
GLANCE_HOSTPORT=$SERVICE_HOST:9292
RECLONE=yes
ENABLED_SERVICES=n-cpu,q-agt,n-api-meta,placement-client
VIRT_DRIVER=fake
NUMBER_FAKE_NOVA_COMPUTE=75

[[post-config|$NOVA_CONF]]
[quota]
driver = "nova.quota.NoopQuotaDriver"
[filter_scheduler]
enabled_filters = '""'
[notifications]
notification_format = unversioned

I'm using static IPs because it makes things easier. If you are trying to repeat this in your own environment your HOST_IP and SERVICE_HOST will likely be different. Everything else ought to be the same. Explicitly setting ENABLED_SERVICES ensures that only the stuff you really need is running. See Multi-Node Lab for some more information on multi-node devstack (Note that there is a lot in there you don't need to care about if you aren't actually going to use the VMs that you create in the deployment).

The local.conf for the control plane (ds1) mostly uses defaults but disables some services that we don't care about, and adjusts the nova config as required:

[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
MULTI_HOST=1
VIRT_DRIVER=fake
RECLONE=True

disable_service horizon
disable_service dstat
disable_service tempest
disable_service n-cpu
disable_service q-agt
disable_service n-api-meta

[[post-config|$NOVA_CONF]]
[quota]
driver = "nova.quota.NoopQuotaDriver"
[filter_scheduler]
enabled_filters = '""'
[notifications]
notification_format = unversioned

Note that we are disabling the services that will be running on the compute host.

There are redundancies between these two files. Some of the stuff required by one is in the other. This is because I started out with nova-compute on both hosts and haven't fully rationalized the local.conf files.

Now that we know what we're building we can build it. The control plane (ds1) needs to be in place before the compute host, so build devstack there first:

cd wherever_devstack_is
./stack.sh

and wait. When it completes do the same on the compute host (cn1).

When that is done, the control host needs to be made aware of the compute host, after which you can verify the presence of the 75 hypervisors:

. openrc admin admin
nova-manage cell_v2 discover_hosts
openstack hypervisor list
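
If 75 rows is too many to eyeball, the standard openstack CLI formatting flags make a quick count easy; this should print 75:

openstack hypervisor list -f value -c ID | wc -l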

Playing With It

Once all that is done it is possible to send a few different workload patterns at the service. It's hard to do this in a way that isolates any particular service as they all interact so much.

In my first round of experiments, yesterday, I tried a few different scenarios to get a sense of how things worked and what variables exist.

When booting a large number of servers from a small number of nova boot commands with a high min-count (e.g., 1000), the placement API processes are lost as noise in the face of the much greater effort being made by nova-conductor.

It is only when a larger number of smaller requests (15 concurrent requests for 50 instances each) are made that the placement API begins to show any signs that it is working hard. This is about what you would expect: talking to /allocation_candidates is certainly where most effort happens and most data is processed.
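
To get a feel for what one of those requests looks like, you can hit /allocation_candidates by hand. The following is a sketch rather than exactly what the scheduler sends: the resource amounts are simply the m1.tiny values, the URL assumes devstack's default placement endpoint on the SERVICE_HOST, and 1.10 is the first microversion that provides allocation_candidates:

TOKEN=$(openstack token issue -f value -c id)
curl -s -H "X-Auth-Token: $TOKEN" \
  -H "OpenStack-API-Version: placement 1.10" \
  "http://192.168.1.76/placement/allocation_candidates?resources=VCPU:1,MEMORY_MB:512,DISK_GB:1"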

Today I decided to narrow things down to making lots of parallel boots of single instances, to impact the placement service as much as possible.

If you intend to start many nova boot (or openstack server create) commands at the same time, make sure you do them from a third machine. I tried to do 300 nova boot commands, pushed my load average over 400, and brought the world to a complete stop.

In the current devstack (February 2018) we can use built-in flavor and image references when making a boot request. In addition, since we are booting fakes, we can set the nic to none. This boots one server named foobar using the m1.tiny flavor:

nova boot --flavor 1 --nic none --image cirros-0.3.5-x86_64-disk foobar

We can boot 1000 of those with:

nova boot --flavor 1 --nic none --image cirros-0.3.5-x86_64-disk --min-count 1000 foobar

Each instance will get a numeric suffix. As stated above this doesn't stress placement much.

If we do want to stress placement we need to increase the number of concurrent requests to GET /allocation_candidates, at which point the number of instances per boot request is less of an issue. One way to do this is to background a mess of boot commands:

for i in {1..100}; do
  nova boot --flavor 1 --nic none --image cirros-0.3.5-x86_64-disk ${i}-foobar &
done

But more often than not this will cause the calls to the nova-scheduler process to time out when the conductor tries to call select_destinations. We can work around this by hacking nova-scheduler to run more workers. Since this requires a hack, presumably there's a reason it isn't already possible.

diff --git a/nova/cmd/scheduler.py b/nova/cmd/scheduler.py
index 51d5aee4ac..d794eacaf3 100644
--- a/nova/cmd/scheduler.py
+++ b/nova/cmd/scheduler.py
@@ -45,5 +45,5 @@ def main():

      server = service.Service.create(binary='nova-scheduler',
                                      topic=scheduler_rpcapi.RPC_TOPIC)
-    service.serve(server)
+    service.serve(server, workers=4)
     service.wait()

Running four nova-scheduler workers, the above nova boot command works fine with no timeout. However, code to do this was never merged for reasons (which may or may not still be valid with the existence of placement) discussed on the review and in a related email.

Then I tried:

for i in {1..500}; do
  nova boot --flavor 1 --nic none --image cirros-0.3.5-x86_64-disk ${i}-foobar &
done

500 parallel boots. This caused the Apache process (which provides a front end to keystone, glance, the compute API, and placement) to freeze up until MaxRequestWorkers was raised. Apache (in a default configuration) is a pretty weak link in this stuff. It's easy to see why people prefer nginx in situations where all the web server is really doing is being a reverse proxy.
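
For the record, with the stock Ubuntu Apache the relevant knobs are in the event MPM config. The path and numbers below are a plausible starting point rather than a tuned recommendation (MaxRequestWorkers has to fit within ServerLimit * ThreadsPerChild), and Apache needs a restart afterwards:

# /etc/apache2/mods-available/mpm_event.conf
<IfModule mpm_event_module>
    ServerLimit          8
    ThreadsPerChild      64
    MaxRequestWorkers    512
</IfModule>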

Once Apache is sorted, it is my (non-VM) machine doing the nova boots that suffers. It seems that 500 nova boot commands doing actual work, instead of just timing out trying to contact a stuck web server, is not a happy way to be. Fifteen minutes later it woke up and the boots started. Shrug.

At which point select_destinations started timing out again. Are four workers not enough? I can (and did) raise it to eight, but that doesn't change the fact that 500 parallel nova boot commands get stuck if run from one machine, and at the moment I've run out of free hardware.

So instead I've spread the load a bit:

for j in {1..10}; do
  for i in {1..50}; do
    nova boot --flavor 1 --nic none --image cirros-0.3.5-x86_64-disk ${i}-foobar &
  done
  sleep 60
done

After this I get 500 ACTIVE instances in fairly short order. The processes which seem to do the most work are the cell1 conductor, interleaved with the nova-scheduler.

At this stage it makes sense to check that the placement database has the expected data (a quick way to query it directly is sketched after the list):

  • 1500 allocations: correct (three per instance, one row each for VCPU, MEMORY_MB and DISK_GB).
  • 75 resource providers: correct.
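
In this era the placement tables live in the nova_api database, so (assuming devstack's root mysql user with the DATABASE_PASSWORD set above) the counts can be pulled directly:

mysql -uroot -psecret nova_api \
  -e 'SELECT COUNT(*) FROM allocations; SELECT COUNT(*) FROM resource_providers;'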

Then it is time to delete all those servers:

openstack server list -f value -c ID | xargs openstack server delete

While that is happening it is again the cell conductor that sweats.

0 allocations in the placement db when that's done. ✔

Random Observations

Some thoughts that didn't quite fit anywhere else:

  • We know this already, but an idle compute manager is fairly chatty with the placement service. If you have 75 of them, that chat starts to add up: approximately 246 requests per minute, checking on the state of inventory and allocations. Work is already in progress to investigate this, but it should be noted that the placement service handles this traffic with aplomb. In fact at no point during the entire exercise did the placement service sweat.
  • Presumably, if you're going to have 8 conductors, you want at least 8 schedulers?
  • This stuff simply won't work without multiple scheduler workers. Raising the rpc timeout limit can make things work, but only very slowly (a sketch of that knob follows this list). This suggests that it is important for us to a) make sure that multiple workers are safe, b) change the code (as in the diff above) so that multiple workers are possible, and c) recommend doing it.
  • The placement UWSGI processes appear fairly stable memory-wise.
  • It's important to note that no traits, custom resource classes, nested resource providers, aggregates or shared resource providers are used here. Having any of those in the loop could impact the profile. We don't yet know.
  • The control plane host is working at full tilt through all of this. The compute host not much at all (because it is fake). This suggests that distributing the control plane services broadly is important. I will probably try to integrate these experiments with my placement container experiments, putting those containers on a different host. It looks like having the cell conductor elsewhere would be interesting to observe as well.
  • Doing this kind of thing is a huge learning experience and a valuable use of time (despite taking a lot of time). I wish I could remember that more often.
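
The rpc timeout mentioned above is oslo.messaging's rpc_response_timeout, which defaults to 60 seconds. Raising it in nova.conf is a workaround rather than a fix, but for experiments like these it looks something like:

[DEFAULT]
# default is 60 seconds; raising it only papers over a slow scheduler
rpc_response_timeout = 180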
