
Placement Performance Analysis

Performance has always been important to Placement. In a busy OpenStack cloud, it will receive many hits per second. Any slowness in the placement service will add to the latency present in instance creation and migration operations.

When we added support for requesting complex topologies of nested resource providers, performance took an expected hit. All along, the plan was to make it work and then make it fast. In the last few weeks members of the Placement team have been working to improve performance.

Human analysis of the code can sometimes suggest obvious areas for performance improvement, but it is also very easy to be misled. It's better to use profiling and benchmarking to get accurate measurements of which code is using the most CPU and to effectively compare different revisions of the code.

I've written two other postings about how to profile WSGI apps and analyse the results. Using those strategies we've iterated through a series of changes using the following process:

  1. profile to find the most expensive chunk of code
  2. determine if it can be improved and how
  3. change the code
  4. benchmark to see if it really helps; if it does, keep it, otherwise try something else
  5. repeat
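
As a concrete sketch of step 1, here is the shape of a cProfile-based WSGI middleware, using only the standard library (the postings mentioned above go into more depth; the class name and report limit here are illustrative):

import cProfile
import io
import pstats

class ProfilerMiddleware:
    """Profile each request and print the hottest call paths."""

    def __init__(self, app, limit=20):
        self.app = app
        self.limit = limit

    def __call__(self, environ, start_response):
        # Profile the call that produces the response iterable. Work
        # done while the body is iterated afterwards is not captured.
        profiler = cProfile.Profile()
        response = profiler.runcall(self.app, environ, start_response)
        stream = io.StringIO()
        stats = pstats.Stats(profiler, stream=stream)
        stats.sort_stats('cumulative').print_stats(self.limit)
        print(stream.getvalue())
        return response

Wrap the application (application = ProfilerMiddleware(application)) in a development deployment only; profiling every request this way is far too slow for production.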

The most recent big feature added to placement was called same_subtree. It adds support for requiring that a subset of the solution set for a request be under the same ancestor resource provider. This helps to support "affinity" within a compute host (e.g., "this FPGA is under the same NUMA node as these VCPUs").

What follows are some comparison numbers from benchmarks run with the commit that added same_subtree and recent master (between which several performance tweaks have been added). The test host is a Linux VM with 16 GB of RAM and 16 VCPUs. Placement is running standalone (without keystone), using PostgreSQL as its database and uwsgi as the web server, all on that same host, with the following startup command:

uwsgi --http :8000 --wsgi-file .tox/py37/bin/placement-api --processes 4 --threads 10


Apache Benchmark (ab) is run on an otherwise idle 8-core machine on the same local network. Headers are set with -H 'x-auth-token: admin' and -H 'openstack-api-version: placement latest' to drive the appropriate noauth2 and microversion settings.

The server is preloaded with 7000 resource providers created using the nested-perfload topology.

The URL requested is:

GET /allocation_candidates?
     resources=DISK_GB:10&
     required=COMPUTE_VOLUME_MULTI_ATTACH&
     resources_COMPUTE=VCPU:1,MEMORY_MB:256&
     required_COMPUTE=CUSTOM_FOO&
     resources_FPGA=FPGA:1&
     group_policy=none&
     same_subtree=_COMPUTE,_FPGA

The Older Code

ab -c 1 -n 10 [the rest] (1 concurrency, 10 total requests):

Requests per second:    0.40 [#/sec] (mean)
Time per request:       2472.930 [ms] (mean)

ab -c 40 -n 400 [the rest] (40 concurrency, 400 total requests):

Requests per second:    1.46 [#/sec] (mean)
Time per request:       27454.696 [ms] (mean)

(For concerned benchmark purists: throughout this process I've also been running with thousands of requests instead of tens or hundreds, to make sure that the mean values I'm getting here aren't an artifact of the short run time. They are not. Also, not reported here, but I've been doing benchmarks to compare how concurrent I can get before something explodes. As you might expect: the lighter each individual request, the wider we can go.)

The New and Improved Code

(These numbers are not quite up to date. They are from a recent master but there are at least four more performance-related patches yet to merge. I'll update when that's all in.)

ab -c 1 -n 10 [the rest] (1 concurrency, 10 total requests):

Requests per second:    0.70 [#/sec] (mean)
Time per request:       1423.695 [ms] (mean)

ab -c 40 -n 400 [the rest] (40 concurrency, 400 total requests):

Requests per second:    2.90 [#/sec] (mean)
Time per request:       13772.054 [ms] (mean)

How'd We Get There?

This is a nice improvement. It may not seem like much (over 1 second per request is rather slow in absolute terms), but there is a lot happening in the background and a lot of data being returned.

One response is a complex nested JSON object of 2583330 bytes. It has 154006 lines when sent through json_pp.

There are several classes of changes that were made. These might be applicable to other environments (like yours!):

  • If using SQLAlchemy, using the RowProxy object directly, within the persistence layer, is okay and much faster than casting to a dict or namedtuple (which have interfaces the RowProxy already provides); a sketch follows this list.

  • Use __slots__ in frequently used objects. It really does speed up attribute access time (sketched below).

  • Profiling can often reveal sets of data that are retrieved multiple times. If you can find these and build them incrementally in the context of a single request/operation it can be a big win (a caching sketch follows this list). See Add RequestWideSearchContext.summaries_by_id and Track usage info on RequestWideSearchContext for examples.

  • If you're doing membership checking against a list and you're able to make it a set, do: set lookups are constant time on average, while list lookups scan the whole list.

  • When using SQLAlchemy's in_ operator with a large number of values, an expanding bindparam can make a big difference in performance (sketched below).

  • Implement __copy__ on simple classes whose objects are copied many times in a single request (sketched below); Python's generic copy is expensive in aggregate.
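
To make the RowProxy point concrete, a minimal sketch in SQLAlchemy 1.x style; the table and column names are invented for illustration:

import sqlalchemy as sa

metadata = sa.MetaData()
# A stand-in for a real table in the persistence layer.
providers = sa.Table(
    'resource_providers', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('uuid', sa.String(36)),
)

def provider_uuids(conn):
    rows = conn.execute(sa.select([providers.c.id, providers.c.uuid]))
    # Slower: [dict(row) for row in rows] copies every column value.
    # Faster: use each RowProxy directly; it already supports key
    # and attribute access.
    return [row.uuid for row in rows]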
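
The __slots__ point, sketched; the class name echoes placement's provider summaries, but the fields are invented:

class ProviderSummary:
    """A small value object created and read many times per request.

    __slots__ removes the per-instance __dict__, which reduces
    memory use and measurably speeds up attribute access.
    """
    __slots__ = ('id', 'uuid', 'resources')

    def __init__(self, id, uuid, resources):
        self.id = id
        self.uuid = uuid
        self.resources = resources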
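
The request-wide data idea, reduced to a caching sketch; placement's real RequestWideSearchContext does considerably more than this, and the names here are illustrative:

class SearchContext:
    """Accumulates data shared across the steps of one request."""

    def __init__(self):
        self._summaries_by_id = {}

    def summaries(self, rp_ids, load):
        # load is a caller-supplied function mapping ids to summaries.
        # Only fetch the summaries this request hasn't seen yet;
        # later steps reuse what earlier steps already loaded.
        missing = [i for i in rp_ids if i not in self._summaries_by_id]
        if missing:
            self._summaries_by_id.update(load(missing))
        return {i: self._summaries_by_id[i] for i in rp_ids}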
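
The expanding bindparam point, sketched against the providers table defined above (SQLAlchemy 1.2 or later):

import sqlalchemy as sa

# With expanding=True the IN clause is rendered at execution time
# for however many values are supplied, instead of compiling a
# different statement for every list length.
stmt = sa.select([providers.c.uuid]).where(
    providers.c.id.in_(sa.bindparam('rp_ids', expanding=True)))

def uuids_for_ids(conn, rp_ids):
    return [row.uuid for row in conn.execute(stmt, {'rp_ids': list(rp_ids)})]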
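
And the __copy__ point; the class and its fields are invented, but the pattern is the one described:

import copy

class Allocation:
    """A simple object that gets copied many times in one request."""

    __slots__ = ('provider_uuid', 'resources')

    def __init__(self, provider_uuid, resources):
        self.provider_uuid = provider_uuid
        self.resources = resources

    def __copy__(self):
        # copy.copy() falls back to generic (and slow) machinery
        # unless __copy__ is defined; direct construction is much
        # cheaper in aggregate.
        new = self.__class__.__new__(self.__class__)
        new.provider_uuid = self.provider_uuid
        new.resources = self.resources
        return new

duplicate = copy.copy(Allocation('a-uuid', {'VCPU': 1}))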

Also, not from the recent profiling but from earlier work comparing non-nested setups (we've gone from 1.2 seconds for a GET /allocation_candidates?resources=DISK_GB:10,VCPU:1,MEMORY_MB:256 request against 1000 providers in early January to 0.53 seconds now), we learned the following:

  • Unless you absolutely must (perhaps because you are doing RPC), avoid using oslo versioned objects. They add a lot of overhead for type checking and coercing when getting and setting attributes.

What's Next?

I'm pretty sure there are a lot more improvements to be made. Each pass through the steps listed above exposes another avenue for investigation. Thus far we've been able to make improvements without too much awareness of the incoming request: we've not been adding conditionals or special-cases. Adding those will probably take us into a new world of opportunities.

Most of the application time is spent interacting with the database. Little has yet been done to explore tweaking the schema (things like de-normalization) or tweaking the database configuration (threads available, cache sizes, using SSDs). All of that will have impact.

And, in the end, because Placement is a simple web application over a database, the easiest way to get more performance is to make more web and database servers and load balance them. However, that's a cop-out; we should save cycles where we can. Everything is expensive at scale.
