This is the fifth in a series of posts about experimenting with OpenStack's placement service in a container. In the previous episode, I got an isolated container that persists data to itself working in kubernetes. There I noted that persisting data to itself rather takes the joy and functionality out of using kubernetes: you can't have replicas, you can't autoscale, you lose all your data.
I spent some of yesterday and today resolving those issues and report on the results here: an autoscaling placement service that persists data to a postgresql server running $elsewhere.
The code for this extends the same branch of placedock as playground 4 and continues to use minikube. I gave up trying to get things to work on linux with the kvm or kvm2 drivers. I should probably try the none driver at some point, but for now this work has been happening on a mac.

Update the next day: Tried the none driver, worked fine, but need to be aware of docker permissions.
There are two main chunks to make this work:
- Adapting the creation of the container and the creation and syncing of the database so that the database can be outside the container.
- Tweaking the kubernetes bits to get a horizontal pod autoscaler working.
Note: This isn't a tutorial on using kubernetes or placement, it's more of a trip report about the fun stuff I did over the weekend. If you try to follow this exactly for managing a placement service, it's not going to work very well. Think of this as a conversation starter. If you're interested in this stuff, let's talk. I recognize that my writing on this topic has become increasingly incoherent as I've gone off into the weeds of discovery. I will write a summary of all the playgrounds once they have reached a natural conclusion. For now, even in the weeds, they've taught me a lot.
Database Tweaking
In playground 4, the database is established when building the container: every container gets its own sqlite db sitting there ready and waiting for run time. This does not work if we want to use a remote db and we want multiple containers talking to the same db. Therefore the sync.py script, which creates the database tables, is copied into the container at build time, but not actually run until run time.
At run time, it gets the database connection URL from an environment variable, DB_STRING. If it's not set, a default is used. We can define the value in the kubernetes deployment.yaml.
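For a quick sanity check outside of kubernetes, the same mechanism can be exercised with docker directly once the startup script described below is in place. This is just a sketch; the connection URL is an arbitrary example, and the -p 8080:80 assumes the uwsgi config listens on port 80, as the containerPort in deployment.yaml suggests:
# Run the image with an explicit DB_STRING (example URL, adjust to taste).
docker run -d -p 8080:80 \
    -e DB_STRING="postgresql+psycopg2://someuser@db.example.com/placement" \
    placedock:1.0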
But wait, the container had only been running the uwsgi process. How do we get it to use the environment variable, run sync.py, and only once that's done, start up the uwsgi server? Turns out we can replace the existing docker CMD with a script that does all that stuff. In the Dockerfile the end is adjusted to:
ADD startup.sh /
CMD ["sh", "-c", "/startup.sh"]
and startup.sh is:
DB_STRING=${DB_STRING:-sqlite:////cats.db}
# Do substitutions in the template.
sed -e "s,{DB_CONNECTION},$DB_STRING," < /etc/nova/nova.conf.tmp >
/etc/nova/nova.conf
# establish the database
python3 /sync.py --config-file /etc/nova/nova.conf
# run the web server
/usr/sbin/uwsgi --ini /placement-uwsgi.ini
It would surprise me not one iota if there are cleaner ways than that, but this way worked.
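For completeness: after changing the Dockerfile the image needs rebuilding, and for minikube to find it without a registry the easy path is to build against minikube's docker daemon. Roughly (the tag matches what deployment.yaml expects):
# Point the docker client at minikube's daemon, then rebuild the image there.
eval $(minikube docker-env)
docker build -t placedock:1.0 .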
The result of this is that each time a container starts, it connects to the database described in $DB_STRING and tries to create and update the tables. If they are already created, it's happy. If something else is in the midst of versioning the database, an exception is caught and ignored.
I had a postgresql server running on a nearby VM, so I used that. For the time being I simply added the necessary connection drivers to the container at build time, but if it was required to be super dynamic, then the python driver code could be installed at runtime. Being super dynamic is not really in scope for my experiments.
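For the record, the driver install is nothing fancy. A sketch of the build-time step, assuming pip3 is available in the image (a very slim base image may also need build dependencies for the C extension):
# Install the postgresql driver so a postgresql+psycopg2 connection URL works.
pip3 install psycopg2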
After doing all that, I adjusted my deployment to have 4 replicas and made sure things worked. And they did. Onward.
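If you want to repeat that sanity check against an existing deployment, kubectl can do it directly (the deployment name here is the one used in the yaml below):
# Ask for 4 replicas and watch the pods appear.
kubectl scale deployment placement-deployment --replicas=4
kubectl get pods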
Kubernetes Tweaking
Having 4 containers, either doing nothing or being overloaded, is not really taking advantage of some of the best stuff about kubernetes. What we really want, to be both more useful and more cool, is to create and destroy the placement pods as needed. This is done with a Horizontal Pod Autoscaler. You tell it the minimum and maximum number of pods you're willing to accept and a metric for determining the percent of resource consumption that is the boundary between OK and overloaded.
Here's the autoscaler.yaml that works for me:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: placement-deployment
  namespace: default
spec:
  maxReplicas: 10
  minReplicas: 1
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: placement-deployment
  targetCPUUtilizationPercentage: 50
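Applying it and checking on it is the usual routine (kubectl autoscale deployment placement-deployment --min=1 --max=10 --cpu-percent=50 would create much the same thing without the yaml):
# Create the autoscaler and confirm it has found its target.
kubectl apply -f autoscaler.yaml
kubectl get hpa placement-deployment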
While this is a relatively simple concept and the tooling is straightforward, it took me quite some time to get this to work. I've been using minikube 0.25.0 (the latest release as of this writing) but running it at the maximum version of kubernetes that it supports (v1.9.0). This leads to some conflicts.
In older versions, the expected way to manage autoscaling and metrics is to use heapster. Minikube includes an addon for this, but as installed it does not present a "rest" API for the information. That was okay some unclear number of kubernetes versions back, but is not with v1.9.0.
Modern kubernetes has the Resource Metrics API. heapster can support that, but only if it is started with a particular flag. The other option is to start the kube-controller-manager with a particular flag so that autoscaling doesn't use the metrics API. I preferred to stay modern.
Unreleased minikube adds the metrics server as an addon. I was able to copy that code into my minikube setup and establish the service.
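If your minikube is new enough to ship the addon you shouldn't need the copying step; something like the following ought to do, with the caveat that kubectl top only returns data once the metrics API is actually serving:
# Enable the metrics-server addon and confirm metrics are flowing.
minikube addons enable metrics-server
kubectl top pods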
Next I discovered that my placement deployment needed to declare a resource request in order for the autoscaling to work. In hindsight this is obvious. The autoscaling is done based on a percentage of the resources the deployment requests for itself. For instance, if you say that things should be scaled up when resource usage hits 50%, kubernetes says "50% of what?".
In my case that meant adjusting the containers section of deployment.yaml:
containers:
- name: placement
  image: placedock:1.0
  env:
  - name: DB_STRING
    value: postgresql+psycopg2://cdent@192.168.1.76/placement?client_encoding=utf8
  ports:
  - containerPort: 80
  # We must set resources for scaling to work.
  resources:
    requests:
      cpu: 250m
That last stanza is saying "we request 1/4 core worth of cpu". So now the autoscaler is saying: when cpu utilization hits 50% (of 1/4 core), scale.
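kubectl describe is a good way to see what the autoscaler is doing with that number:
# Show current vs target cpu utilization and any recent scaling events.
kubectl describe hpa placement-deployment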
Note: If you're following along and using minikube with docker-machine, keep in mind that the default "machine" is pretty small, so you need to keep the resource request for each individual container pretty small or you will soon overwhelm the machine. I had cpu above set to 1000m initially. Starting new pods was slow enough that they never became ready.
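One way around that, if the host has the headroom, is to give the minikube machine more to work with when it is first created (the numbers here are arbitrary):
# Start the minikube VM with more cpu and memory (applies at creation time).
minikube start --cpus 4 --memory 4096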
When I finally got this working it was fun to watch (literally, with watch kubectl get hpa). If you're starting from scratch it can take a while for everything to warm up and be running, but eventually you'll see low usage and low replicas (this output is wide, you may need to scroll):
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
placement-deployment Deployment/placement-deployment 0% / 50% 1 10 1 1m
To load up the deployment and make it scale (assuming there's a bit of data in the database, like the one resource provider that the gabbi-run in bootstrap.sh will create) I did this:
export PLACEMENT=$(minikube service placement-deployment --url)
ab -n 100000 -c 100 -H 'x-auth-token: admin' $PLACEMENT/resource_providers
After a while the usage rises above the 50% target and more replicas are created.
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
placement-deployment Deployment/placement-deployment 109% / 50% 1 10 3 8m
When the ab was done and resource usage settled, all but one of the containers were terminated. That one was working as expected:
curl -H 'x-auth-token: admin' $PLACEMENT/resource_providers |json_pp
Even though I know that's exactly how it is supposed to work, it's still pretty cool. What next? I need to add forbidden traits support to the placement service, but after that I will likely revisit this stuff, either for more scale fun or the cat management mentioned in playground 4.
As always, please leave a comment or otherwise contact me if you have questions, there's something I've done weirdly, or you're doing similar stuff and there's an opportunity for us to collaborate.