This is the fifth in a series of posts about experimenting with OpenStack's placement service in a container. In the previous episode, I got an isolated container that persists data to itself working in kubernetes. There I noted that persisting data to itself rather takes the joy and functionality out of using kubernetes: you can't have replicas, you can't autoscale, you lose all your data.
I spent some of yesterday and today resolving those issues and report on the results here: an autoscaling placement service that persists data to a postgresql server running $elsewhere.
The code for this extends the same branch of placedock as playground 4 and continues to use minikube. I gave up trying to get things to work on linux with the kvm or kvm2 drivers. I should probably try the none driver at some point, but for now this work has been happening on a mac.

Update the next day: Tried the none driver, worked fine, but need to be aware of docker permissions.
There are two main chunks to make this work:
- Adapting the creation of the container and the creation and syncing of the database so that the database can be outside the container.
- Tweaking the kubernetes bits to get a horizontal pod autoscaler working.
Note: This isn't a tutorial on using kubernetes or placement, it's more of a trip report about the fun stuff I did over the weekend. If you try to follow this exactly for managing a placement service, it's not going to work very well. Think of this as a conversation starter. If you're interested in this stuff, let's talk. I recognize that my writing on this topic has become increasingly incoherent as I've gone off into the weeds of discovery. I will write a summary of all the playgrounds once they have reached a natural conclusion. For now, even in the weeds, they've taught me a lot.
Database Tweaking
In playground 4, the database is established when building the container: every container gets its own sqlite db sitting there ready and waiting for run time. This does not work if we want to use a remote db and we want multiple containers talking to the same db. Therefore the sync.py script, which creates the database tables, is copied into the container at build time, but not actually run until run time.
At run time, it gets the database connection URL from an environment variable, DB_STRING. If it's not set, a default is used. We can define the value in the kubernetes deployment.yaml.
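For a quick sanity check outside of kubernetes, the same mechanism can be exercised with docker directly once the startup script described below is in place. This is just a sketch; the connection URL is an arbitrary example, and the -p 8080:80 assumes the uwsgi config listens on port 80, as the containerPort in deployment.yaml suggests:
# Run the image with an explicit DB_STRING (example URL, adjust to taste).
docker run -d -p 8080:80 \
    -e DB_STRING="postgresql+psycopg2://someuser@db.example.com/placement" \
    placedock:1.0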
But wait, the container had only been running the uwsgi process. How do we get it to use the environment variable, run sync.py, and only once that's done, start up the uwsgi server? Turns out we can replace the existing docker CMD with a script that does all that stuff. In the Dockerfile the end is adjusted to:
ADD startup.sh /
CMD ["sh", "-c", "/startup.sh"]
and startup.sh is:
DB_STRING=${DB_STRING:-sqlite:////cats.db}
# Do substitutions in the template.
sed -e "s,{DB_CONNECTION},$DB_STRING," < /etc/nova/nova.conf.tmp >
/etc/nova/nova.conf
# establish the database
python3 /sync.py --config-file /etc/nova/nova.conf
# run the web server
/usr/sbin/uwsgi --ini /placement-uwsgi.ini
It would surprise me not one iota if there are cleaner ways than that, but this way worked.
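For completeness: after changing the Dockerfile the image needs rebuilding, and for minikube to find it without a registry the easy path is to build against minikube's docker daemon. Roughly (the tag matches what deployment.yaml expects):
# Point the docker client at minikube's daemon, then rebuild the image there.
eval $(minikube docker-env)
docker build -t placedock:1.0 .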
The result of this is that each time a container starts, it connects to the database described in $DB_STRING and tries to create and update the tables. If they are already created, it's happy. If something else is in the midst of versioning the database, an exception is caught and ignored.
I had a postgresql server running on a nearby VM, so I used that. For the time being I simply added the necessary connection drivers to the container at build time, but if it was required to be super dynamic, then the python driver code could be installed at runtime. Being super dynamic is not really in scope for my experiments.
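For the record, the driver install is nothing fancy. A sketch of the build-time step, assuming pip3 is available in the image (a very slim base image may also need build dependencies for the C extension):
# Install the postgresql driver so a postgresql+psycopg2 connection URL works.
pip3 install psycopg2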
After doing all that, I adjusted my deployment to have 4 replicas and made sure things worked. And they did. Onward.
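If you want to repeat that sanity check against an existing deployment, kubectl can do it directly (the deployment name here is the one used in the yaml below):
# Ask for 4 replicas and watch the pods appear.
kubectl scale deployment placement-deployment --replicas=4
kubectl get pods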
Kubernetes Tweaking
Having 4 containers, either doing nothing or being overloaded, is not really taking advantage of some of the best stuff about kubernetes. What we really want, to be both more useful and more cool, is to create and destroy the placement pods as needed. This is done with a Horizontal Pod Autoscaler. You tell it the minimum and maximum number of pods you're willing to accept and a metric for determining the percent of resource consumption that is the boundary between OK and overloaded.
Here's the autoscaler.yaml that works for me:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: placement-deployment
  namespace: default
spec:
  maxReplicas: 10
  minReplicas: 1
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: placement-deployment
  targetCPUUtilizationPercentage: 50
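Applying it and checking on it is the usual routine (kubectl autoscale deployment placement-deployment --min=1 --max=10 --cpu-percent=50 would create much the same thing without the yaml):
# Create the autoscaler and confirm it has found its target.
kubectl apply -f autoscaler.yaml
kubectl get hpa placement-deployment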
While this is a relatively simple concept and the tooling is straightforward, it took me quite some time to get this to work. I've been using minikube 0.25.0 (the latest release as of this writing) but running it at the maximum version of kubernetes that it supports (v1.9.0). This leads to some conflicts.
In older versions, the expected way to manage autoscaling and metrics is to use heapster. Minikube includes an addon for this, but as installed it does not present a "rest" API for the information. That was okay some unclear number of kubernetes versions back, but is not with v1.9.0.
Modern kubernetes has the Resource Metrics API. heapster can support that, but only if it is started with a particular flag. The other option is to start the kube-controller-manager with a particular flag so that autoscaling doesn't use the metrics API. I preferred to stay modern.
Unreleased minikube adds the metrics server as an addon. I was able to copy that code into my minikube setup and establish the service.
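If your minikube is new enough to ship the addon you shouldn't need the copying step; something like the following ought to do, with the caveat that kubectl top only returns data once the metrics API is actually serving:
# Enable the metrics-server addon and confirm metrics are flowing.
minikube addons enable metrics-server
kubectl top pods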
Next I discovered that my placement deployment needed to declare a resource request in order for the autoscaling to work. In hindsight this is obvious. The autoscaling is done based on a percentage of the resources the deployment requests for itself. For instance, if you say that things should be scaled up when resource usage hits 50%, kubernetes says "50% of what?".
In my case that meant adjusting the containers section of deployment.yaml:
containers:
- name: placement
  image: placedock:1.0
  env:
  - name: DB_STRING
    value: postgresql+psycopg2://cdent@192.168.1.76/placement?client_encoding=utf8
  ports:
  - containerPort: 80
  # We must set resources for scaling to work.
  resources:
    requests:
      cpu: 250m
That last stanza is saying "we request 1/4 core worth of cpu". So now the autoscaler is saying: when cpu utilization hits 50% (of 1/4 core), scale.
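kubectl describe is a good way to see what the autoscaler is doing with that number:
# Show current vs target cpu utilization and any recent scaling events.
kubectl describe hpa placement-deployment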
Note: If you're following along and using minikube with docker-machine, keep in mind that the default "machine" is pretty small, so you need to keep the resource request for each individual container pretty small or you will soon overwhelm the machine. I had cpu above set to 1000m initially. Starting new pods was slow enough that they never became ready.
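One way around that, if the host has the headroom, is to give the minikube machine more to work with when it is first created (the numbers here are arbitrary):
# Start the minikube VM with more cpu and memory (applies at creation time).
minikube start --cpus 4 --memory 4096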
When I finally got this working it was fun to watch (literally, with watch kubectl get hpa). If you're starting from scratch it can take a while for everything to warm up and be running, but eventually you'll see low usage and low replicas (this output is wide, you may need to scroll):
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
placement-deployment Deployment/placement-deployment 0% / 50% 1 10 1 1m
To load up the deployment and make it scale (assuming there's a bit of data in the database, like the one resource provider that the gabbi-run in bootstrap.sh will create) I did this:
export PLACEMENT=$(minikube service placement-deployment --url)
ab -n 100000 -c 100 -H 'x-auth-token: admin' $PLACEMENT/resource_providers
After a while the usage rises above the 50% target and more replicas are created.
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
placement-deployment Deployment/placement-deployment 109% / 50% 1 10 3 8m
When the ab was done and resource usage settled, all but one of the containers were terminated. That one was working as expected:
curl -H 'x-auth-token: admin' $PLACEMENT/resource_providers |json_pp
Even though I know that's exactly how it is supposed to work, it's still pretty cool. What next? I need to add forbidden traits support to the placement service, but after that I will likely revisit this stuff, either for more scale fun or the cat management mentioned in playground 4.
As always, please leave a comment or otherwise contact me if you have questions, there's something I've done weirdly, or you're doing similar stuff and there's an opportunity for us to collaborate.