OpenStack Stein PTG

Posted on: Tue 18 September 2018

Category: openstack – Tags: placement, tc, opensource

For the TL;DR see the end.

The OpenStack PTG finished yesterday. For me it is six days of continuous meetings and discussions and one of the busiest and most stressful events in my calendar. I always get ill. There are too few opportunities to refresh my introversion. The negotiating process is filled with hurdles. I often have a sense of not being fully authentic. I have a lot of sympathy for people who come away from the event making tweets like:

People use laughing interrupting my talk. People explans to me they aren’t attacking my personal, it is for my company. But It’s rude. Another sucks #PTG, i don’t want to back again. But I have a job, and fuck, they move future summit to Denver again.
— Alex Xu (@alex_xuhj) September 15, 2018

I wasn't in the nova room when that happened, so I don't know the full context, but whatever it was, it sounds wrong.

For some people the PTG is a grand time, for some it is challenging and difficult. For most it is a mix of both. Telling it how it is can help to make it better, even if it is uncomfortable.

There was a great deal of discussion about placement being extracted from nova. In the weeks leading up to the PTG there was quite a lot of traffic, some of it summarized in a recent TC Report. Because I've been involved in a lot of that discussion I got to hear a lot of jokes about placement this past week. I'm sure most of them are meant well, but when the process of extraction has been so long, and sometimes so frustrating, the jokes are tiresome and another load on my already taxed brain. Much of the time they just made me want to punch someone or something.

I'd like to thank Eric Fried, Balazs Gibizer, Ed Leafe, Tetsuro Nakamura, and Matt Riedemann for doing a huge amount of work the past few weeks to get the extracted placement to a state where it has a working collection of tests and creates an operating service. As a team, we've made progress on a thing people have been saying they want for years: making nova smaller and decomposing it into smaller parts. Let's make it a trend.

The PTG was a long week, and I want to remember what happened, so I'm going to write down my experience of the event. This will not be the same as the experiences other people have had, but I hope it is useful.

On a long list of things I take for granted but forget that other people do not: If some piece of info doesn’t have an accessible, discoverable and eventually well-known URL it is none of True, Useful, Actionable, or Real.
— scandent (@anticdent) September 14, 2018

This was written partially on Saturday while I was still in Denver, and the rest on Tuesday after I returned. On Saturday I was already starting to forget details and now on Tuesday it's all fading away.

Sunday

Sunday afternoon we held the first of two Technical Committee sessions (the other was Friday). The agenda had a few different topics. The big one was reviewing the existing commitments that TC members have. No surprise: Most people are way over-extended and many tasks, both personal and organisational, fall on the floor. Based on that information we were able to remove several tasks from the TC Tracker: items that will never get done or should not be the TC's responsibility.

We also talked about needing to manage applications to be official projects with a little more care and attention so that the process is less open-ended than it often is. To help with this there will be some up-front time limits on applications and we'll ensure that each application has a shepherd from the TC from earlier in the process.

Alan Clark, from the Foundation board, joined in on the conversation for a while. We discussed how to make the joint leadership meetings more effective and what the board needs from the TC: Information about the exciting and interesting things that are in progress in the technical community. To some extent this is needed to help the board members understand why they are bothering to participate and while there is always plenty of cool and interesting stuff going on, it is not always obvious.

This is useful advice as it helps to focus the purpose of the meetings, which sometimes have a sense of "why are we here?"

Doug produced a summary of his notes from both days.

Lance Bragstad also made a report.

Monday

api-sig

The API-SIG had a room for all of Monday. At the first PTG in Atlanta we had two days, and used them both. Now we struggled to use all of Monday. In part this is because the great microversion battles have lost their energy, but also the SIG is currently going through a bit of a lull while other topics have more relevance.

We talked about most of the issues on the etherpad and kept notes on the discussion.

One interesting question was whether the SIG was interested in being a home for people interested in distributed APIs that are not based on HTTP. The answer is: "Sure, of course, but those people need to show up."

(People needing to show up was a theme throughout the week.)

Prior to the PTG we tried to kill off the common healthcheck middleware due to lack of attention. This threat drew out some interested parties and brought it back to life.

cyborg

Right after lunch Ed Leafe (the other API-SIG "leader" who was able to attend the PTG) and I were pulled away to attend a discussion about how cyborg interacts with nova and placement.

Tuesday

blazar

Tuesday morning there was a gathering of blazar, nova and placement people to figure out the best ways for blazar to interact. There are some notes on the related etherpad.

Two main outcomes came from that. The first is that it ought to be possible to satisfy many of the desired features by implementing a "not member of" functionality in placement, which allows a caller to say "I'll accept resources that are not in this aggregate". A spec for that has been started.

The second is that the discussion made it clear that the existing member_of functionality is not entirely correct for nested resource providers. The current functionality requires all the participants in a nested tree to be in an aggregate to show up in results. We decided this is not what we want. A bug was created.
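To make the aggregate problem concrete, here is a toy model in plain Python. It is only a sketch: the Provider class, the function names and the "reserved" aggregate are made up for illustration and are not placement's data model or API. The root_member_of rule is an assumed alternative, not necessarily what the bug or the spec will settle on.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Provider:
    """Toy resource provider; not placement's real data model."""
    name: str
    aggregates: set = field(default_factory=set)
    parent: Optional["Provider"] = None

    def root(self):
        # Walk up the tree until we reach the provider with no parent.
        node = self
        while node.parent is not None:
            node = node.parent
        return node


def group_by_root(providers):
    """Map each root provider name to the providers in its tree."""
    trees = {}
    for p in providers:
        trees.setdefault(p.root().name, []).append(p)
    return trees


def strict_member_of(providers, agg):
    """A tree matches only if every provider in it is in the aggregate."""
    return sorted(root for root, members in group_by_root(providers).items()
                  if all(agg in m.aggregates for m in members))


def root_member_of(providers, agg):
    """Assumed alternative: the root's aggregates speak for the whole tree."""
    return sorted({p.root().name for p in providers
                   if agg in p.root().aggregates})


def not_member_of(providers, agg):
    """Hypothetical "not member of": trees whose root is outside the aggregate."""
    return sorted({p.root().name for p in providers
                   if agg not in p.root().aggregates})


# compute1 is in the "reserved" aggregate, its nested child numa0 is not.
compute1 = Provider("compute1", {"reserved"})
numa0 = Provider("numa0", parent=compute1)
compute2 = Provider("compute2")
providers = [compute1, numa0, compute2]

strict = strict_member_of(providers, "reserved")
by_root = root_member_of(providers, "reserved")
excluded = not_member_of(providers, "reserved")
```

With the strict rule compute1 drops out of the results because its child was never added to the aggregate, which is the surprising behaviour described above. With the root-based rule it shows up, and the "not member of" query returns only compute2.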

placement governance

Right before lunch there was an impromptu gathering of the various people involved in placement to create a list of technical milestones that need to be reached to allow placement to be an independent project. A good summary of that was posted to the mailing list.

It was a useful thing to do, the plan is solid, but nobody seemed to be in the right frame of mind to get into any of the personal, social, and political issues that have caused so much tension, either locally in the past few weeks, or in the last two years.

cinder

Later in the afternoon there was a meeting with cinder to see if there was a way that placement could be useful to cinder. It turns out there is a bit of a conceptual mismatch between placement and cinder.

Placement wants to represent a hard measurement of resources as they actually are while cinder, especially when "thin provisioning" is being used, needs to be more flexible. Representing that flexibility to placement in a way that is "safe" is difficult.

Dynamic inventory management is considered either too costly or too racy. I'm not certain this has to be the case. Architecturally, the system ought to be able to cope. There are some risks, but if we wanted to accommodate the risk it might be manageable and would make placement useful to more types of resources.
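A rough illustration of the mismatch, as a sketch rather than anything proposed in the room: placement treats the capacity of an inventory as (total - reserved) * allocation_ratio, so the only built-in flexibility is a ratio somebody has to pick and keep correct. The Inventory class below is a simplified stand-in that I made up for this post, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Inventory:
    """Simplified stand-in for one inventory of a resource class."""
    total: int
    reserved: int = 0
    allocation_ratio: float = 1.0
    used: int = 0

    @property
    def capacity(self):
        # Capacity as placement computes it for an inventory.
        return int((self.total - self.reserved) * self.allocation_ratio)

    def can_allocate(self, amount):
        return self.used + amount <= self.capacity


# A 1000 GB pool with 100 GB held back, modelled as a hard measurement.
thick = Inventory(total=1000, reserved=100)

# The same pool with thin provisioning approximated as 2x overcommit.
# The ratio is a guess; if real usage grows faster than expected,
# somebody has to update the inventory, which is the racy part.
thin = Inventory(total=1000, reserved=100, allocation_ratio=2.0)

thick_ok = thick.can_allocate(950)
thin_ok = thin.can_allocate(950)
```

The 950 GB request is refused by the hard measurement and accepted by the overcommitted one. Whether that acceptance is "safe" depends entirely on how well the ratio tracks what the backend can really deliver.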

Wednesday

nova retrospective

Wednesday morning started with a nova cycle retrospective. There was limited attention to that etherpad before the event, but once we got rolling in person it turned out to be a pretty important topic. The main takeaway, for me, was that when we have to change priorities because of unforeseen events, we must trim the list of existing priorities to remove something. It was surprisingly difficult to get people to agree that this was necessary. Time and resources are finite. What other conclusion can we make?

placement topics

Then began a multi-day effort to cover all of the placement topics on the nova etherpad. A lot of this was done Wednesday, but in gaps on Thursday and Friday people returned to placement. Rather than trying to cover each day's topics on the day it happened, all the discussion is aggregated here in this section.

Interestingly (at least to me), during these discussions I had a very clear moment explaining why I often feel culturally alienated while working in the OpenStack community. While trying to argue that we should wait to do something, I used the term YAGNI. Few people in the room were familiar with it, and once it was explained, few people appeared to be sympathetic to the concept. In my experience this is a fundamental concept and driver of good software development.

This was then followed by a lack of sympathy for wanting or needing to define when a project can be considered "done". This too is something I find critical to software development: What are we striving for? How will we know when we get there? When do we get to stop? The reaction in the room seemed to be something along the lines of "never" and "why would we want to?".

These two experiences combined may explain why my experience of OpenStack development, especially in nova, feels so unconstrained and cancerous: There's a desire to satisfy everything, in advance, and to never be done. This is exactly the opposite of what I want: narrow what we satisfy, do only what is required, now, and figure out a way to reach done.

I suspect the reality of things is much less dramatic, but in the moment it felt that way and helped me understand things more.

Once through that, I felt like we managed to figure out some things that we need to do:

An idempotent upgrade script that makes it easy for a deployment to move placement data to a new home. Dan has started something.
Long term goals include managing affinity in placement, and enabling a form of service sharding so that one placement can manage multiple openstacks.
GET /allocation_candidates needs, eventually, an in_tree query parameter to allow the caller to say "give me candidates from these potential trees".
Highest priority at this time is getting nested resource providers working on the nova side, in nova-scheduler, in the resource tracker and in the virt drivers.
As other services increase their use of placement, and we have more diverse types of hardware being represented as resource providers, we badly need documentation that explains best practices for using resource classes and traits to model that hardware.
We need to create an os-resource-classes library, akin to os-traits, to contain the standard resource classes and manage the existing static identifiers associated with those classes. Since naming things is the hardest problem we spent a long time trying to figure out how to name such a thing. There are issues to be resolved with not causing pain for packagers and deployers.

While we figure that out I went ahead and created a cookiecutter-based os-resource-classes.
Getting shared providers working on the nova side is not an immediate concern in the face of the attention required to finish placement extraction and get nested providers working. However Tushar and his colleagues may devote some time to it.
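On the first item in that list, "idempotent" is the important word: running the script a second time should change nothing. Below is a minimal sketch of that property only. It uses sqlite3 as a stand-in database, and the table and column names are hypothetical; it is not the script Dan started and makes no claim about the real placement schema.

```python
import sqlite3


def migrate(src, dst, table="resource_providers"):
    """Copy rows from src to dst; safe to run any number of times."""
    # Creating the table only if it is missing keeps reruns harmless.
    dst.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(uuid TEXT PRIMARY KEY, name TEXT)"
    )
    rows = src.execute(f"SELECT uuid, name FROM {table}").fetchall()
    # INSERT OR IGNORE skips rows whose primary key already exists.
    dst.executemany(
        f"INSERT OR IGNORE INTO {table} (uuid, name) VALUES (?, ?)", rows
    )
    dst.commit()
    return dst.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# The old home of the data, with two invented rows.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE resource_providers (uuid TEXT PRIMARY KEY, name TEXT)")
src.executemany(
    "INSERT INTO resource_providers VALUES (?, ?)",
    [("uuid-1", "compute1"), ("uuid-2", "compute2")],
)
src.commit()

# The new home starts out empty.
dst = sqlite3.connect(":memory:")

first_run = migrate(src, dst)
second_run = migrate(src, dst)
```

Both runs leave two rows in the destination. A real script would also have to deal with schema versions, foreign keys and a source that is still being written to, none of which is attempted here.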

Thursday

There was continued discussion of placement on Thursday, mostly noted above. Towards the end of this day I was running out of attention and working more on making minor changes to the placement repo. The energy required to give real attention to the room is so high, especially when it is couched in making sure I don't say something that's going to be taken the wrong way. After a while it is easier and a more productive use of time to give attention to something else. The people who are able to stick through a solid three days in the nova room are made of sterner stuff than me.

Friday

On Friday it was back to TC-related discussions, following the agenda on the etherpad. As stated above, Doug made a good summary email.

We started off by reviewing team health. Lots of different issues but a common thread is that many teams are suffering from lack of contributors. Some teams report burn out in their core reviewers. In the room we discussed why we sometimes only find out about issues in teams late: Why aren't project team members seeking out the assistance of the TC sooner? I suggested that perhaps there's insufficient evidence that the TC is empowered to resolve things.

Even if that's the case (we did not resolve that question), reaching out to the TC sooner rather than later is going to be beneficial for all involved as it increases awareness and can help direct people to the right resources.

There was a great deal of discussion in the room about making OpenStack (including the TC) more accessible to contributors from China. This resulted in a proposed resolution for a tc role in global reachout.

There was also a lot of discussion about strategies for increasing traction for SIGs, such as the public cloud sig. Some of this reflected the orchestration thread that Matt Riedemann started. During the discussion another resolution was proposed to Add long term goal proposal to community wide goals.

Discussion of the pending tech vision was around clarifying what the vision is for and making sure we publicize it well enough to get real feedback. Two main reasons to have the vision are to help drive the decision making process when evaluating projects that wish to be "official" and when selecting community wide goals. These are both important things but I think the main thing a vision we've all agreed to can provide is a guide for any decision in OpenStack. If we are able to point at a thing as the overarching goal of all of OpenStack, it becomes easier to say "no" to things that are clearly out of scope and thus have more energy for the things to which we clearly say "yes".

Throughout the discussion of project health and gaps in contribution I kept thinking it's important that we make gaps more visible, not come up with ways to do more with the resources we have. Many many people are expressing that they are overextended. We cannot take on more and remain healthy. If something is important enough people will come. If they don't come, the needs are either not important or not visible enough. The role of the TC should be to make things visible.

Feature-wise we need to be more reactive and enabling: more "we will make space for you to do this thing" and less "we're listening and will do this thing for you".

This includes allowing things to be less than perfect so their brokenness operates as an attracting influence. As a community we've been pre-disposed to thinking that if we don't make things proper from the start people will ignore us. I think we need to have some confidence that we are making useful stuff and make room for people to come and help us, for the sake of helping themselves.

What Now?

Based on what I've been able to read from various members of the community in blog posts, tweets, posts to the os-dev mailing list, it sounds like it was a pretty good week: We made some plans and figured out solutions to some difficult problems. The trick now is to follow through and focus on those things while avoiding adding yet more to the list.

For me, however, it is hard to say that it is worth it. I do not come away from these things motivated and focused. I'm overwhelmed by the vast array of things we seem to promise and concerned by the unaddressed disconnects and differences in our perceptions and actions. I'm sure once I've recovered I'll be back to making steady progress, but for now if I'm "telling it how it is" I have to wonder if the situation would be any different if I hadn't gone, or if none of us had gone.
