About russellbryant

I'm an open source software engineer working for Red Hat on the OpenStack project.

A new Nova service: nova-conductor

The Grizzly release of OpenStack Nova will have a new service, nova-conductor. The service was discussed on the openstack-dev list and was merged today. For now there is a configuration option that makes the new service optional, but it is possible that by the time Grizzly is released, it will be required.

One of the efforts that started during Folsom development and is scheduled to be completed in Grizzly is no-db-compute. In short, this effort is to remove direct database access from the nova-compute service. There are two main reasons we are doing this. Compute nodes are the least trusted part of a nova deployment, so removing direct database access is a step toward reducing the potential impact of a compromised compute node. The other benefit of no-db-compute is for upgrades. Direct database access complicates the ability to do live rolling upgrades. We’re working toward eventually making that possible, and this is a part of that.

All of the nova services use a messaging system (usually AMQP based) to communicate with each other. Many of the database accesses in nova-compute can be (and have been) removed by just sending more data in the initial message sent to nova-compute. However, that doesn’t apply to everything. That’s where the new service, nova-conductor, comes in.

The nova-conductor service is key to completing no-db-compute. Conceptually, it implements a new layer on top of nova-compute. It should *not* be deployed on compute nodes, or else the security benefits of removing database access from nova-compute will be negated. Just like other nova services such as nova-api or nova-scheduler, it can be scaled horizontally. You can run multiple instances of nova-conductor on different machines as needed for scaling purposes.

The methods exposed by nova-conductor will initially be fairly simple ones used by nova-compute to offload its database operations. Places where nova-compute previously accessed the database directly will now talk to nova-conductor instead. However, we have plans in the medium to long term to move more and more of what is currently in nova-compute up to the nova-conductor layer. The compute service will start to look like a less intelligent slave service to nova-conductor. The conductor service will implement long-running, complex operations, ensuring forward progress and graceful error handling. This will be especially beneficial for operations that cross multiple compute nodes, such as migrations or resizes.
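
To make that concrete, here is a rough sketch of what the compute side of one of these calls might look like. This is illustrative only; the class, method, and topic names are placeholders, not the actual nova code.

# Illustrative sketch only -- the names here are placeholders, not the
# actual nova code.

class ConductorAPI(object):
    """Client used by nova-compute to offload database work over rpc."""

    def __init__(self, rpc, topic='conductor'):
        self.rpc = rpc        # the rpc module/object used for messaging
        self.topic = topic    # queue that the nova-conductor service consumes

    def instance_update(self, context, instance_uuid, updates):
        # Instead of calling the database API directly, send a message
        # and let nova-conductor do the database work.
        msg = {'method': 'instance_update',
               'args': {'instance_uuid': instance_uuid,
                        'updates': updates}}
        return self.rpc.call(context, self.topic, msg)

The point is simply that nova-compute's remaining database reads and writes become messages, so the compute node itself never needs database access.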

If you have any comments, questions, or suggestions for nova-conductor or the no-db-compute effort in general, please feel free to bring it up on the openstack-dev list.

OpenStack Design Summit and an Eye on Folsom

I just spent a week in San Francisco at the OpenStack design summit and conference. It was quite an amazing week and I’m really looking forward to the Folsom development cycle.  You can find notes from various sessions held at the design summit on the OpenStack wiki.

Essex was the first release that I contributed to.  One thing I did was add Qpid support to both Nova and Glance as an alternative to using RabbitMQ.  Beyond that, I primarily worked on vulnerability management and other bug fixing.  For Folsom, I’m planning on working on some more improvements involving the inter-service messaging layer, also referred to as the rpc API, in Nova.

1) Moving the rpc API to openstack-common

The rpc API in Nova is used for private communication between nova services. As an example, when a tenant requests that a new virtual machine instance be created (either via the EC2 API or the OpenStack compute REST API), the nova-api service sends a message to the nova-scheduler service via the rpc API.  The nova-scheduler service decides where the instance is going to live and then sends a message to that compute node’s nova-compute service via the rpc API.
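
Roughly, that flow looks like this in code. This is a simplified sketch; the topic and method names are illustrative rather than the exact nova interfaces.

# Simplified sketch of the pattern -- not the exact nova interfaces.

def request_instance(rpc, context, request_spec):
    # nova-api casts an asynchronous message at the scheduler's topic.
    rpc.cast(context, 'scheduler',
             {'method': 'run_instance',
              'args': {'request_spec': request_spec}})

def schedule(rpc, context, request_spec):
    # nova-scheduler picks a host (the real decision logic is elided here)
    # and then casts a message to that compute node's topic.
    host = 'compute1'   # placeholder for the scheduler's placement decision
    rpc.cast(context, 'compute.%s' % host,
             {'method': 'run_instance',
              'args': {'request_spec': request_spec}})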

The other usage of the rpc API in Nova has been for notifications.  Notifications are asynchronous messages about events that happen within the system.  They can be used for monitoring and billing, among other things.  Strictly speaking, notifications aren’t directly tied to rpc.  A Notifier is an abstraction, of which using rpc is one of the implementations.  Glance also has notifications, including a set of Notifier implementations.  The code was the same at one point but has diverged quite a bit since.
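
Conceptually, a Notifier is just a small interface, with rpc being one of several ways to implement it. The sketch below shows the idea; it is not the actual nova or glance classes.

# Simplified sketch of the Notifier idea -- not the actual nova/glance code.
import logging

class Notifier(object):
    """Deliver an event notification somewhere."""
    def notify(self, publisher_id, event_type, priority, payload):
        raise NotImplementedError()

class LogNotifier(Notifier):
    """Implementation that just writes the event to the log."""
    def notify(self, publisher_id, event_type, priority, payload):
        logging.getLogger(publisher_id).info('%s %s %s',
                                             priority, event_type, payload)

class RpcNotifier(Notifier):
    """Implementation that publishes the event on the message bus."""
    def __init__(self, rpc, topic='notifications'):
        self.rpc = rpc
        self.topic = topic

    def notify(self, publisher_id, event_type, priority, payload):
        self.rpc.cast(None, '%s.%s' % (self.topic, priority),
                      {'publisher_id': publisher_id,
                       'event_type': event_type,
                       'priority': priority,
                       'payload': payload})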

We would like to move the notifiers into openstack-common.  Moving rpc into openstack-common is a prerequisite for that, so I’m going to knock that part out.  I’ve already written a few patches in that direction.  Once the rpc API is in openstack-common, other projects will be able to make use of it.  There was discussion of Quantum using rpc at the design summit, so this will be needed for that, too.  Another benefit is that the Heat project is using a copy of Nova’s rpc API right now, but will be able to migrate over to using the version from openstack-common.

2) Versioning the rpc API interfaces

The existing rpc API is pretty lightweight and seems to work quite well.  One limitation is that there is nothing provided to help with different versions of services talking to each other.  It may work … or it may not.  If it doesn’t, the failure you get could be something obvious, or it could be something really bad and bizarre where an operation fails half-way through, leaving things in a bad state.  I’d like to clean this up.

The end goal of this effort is to make sure that as you upgrade from Essex to Folsom, any messages originating from an Essex service can and will be correctly processed by a Folsom service.  If that isn’t possible, the failure should be immediate, and it should be obvious that the message was rejected due to a version mismatch.
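
One way to get there, sketched below, is to stamp every message with the version of the rpc interface the sender was written against, and have the receiving side refuse anything it cannot handle. This is just the general idea, not the final design.

# Sketch of the general idea only -- not the final design.

class VersionedDispatcher(object):
    """Server side: refuse messages from an incompatible rpc API version."""

    RPC_API_VERSION = '1.0'   # the interface version this service implements

    def __init__(self, manager):
        self.manager = manager   # object providing the actual rpc methods

    def dispatch(self, context, message):
        version = message.get('version', '1.0')
        major, minor = (int(x) for x in version.split('.'))
        my_major, my_minor = (int(x) for x in self.RPC_API_VERSION.split('.'))
        if major != my_major or minor > my_minor:
            # Fail immediately and obviously instead of half-executing.
            raise RuntimeError('Unsupported rpc API version %s (this service '
                               'implements %s)' %
                               (version, self.RPC_API_VERSION))
        method = getattr(self.manager, message['method'])
        return method(context, **message.get('args', {}))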

3) Removing database access from nova-compute

This is by far the biggest effort of the 3 described here, and I won’t be tackling this one alone.  I want to help drive it, though.  This discussion came up in the design summit session about enhancements to Nova security.  By removing direct database access from the nova-compute service, we can help reduce the potential impact if a compute node were to be compromised.  There are two main parts to this effort.

The first part is to make more efficient use of messaging by sending full objects through the rpc API instead of IDs. For example, there are many cases where the nova-api service gets an instance object from the database, does its thing, and then just sends the instance ID in the rpc message.  On the other side, the receiver has to pull that same object out of the database again.  We have to go through and change all cases like this to include the full object in the message.  In addition to the security benefit, it should be more efficient. This doesn’t sound too complicated, and it isn’t really, but it’s quite a bit of work because there is a lot of code that needs to be changed. There will be some snags along the way, such as making sure all of the objects can be serialized properly.
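
As a simplified illustration, the change is roughly the difference between these two message payloads; the awkward part is that the full object has to be reduced to something the message serializer can handle.

# Simplified illustration of the before/after message contents.

def reboot_msg_with_id(instance_id):
    # Old style: the receiver has to hit the database again to load the row.
    return {'method': 'reboot_instance',
            'args': {'instance_id': instance_id}}

def reboot_msg_with_object(instance):
    # New style: ship the whole instance in the message.  Fields such as
    # datetimes or nested rows need converting into serializable primitives.
    primitive = dict(instance)
    return {'method': 'reboot_instance',
            'args': {'instance': primitive}}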

Including full objects in messages is only part of the battle here.  The nova-compute service also does database updates.  We will have to come up with a new approach for this.  It will most likely end up being a new service of some sort that handles state changes coming from compute nodes and makes the necessary database updates for them. I haven’t fully thought through the solution to this part yet. For example, ensuring that this service maintains the proper ordering of messages without turning into a bottleneck for the overall system will be important.
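
Purely as a hypothetical sketch, such a service might look something like this, receiving state changes over rpc and being the only component that writes them to the database. Nothing like this exists in nova yet.

# Purely hypothetical sketch -- no such service exists in nova yet.

class StateUpdateService(object):
    """Receives state changes from compute nodes and applies them."""

    def __init__(self, db_api):
        self.db = db_api   # only this service touches the database

    def instance_update(self, context, instance_uuid, updates):
        # Ordering matters: updates for a given instance have to be applied
        # in the order the compute node sent them, and doing that without
        # becoming a bottleneck is the hard part.
        return self.db.instance_update(context, instance_uuid, updates)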

Onward!

I’m sure I’ll work on other things in Folsom, as well, but those are three that I have on my mind right now.  OpenStack is a great project and I’m excited to be a part of it!

Automated Testing of Matahari in a chroot Using Mock

While I was at Digium, I helped build out some automated testing for Asterisk (posts on that here and here). We had a continuous integration server that did builds, ran C API unit tests, and ran some functional tests written in Python.

One of the things that we wanted to do with the Asterisk tests was to sandbox each instance of Asterisk. All of this is handled by an Asterisk Python class which creates a new set of directories for each Asterisk instance to store its files in. This seems to have worked pretty well. One downside is that there is still the potential for Asterisk instances to step on each other's files, since all instances of Asterisk for all runs of the tests run within the same OS installation. One way to improve on that, which is a bit less heavy-handed than starting up a bunch of new VMs all the time, is to use a chroot.

Over the last week or so, I have been working on a similar setup for Matahari and wanted to share some information on how it works and in particular, some aspects that are different from what I’ve done before, including running all of the tests in a chroot.

What was already in place

Matahari uses Test-AutoBuild as its continuous integration server. The results of the latest build for Fedora can be found here.

When autobuild runs each hour, it clones the Matahari git repo and runs the autobuild.sh script in the top level directory. This script uses mock to build both matahari and mingw32-matahari packages. Mock handles setting up a chroot for the distribution and version you want to build a package for so you can easily do a lot of different builds on one machine. It also does a lot of caching to make this process much faster if you run it multiple times.

To install mock on Fedora, install the mock package. You will also need to add the user that will be running mock to the mock group.

To do a build in mock, you first need an srpm. The autobuild.sh script in Matahari has a make_srpm function that does this. Once you have an srpm, you can do a build with a single call to mock. The --root option specifies the type of chroot you want mock to use.

$ mock --root=fedora-16-x86_64 --rebuild *.src.rpm

The root, fedora-16-x86_64, is defined by a mock profile. Installing mock also installs a set of profiles, which can be found in /etc/mock/.

What’s new

While doing continuous builds is useful (for example, breaking the Windows build is not terribly uncommon), doing tests against these builds adds an enormous amount of additional value to the setup. Similar to what we have for Asterisk, we have two sets of tests for Matahari: a suite of C API unit tests, and a suite of functional tests written in Python.

The first thing I did was update the setup to run the unit tests. I decided to modify the RPM spec file to optionally run the unit tests as a part of the build process. That patch is here. If the run_unit_tests variable is set, the spec file runs ctest after compilation is complete. The other aspect of this is getting this variable defined, which is pretty easy to do with mock.

$ mock --root=fedora-16-x86_64 --define "run_unit_tests 1" --rebuild *.src.rpm

Getting the Python-based functional tests running within mock is a bit more tricky, but not too bad. With the mock commands presented so far, a number of things happen automatically, including setting up and cleaning up the chroot. To get these other tests running, we have to break the process up into smaller steps. The first step is to initialize a chroot. We will also be using another option for all of the mock commands, --resultdir, which lets you specify where logs and the resulting RPMs from --rebuild are stored.

$ mock --root=fedora-16-x86_64 --resultdir=mockresults --init

The next step is to build the RPMs like before. In this case, we already have an initialized chroot, so we need to tell mock not to create a new one. We also need to tell mock not to clean up the chroot after the RPM builds are complete, because we want to perform more operations in there.

$ mock --root=fedora-16-x86_64 --resultdir=mockresults \
    --define="run_unit_tests 1" --no-clean --no-cleanup-after \
    --rebuild *.src.rpm

At this point, we have compiled the source, run the unit tests, and built RPMs. The chroot used to do all of this is still there. We can take the RPMs we just built and install them into the chroot.

$ mock --root=fedora-16-x86_64 --resultdir=mockresults --install mockresults/*.rpm

Now it’s time to set up the functional tests. We need to install some dependencies for the tests and then copy the tests themselves into the chroot.

$ . src/tests/deps.sh
$ mock --root=fedora-16-x86_64 --resultdir=mockresults \
    --install ${MH_TESTS_DEPS}

$ mock --root=fedora-16-x86_64 --resultdir=mockresults \
    --copyin src/tests /matahari-tests

The dependencies of the tests are installed in the chroot and the tests themselves have been copied in. Now mock can be used to execute each set of tests.

$ mock --root=fedora-16-x86_64 --resultdir=mockresults \
    --shell "nosetests -v /matahari-tests/test_host_api.py"

$ mock --root=fedora-16-x86_64 --resultdir=mockresults \
    --shell "nosetests -v /matahari-tests/test_sysconfig_api.py"

$ mock --root=fedora-16-x86_64 --resultdir=mockresults \
    --shell "nosetests -v /matahari-tests/test_resource_api.py"

$ mock --root=fedora-16-x86_64 --resultdir=mockresults \
    --shell "nosetests -v /matahari-tests/test_service_api_minimal.py"

$ mock --root=fedora-16-x86_64 --resultdir=mockresults \
    --shell "nosetests -v /matahari-tests/test_network_api_minimal.py"

That’s it! The autobuild setup now tests compilation, unit tests, building RPMs, installing RPMs, and running and exercising all of the installed applications. It’s fast, too.

Final Thoughts

As with most things, there are some areas for improvement. For example, one glaring issue is that the entire setup is Fedora-specific, or at least specific to a distribution that can use mock. However, I have at least heard of a tool similar to mock called pbuilder for dpkg-based distributions, which could potentially be used in a similar way. I’m not sure.

There are also some issues with this approach specific to the Matahari project. Matahari includes a set of agents that provide systems management APIs. Testing of some of these APIs isn’t necessarily something you want to do on the same machine running the actual tests. To expand to much more complete coverage of these APIs, we’re going to have to break down and adopt an approach of spinning up VMs to run the Matahari agents. At that point, we may not run any of the functional tests from within mock anymore.

Mock is a very handy tool and it helped me to expand the automated build and test setup to get a lot more coverage in a shorter amount of time in a way that I was happy with. I hope this writeup helps you think about some other things that you could do with it.

Matahari: Systems Management and Monitoring APIs

I have been working at Red Hat for a few weeks now and have started getting some real work done. I wanted to share what I’m currently working on, and that is Matahari. Matahari is a cross-platform (Linux and Windows so far) collection of APIs accessible over local and remote interfaces for systems management and monitoring. What the heck does that mean? Read on, dear friends.

Architecture

I mentioned that Matahari is a collection of APIs. These APIs are accessible via a few different methods. The core of the functionality that we are implementing is done as C libraries. These can be used directly. However, we expect and intend for most users to access the functionality via one of the agents we provide. A Matahari Agent is an application that provides access to the features implemented in a Matahari Library via some transport. We are currently providing agents for D-Bus and QMF.

D-Bus is used quite heavily as a communications mechanism between applications on a single system. QMF, or the Qpid Management Framework, is used as a remote interface. QMF is a framework for building remote APIs on top of AMQP, an open protocol for messaging.

The agents are generally thin wrappers around a core library, so other transports could be added in the future if the need presents itself.
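
For example, from a management station you can poke at the QMF agents using the Python qmf console bindings. The class and package names below are from memory and may not match the agent schemas exactly, so treat this as a rough sketch.

# Rough sketch using the python-qmf console bindings; the class and
# package names are assumptions -- check the agent schemas before relying
# on them.
from qmf.console import Session

sess = Session()
broker = sess.addBroker('amqp://localhost:5672')

# Look up Host objects published by the Matahari host agent and read a
# property from each one.
for host in sess.getObjects(_class='Host', _package='org.matahariproject'):
    print(host.hostname)

sess.delBroker(broker)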

Current Features

So, what can you do with Matahari?

Matahari is still under heavy development, but there is already a decent amount of usable functionality.

  • Host – An agent for viewing and controlling hosts
    • View basic host information such as OS, hostname, CPU, RAM, load average, and more.
    • Host control (shutdown, reboot)
  • Networking – An agent for viewing and controlling network devices
    • Get a list of available network interfaces and information about them, such as IP and MAC addresses
    • Start and stop network interfaces
  • Services – An agent for viewing and controlling system services
    • List configured services
    • Start and stop services
    • Monitor the status of services
  • Sysconfig – Modify system configuration
    • Modify system configuration files (Linux)
    • Modify system registry (Windows)

More things that are in the works can be found on the project backlog.

Use Cases

An example of a project that already utilizes Matahari is Pacemaker-cloud, which is also under heavy development. Pacemaker-cloud utilizes both the Host and Services agents of Matahari. Being able to actively monitor and control services on remote hosts is a key element of being able to provide HA in a cloud deployment.

In addition to providing ready-to-use agents, we also provide some code that makes it easier to write a QMF agent, so that third parties can write their own Matahari agents. One example of this that already exists is libvirt-qmf, which is a Matahari agent that exposes libvirt functionality over QMF.

Join Us

If Matahari interests you, follow us on github and join our mailing list.  Thanks!

Taking On New Challenges

I began working on the Asterisk project in 2004.  My work on Asterisk has led to an exciting career in open source software engineering.  At the end of July 2011, I will be leaving Digium to take on some new challenges.  Specifically, I will be joining the Cloud Infrastructure team at Red Hat as a Principal Software Engineer where I will be working on projects related to clustering, high availability, and systems management.  Additionally, I will be moving back to Charleston, SC to be closer to my family.

While I will no longer be working with Asterisk full time, I still plan to participate in the open source community.  I am excited to watch both Asterisk and Asterisk SCF continue to evolve and grow.  The engineering team at Digium, as well as the global Asterisk development community are as strong as they have ever been and will continue to accomplish big things.

I have met many great people from all over the world in my time with Asterisk.  Thank you all for making the past seven years so memorable.

Best Regards,


Russell Bryant

 


Debugging the Asterisk Dialplan with Verbose()

Leif Madsen and I are working on a new book, the Asterisk Cookbook. One of the recipes that I am working on this morning is a method of adding debug statements into the Asterisk dialplan. I came up with a GoSub() routine that can log messages based on log level settings that are global, per-device, or per-channel. Here’s a preview. I hope you find it useful!

Channel logging GoSub() routine.

  • ARG1 – Log level.
  • ARG2 – The log message.

Messages logged with this routine are sent to the Asterisk console at verbose level 0, meaning they will show up regardless of the current “core set verbose” setting. Instead, this routine uses values in AstDB to control which messages show up.

AstDB entries:

  • Family: ChanLog/ Key: all
    • If the log level is less than or equal to this value, the message will be printed.
  • Family: ChanLog/ Key: channel/
    • This routine will also check for a channel-specific debug setting. It checks both the full channel name and just the part of the channel name before ‘-’. This allows setting a debug level for all calls from a particular device. For example, a SIP channel may be “SIP/myphone-0011223344”, and this routine will check:
      • Family: ChanLog/ Key: channel/SIP/myphone-0011223344
      • Family: ChanLog/ Key: channel/SIP/myphone

Example Dialplan Usage:

exten => 7201,1,GoSub(chanlog,s,1(1,${CHANNEL} has called ${EXTEN}))

Example of enabling debugging for a device from the Asterisk CLI:

*CLI> database put ChanLog channel/SIP/myphone 3

chanlog routine implementation:

[chanlog]

exten => s,1,GotoIf($[${DB_EXISTS(ChanLog/all)} = 0]?checkchan1)
    same => n,GotoIf($[${ARG1} <= ${DB(ChanLog/all)}]?log)
    same => n(checkchan1),Set(KEY=ChanLog/channel/${CHANNEL})
    same => n,GotoIf($[${DB_EXISTS(${KEY})} = 0]?checkchan2)
    same => n,GotoIf($[${ARG1} <= ${DB(${KEY})}]?log)
    same => n(checkchan2),Set(KEY=ChanLog/channel/${CUT(CHANNEL,-,1)})
    same => n,GotoIf($[${DB_EXISTS(${KEY})} = 0]?return)
    same => n,GotoIf($[${ARG1} <= ${DB(${KEY})}]?log)
    same => n(return),Return() ; Return without logging
    same => n(log),Verbose(0,${ARG2})
    same => n,Return()

Asterisk 1.10 Update

I just posted an update on the development of Asterisk 1.10 to the asterisk-dev mailing list. Here is the content:

Greetings,

Shortly after the release of Asterisk 1.8, we had a developer meeting
and discussed some of the projects that people would like to see in
Asterisk 1.10 [1]. We discussed the schedule there a bit, as well. Now
that Asterisk 1.8 has settled down and we are well into the development
cycle for Asterisk 1.10, it is a good time to revisit the plans for the
next release.

At Digium, the biggest thing we have been working on for 1.10 so far is
replacing the media infrastructure in Asterisk. Most of the critical
and invasive plumbing work is done and has been merged into trunk. Next
we’re looking at building up some features on top of that, such as
adding more codecs, enhancing ConfBridge() to support additional
sampling rates (HD conferencing), adding features that exist in
MeetMe() but not ConfBridge(), and enhancing codec negotiation.

Of course, many others have been working on new developments as well. I
would encourage you to respond if you’d like to provide an update on
some new things that you’re working on.

We would like to release Asterisk 1.10 roughly a year after Asterisk
1.8. This will be a standard release, not LTS [2]. To have the release
out in the October time frame, we need to branch off 1.10 (feature
freeze) at the end of June. At that point we will begin the beta and RC
process. If you’re working on new development projects that you would
like to get into Asterisk 1.10, please keep this timeline in mind.

As always, comments and questions are welcome.

Thanks,

[1] https://wiki.asterisk.org/wiki/display/AST/AstriDevCon+2010
[2] https://wiki.asterisk.org/wiki/display/AST/Asterisk+Versions


Russell Bryant