Comparing OpenStack Neutron ML2+OVS and OVN - Control Plane
By Russell Bryant
We have done a lot of performance testing of OVN over time, but one major thing missing has been an apples-to-apples comparison with the current OVS-based OpenStack Neutron backend (ML2+OVS). I’ve been working with a group of people to compare the two OpenStack Neutron backends. This is the first piece of those results: the control plane. Later posts will discuss data plane performance.
Control Plane Differences
The ML2+OVS control plane is based on a pattern seen throughout OpenStack: a series of agents written in Python. The Neutron server communicates with these agents using an RPC mechanism built on top of AMQP (RabbitMQ in most deployments, including our tests).
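As a rough illustration of that pattern (not Neutron's actual agent code), the sketch below uses oslo.messaging, the library OpenStack builds its AMQP RPC on; the broker URL, topic, and method name are placeholders:

# Minimal sketch of the server-to-agent RPC pattern used by ML2+OVS.
# Broker URL, topic, and method names are placeholders, not Neutron's real ones.
import oslo_messaging
from oslo_config import cfg

# The transport URL points at the AMQP broker (RabbitMQ in our tests).
transport = oslo_messaging.get_transport(
    cfg.CONF, url='rabbit://guest:guest@controller:5672/')

# Fanout target: every agent subscribed to this topic receives the message.
target = oslo_messaging.Target(topic='example-agent-topic', fanout=True)
client = oslo_messaging.RPCClient(transport, target)

# Fire-and-forget notification, e.g. "this port changed, update your local OVS state".
client.cast({}, 'port_update', port_id='PORT-UUID')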
OVN takes a distributed, database-driven approach. Configuration and state are managed through two databases, the OVN northbound and southbound databases, both currently implemented with OVSDB. Instead of receiving updates via RPC, components watch the relevant portions of the database for changes and apply them locally. More detail about these components can be found in my post about the first release of OVN, and even more is in the ovn-architecture document.
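As a rough sketch of that watch-and-react model (ovn-controller itself is written in C; this is not its actual code), the Python OVS IDL can monitor a table in the southbound database and react whenever it changes. The database endpoint and schema path below are placeholders:

# Watch the OVN southbound Port_Binding table with the Python OVS IDL and
# react to changes locally. Endpoint and schema path are placeholders.
import ovs.db.idl
import ovs.poller

SB_REMOTE = 'tcp:192.0.2.10:6642'                    # placeholder southbound DB endpoint
SCHEMA = '/usr/share/openvswitch/ovn-sb.ovsschema'   # placeholder schema location

helper = ovs.db.idl.SchemaHelper(location=SCHEMA)
helper.register_table('Port_Binding')                # only replicate the table we care about
idl = ovs.db.idl.Idl(SB_REMOTE, helper)

last_seqno = 0
while True:
    idl.run()                                        # apply any pending updates from the server
    if idl.change_seqno != last_seqno:
        last_seqno = idl.change_seqno
        for row in idl.tables['Port_Binding'].rows.values():
            # React to the new state locally, e.g. (re)program OVS flows.
            print(row.logical_port, row.chassis)
    poller = ovs.poller.Poller()
    idl.wait(poller)                                 # register interest in the next change
    poller.block()                                   # sleep until something happens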
OVN does not make use of any of the Neutron agents. Instead, all required functionality is implemented by ovn-controller and OVS flows. This includes things like security groups, DHCP, L3 routing, and NAT.
Hardware and Software
Our testing was done in a lab using 13 machines which were allocated to the following functions:
- 1 OpenStack TripleO Undercloud for provisioning
- 3 Controllers (OpenStack and OVN control plane services)
- 9 Compute Nodes (Hypervisors)
The hardware had the following specs:
- 2 x E5-2620 v2 (12 total cores, 24 total threads)
- 64 GB RAM
- 4 x 1 TB SATA
- 1 x Intel X520 dual-port 10G
Software:
- CentOS 7.2
- OpenStack, OVS, and OVN from their master branches (early December, 2016)
- Neutron configuration notes
- (OVN) 6 API workers and 1 RPC worker for neutron-server on each of the 3 controllers (RPC is not used by the OVN driver, but Neutron requires at least one RPC worker)
- (ML2+OVS) 6 API workers and 6 RPC workers for neutron-server on each of the 3 controllers
- (ML2+OVS) DVR was enabled (an illustrative neutron.conf sketch of these settings follows this list)
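The worker counts and DVR setting map to standard neutron.conf options. The fragment below is illustrative only, not the exact configuration file from the test environment:

# Illustrative neutron.conf fragment for the ML2+OVS controllers.
[DEFAULT]
api_workers = 6
rpc_workers = 6                 # the OVN deployment used rpc_workers = 1 instead
router_distributed = true       # enables DVR; not set in the OVN deployment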
Test Configuration
The tests were run using OpenStack Rally. We used the Browbeat project to set up, configure, and run the tests, as well as to store, analyze, and compare results. The Rally portion of the Browbeat configuration was:
rerun: 3
...
rally:
  enabled: true
  sleep_before: 5
  sleep_after: 5
  venv: /home/stack/rally-venv/bin/activate
  plugins:
    - netcreate-boot: rally/rally-plugins/netcreate-boot
    - subnet-router-create: rally/rally-plugins/subnet-router-create
    - neutron-securitygroup-port: rally/rally-plugins/neutron-securitygroup-port
  benchmarks:
    - name: neutron
      enabled: true
      concurrency:
        - 8
        - 16
        - 32
      times: 500
      scenarios:
        - name: create-list-network
          enabled: true
          file: rally/neutron/neutron-create-list-network-cc.yml
        - name: create-list-port
          enabled: true
          file: rally/neutron/neutron-create-list-port-cc.yml
        - name: create-list-router
          enabled: true
          file: rally/neutron/neutron-create-list-router-cc.yml
        - name: create-list-security-group
          enabled: true
          file: rally/neutron/neutron-create-list-security-group-cc.yml
        - name: create-list-subnet
          enabled: true
          file: rally/neutron/neutron-create-list-subnet-cc.yml
    - name: plugins
      enabled: true
      concurrency:
        - 8
        - 16
        - 32
      times: 500
      scenarios:
        - name: netcreate-boot
          enabled: true
          image_name: cirros
          flavor_name: m1.xtiny
          file: rally/rally-plugins/netcreate-boot/netcreate_boot.yml
        - name: subnet-router-create
          enabled: true
          num_networks: 10
          file: rally/rally-plugins/subnet-router-create/subnet-router-create.yml
        - name: neutron-securitygroup-port
          enabled: true
          file: rally/rally-plugins/neutron-securitygroup-port/neutron-securitygroup-port.yml
This configuration defines several scenarios to run. Each one runs 500 times at three different concurrency levels, and “rerun: 3” at the top means the entire configuration is run 3 times. This can be a bit confusing, so let’s look at one example.
The “netcreate-boot” scenario is to create a network and boot a VM on that network. The configuration results in the following execution:
- Run 1
  - Create 500 VMs, each on their own network, 8 at a time, and then clean up
  - Create 500 VMs, each on their own network, 16 at a time, and then clean up
  - Create 500 VMs, each on their own network, 32 at a time, and then clean up
- Run 2
  - Create 500 VMs, each on their own network, 8 at a time, and then clean up
  - Create 500 VMs, each on their own network, 16 at a time, and then clean up
  - Create 500 VMs, each on their own network, 32 at a time, and then clean up
- Run 3
  - Create 500 VMs, each on their own network, 8 at a time, and then clean up
  - Create 500 VMs, each on their own network, 16 at a time, and then clean up
  - Create 500 VMs, each on their own network, 32 at a time, and then clean up
In total, we will have created 4,500 VMs (500 boots × 3 concurrency levels × 3 reruns).
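For context, a Rally scenario like netcreate-boot is a small Python plugin built on Rally's OpenStack scenario utilities. The sketch below shows the general shape, assuming the Rally plugin API of that era (base classes and helpers such as _create_network, _create_subnet, and _boot_server); it is not the actual Browbeat plugin:

# Rough sketch of a "create a network, boot a VM on it" Rally scenario in the
# style of Rally's OpenStack plugins circa 2016. Not the actual Browbeat
# netcreate-boot plugin; module paths and helper names are assumptions.
from rally.plugins.openstack.scenarios.neutron import utils as neutron_utils
from rally.plugins.openstack.scenarios.nova import utils as nova_utils
from rally.task import scenario
from rally.task import types


class NetcreateBoot(neutron_utils.NeutronScenario, nova_utils.NovaScenario):

    @types.convert(image={"type": "glance_image"}, flavor={"type": "nova_flavor"})
    @scenario.configure(name="NetcreateBoot.netcreate_boot")
    def netcreate_boot(self, image, flavor, network_create_args=None, **kwargs):
        # Each iteration creates its own network and subnet ...
        network = self._create_network(network_create_args or {})
        self._create_subnet(network, {})
        # ... then boots a server attached to that network; Rally records the
        # timing of each step and of the iteration as a whole.
        kwargs["nics"] = [{"net-id": network["network"]["id"]}]
        self._boot_server(image, flavor, **kwargs)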
Results
Browbeat includes the ability to store all Rally test results in Elasticsearch and then display them using Kibana. A live dashboard of these results is on elk.browbeatproject.org.
The following tables show the average, 95th percentile, maximum, and minimum times for all APIs executed throughout the test scenarios.
Analysis
The most drastic difference in the results is for “nova.boot_server”. This is also the one part of these tests that actually measures the time it takes to provision the network for use, rather than just the time to load Neutron with configuration.
When Nova boots a server, it blocks waiting for an event from Neutron indicating that a port is ready before it sets the server state to ACTIVE and powers on the VM. Both ML2+OVS and OVN implement this mechanism. Our test scenario measured the time it took for servers to become ACTIVE.
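That handshake happens through Nova's os-server-external-events API: once the backend has wired up the port, Neutron posts a network-vif-plugged event for the instance, and Nova then powers it on and marks it ACTIVE. The sketch below shows roughly what that notification looks like (this is not Neutron's actual notifier code; credentials, endpoints, and UUIDs are placeholders):

# Rough sketch of the "port is ready" notification Neutron sends to Nova via
# the os-server-external-events API. Not Neutron's actual notifier code;
# credentials, endpoints, and UUIDs are placeholders.
from keystoneauth1 import loading, session
from novaclient import client as nova_client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(auth_url='http://controller:5000/v3',
                                username='neutron', password='secret',
                                project_name='service',
                                user_domain_id='default',
                                project_domain_id='default')
sess = session.Session(auth=auth)
nova = nova_client.Client('2.1', session=sess)

# Tell Nova the VIF for this port is plugged; Nova can now power on the
# instance and move it to ACTIVE.
nova.server_external_events.create([{
    'server_uuid': 'INSTANCE-UUID',      # placeholder
    'name': 'network-vif-plugged',
    'tag': 'NEUTRON-PORT-UUID',          # placeholder
    'status': 'completed',
}])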
Further tests were done on ML2+OVS, and we confirmed that disabling this synchronization between Nova and Neutron brought the results back on par with the OVN results. In other words, the extra time was indeed spent waiting for Neutron to report that ports were ready.
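For reference, the standard knobs for this synchronization on the Nova side are the VIF-plugging options in nova.conf, shown here only for completeness (see the caveat below):

# nova.conf fragment, for diagnosis only: these settings make Nova stop waiting
# for Neutron's "port ready" event before activating an instance.
[DEFAULT]
vif_plugging_is_fatal = False   # don't fail the boot if the event never arrives
vif_plugging_timeout = 0        # don't wait for the network-vif-plugged event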
To be clear, you should not disable this synchronization. It can only be disabled at all because not all Neutron backends support it (ML2+OVS and OVN both do). It was put in place to avoid a race condition: it ensures that the network is actually ready for use before a VM is booted. The real issue is how long it takes Neutron (ML2+OVS) to provision the network, and further analysis is needed to break down where that time is being spent in the provisioning process.