Thursday, 17 October 2013

Exalogic and CPU Oversubscription


Ever since the EECS release the facility to oversubscribe the physical CPU on an Exalogic has existed.  The documentation explains how the oversubscription is set and introduces the idea of a ratio and the CPU cap.  However, I found it slightly light on detail, so this posting attempts to explain further just how the "vCPU to Physical CPU Threads ratio" and the "CPU cap" interact with each other.

The settings for this feature are edited as part of the virtual data centre (vDC), so they affect all tenants and users of the rack.  Figure 1 shows a screenshot of the configurable parameters.  These values can be changed at any time during the lifecycle of the virtual data centre, but the impact of any change must be understood.

Figure 1 - Editing the vDC properties for CPU oversubscription

vCPU to Physical CPU Ratio

The vCPU to pCPU ratio is used by the Exalogic Control placement algorithm.  By changing the ratio from 1:1 to, say, 1:2 you are effectively doubling the number of vCPUs that can be allocated to vServers in the data centre.  When oversubscribed, vServers can end up competing with each other for access to the actual CPUs; at that point the Xen scheduler starts arbitrating access to the physical CPUs.

For example, consider a single compute node with 32 hardware threads (2 sockets * 8 cores * 2 threads per core = 32) on which we are placing vServers with 1 vCPU each.  With the ratio at 1:1 we would be able to place 32 vServers on the physical compute node.  With the ratio set to 1:2 we would be able to place 64 vServers on the compute node.
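
The arithmetic above can be sketched in a few lines of Python.  This is only an illustration of the sums: the function names are my own, and the real Exalogic Control placement algorithm also has to take memory and other constraints into account, which are ignored here.

```python
def hardware_threads(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Number of hardware threads (pCPUs) on one compute node."""
    return sockets * cores_per_socket * threads_per_core


def placeable_vservers(pcpus: int, vcpus_per_pcpu: int, vcpus_per_vserver: int = 1) -> int:
    """How many equally sized vServers fit on a node when each pCPU may
    back `vcpus_per_pcpu` vCPUs (1 for a 1:1 ratio, 2 for a 1:2 ratio)."""
    return (pcpus * vcpus_per_pcpu) // vcpus_per_vserver


if __name__ == "__main__":
    node = hardware_threads(sockets=2, cores_per_socket=8, threads_per_core=2)
    for ratio in (1, 2):
        print(f"ratio 1:{ratio}: {placeable_vservers(node, ratio)} single-vCPU vServers")
```

The same helper can be called with a larger `vcpus_per_vserver` to see how quickly the head-room disappears for bigger guests.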

This value can be changed at any time and the change will impact all vServers in the data centre.  Increasing the ratio is not a problem, but reducing it, say from 1:2 to 1:1, is only valid if the vCPUs of all existing vServers still fit within the reduced capacity of the virtual data centre.
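
A rough way to reason about whether a reduction is possible is to compare the vCPUs already handed out with the capacity at the new ratio.  The sketch below is a simplified aggregate check of my own (the function name is hypothetical); Exalogic Control will apply its own rules, for example per compute node, so treat it as a first sanity check only.

```python
from typing import Iterable


def ratio_change_fits(existing_vcpus: Iterable[int], pcpus: int, new_vcpus_per_pcpu: int) -> bool:
    """True if the vCPUs already allocated to vServers still fit when each
    pCPU is allowed to back `new_vcpus_per_pcpu` vCPUs."""
    return sum(existing_vcpus) <= pcpus * new_vcpus_per_pcpu


if __name__ == "__main__":
    allocated = [16, 16, 8]  # vCPUs of three hypothetical vServers
    print(ratio_change_fits(allocated, pcpus=32, new_vcpus_per_pcpu=2))
    print(ratio_change_fits(allocated, pcpus=32, new_vcpus_per_pcpu=1))
```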

Xen Hypervisor and CPU Scheduling

Underpinning the Exalogic rack is the Xen hypervisor, which has a scheduler that controls access to the physical CPUs.  The scheduler is similar in principle to the Linux scheduler; however, it referees between running guest OSes or domains (including the dom0 domain!), ensuring that the compute power is shared out appropriately to all.  There are a number of scheduling algorithms available with Xen but the Credit Scheduler is the default and has had the most development and testing.  You can check which scheduler is running using the xm dmesg command.

# xm dmesg
 __  __            _  _    _   _____  _____     ____  __
 \ \/ /___ _ __   | || |  / | |___ / / _ \ \   / /  \/  |
  \  // _ \ '_ \  | || |_ | |   |_ \| | | \ \ / /| |\/| |
  /  \  __/ | | | |__   _|| |_ ___) | |_| |\ V / | |  | |
 /_/\_\___|_| |_|    |_|(_)_(_)____/ \___/  \_/  |_|  |_|
(XEN) Xen version 4.1.3OVM ( (gcc version 4.1.2 20080704 (Red Hat 4.1.2-48)) Wed Dec  5 09:11:29 PST 2012
(XEN) Latest ChangeSet: unavailable
(XEN) Bootloader: GNU GRUB 0.97
(XEN) Command line: console=com1,vga com1=9600,8n1 dom0_mem=2G

(XEN) ERST table is invalid
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Detected 3059.044 MHz processor.
(XEN) Initing memory sharing.
(XEN) Intel VT-d supported pag
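
If you want to check this across several compute nodes, the scheduler line can be picked out of the captured xm dmesg text.  The snippet below is a small sketch that assumes the line has the same layout as in the output above; the function name is my own.

```python
import re
from typing import Optional


def scheduler_from_dmesg(dmesg_text: str) -> Optional[str]:
    """Return the short scheduler name (e.g. 'credit') from `xm dmesg` text,
    or None if no 'Using scheduler' line is present."""
    match = re.search(r"^\(XEN\) Using scheduler: (.+?) \((\w+)\)\s*$",
                      dmesg_text, re.MULTILINE)
    return match.group(2) if match else None


if __name__ == "__main__":
    sample = (
        "(XEN) ERST table is invalid\n"
        "(XEN) Using scheduler: SMP Credit Scheduler (credit)\n"
        "(XEN) Detected 3059.044 MHz processor.\n"
    )
    print(scheduler_from_dmesg(sample))
```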


On Exalogic, when we create a vServer we define the number of vCPUs that are allocated to the guest.  Each vCPU equates to a single hardware thread, and on creation of the vServer Xen will allocate the CPUs for the guest to use.  The xm vcpu-list command shows which physical CPU each vCPU is currently running on and which CPUs it is allowed to use (its CPU affinity).

# xm vcpu-list 
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
0004fb000006000036b78d88370acd11     5     0    22   -b-  315931.3 12-23
0004fb000006000036b78d88370acd11     5     1    15   -b-  210225.2 12-23
0004fb000006000036b78d88370acd11     5     2    17   -b-   94768.4 12-23
0004fb000006000036b78d88370acd11     5     3    13   -b-   99020.7 12-23
0004fb000006000036b78d88370acd11     5     4    19   -b-   97240.0 12-23
0004fb000006000036b78d88370acd11     5     5    16   -b-   90450.2 12-23
0004fb000006000036b78d88370acd11     5     6    20   -b-   85511.0 12-23
0004fb000006000036b78d88370acd11     5     7    14   -b-   74973.3 12-23
0004fb000006000036b78d88370acd11     5     8    23   -b-   75114.4 12-23
0004fb000006000036b78d88370acd11     5     9    16   -b-   65374.4 12-23
0004fb000006000036b78d88370acd11     5    10    14   -b-   64786.6 12-23
0004fb000006000036b78d88370acd11     5    11    20   -b-   64758.8 12-23

0004fb0000060000ea7b5bf71f3a806c     4     0     9   -b-  139777.8 0-11
0004fb0000060000ea7b5bf71f3a806c     4     1     2   -b-  173475.2 0-11
Domain-0                             0     0     4   -b-   69594.9 any cpu
Domain-0                             0     1     5   -b-   55519.9 any cpu

In this example we can see that there is a vServer running as ID 5 (actually the Exalogic Control vServer) which has been allocated 12 vCPUs, and Xen has determined that these will run on CPUs 12-23.  Similarly the vServer with ID 4 (the Exalogic Control proxy) has been allocated 2 vCPUs which will run on CPUs 0-11.
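
On a busy node the listing gets long, so it can be handy to boil it down to one line per domain.  The following sketch assumes the column layout shown above (seven columns, with the affinity last) and uses a function name of my own choosing.

```python
from typing import Dict


def summarize_vcpu_list(text: str) -> Dict[str, dict]:
    """Count vCPUs and record the CPU affinity for each domain found in
    `xm vcpu-list` output."""
    summary: Dict[str, dict] = {}
    for line in text.splitlines():
        # Split into at most 7 fields so "any cpu" stays in one piece.
        fields = line.split(None, 6)
        if len(fields) < 7 or fields[0] == "Name":
            continue  # skip the header and blank lines
        name, affinity = fields[0], fields[6].strip()
        entry = summary.setdefault(name, {"vcpus": 0, "affinity": affinity})
        entry["vcpus"] += 1
    return summary


if __name__ == "__main__":
    sample = """\
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
0004fb0000060000ea7b5bf71f3a806c     4     0     9   -b-  139777.8 0-11
0004fb0000060000ea7b5bf71f3a806c     4     1     2   -b-  173475.2 0-11
Domain-0                             0     0     4   -b-   69594.9 any cpu
Domain-0                             0     1     5   -b-   55519.9 any cpu
"""
    for name, info in summarize_vcpu_list(sample).items():
        print(name, info["vcpus"], info["affinity"])
```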

The credit scheduler will attempt to make sure that all vServers get access to the CPUs they have been allocated, and provided there is no contention (over-subscription) there will be very little for the scheduler to do.  However, if the guests on a compute node demand more compute than it physically has then the scheduler will kick in and do its best to share out the resource according to the scheduling rules.  The rules are a credit-based scoring system which depends on two factors, a weight and a cap.
  • The weight is a number that indicates to the scheduler how much "credit" a vServer will get, thus a vServer with a weight of 1000 will get twice as much CPU as a vServer with a weight of 500, once the system is under contention for the CPU.
  • The Cap is an absolute limit on the amount of CPU time that a domain can be allocated; it is defined as a percentage of a vCPU.  Thus if it is set to 50, a domain will only be allowed half the available cycles of a pCPU.  A Cap of 0 means that no limit is applied.
An example might help explain how this operates.  Consider two domains competing for the same pCPU.  Initially the scheduler allocates a credit score to each domain.  The credit given is worked out from the weight: the higher the weight, the larger the credit score.  While a domain is consuming CPU cycles the scheduler steadily reduces its credit score.  The scheduler runs an accounting thread independently of the domains, and if the credit for the running domain drops (significantly) below the credit score of a contending domain then the other domain gets access to the processor and its credit score starts diminishing in turn.  Periodically the accounting thread runs to top up all the credit scores.
If the cap is being used then you are effectively reducing the compute resource available to a domain, as any one vServer will only ever be allocated compute up to the cap level.
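
To get a feel for how weight and cap combine, here is a toy model in Python.  It is not the Xen credit scheduler: it simply assumes every domain is permanently busy on a single shared pCPU, shares that pCPU in proportion to the weights, and pins any domain whose share would exceed its cap.  The function name and the simplifications are mine.

```python
from typing import Dict, Tuple


def contended_shares(domains: Dict[str, Tuple[int, int]],
                     capacity: float = 100.0) -> Dict[str, float]:
    """Split `capacity` (percent of one pCPU) between always-busy domains.

    `domains` maps a name to (weight, cap); a cap of 0 means "no cap".
    Weights are assumed to be positive.
    """
    shares: Dict[str, float] = {}
    active = dict(domains)
    remaining = capacity
    while active:
        total_weight = sum(weight for weight, _ in active.values())
        # Domains whose weight-proportional share would exceed their cap.
        over_cap = [name for name, (weight, cap) in active.items()
                    if cap and remaining * weight / total_weight > cap]
        if not over_cap:
            for name, (weight, _) in active.items():
                shares[name] = remaining * weight / total_weight
            break
        # Pin capped domains at their cap, then share what is left again.
        for name in over_cap:
            cap = active.pop(name)[1]
            shares[name] = float(cap)
            remaining -= cap
    return shares


if __name__ == "__main__":
    print(contended_shares({"big": (1000, 0), "small": (500, 0)}))
    print(contended_shares({"big": (1000, 40), "small": (500, 0)}))
```

In the first call the domain with weight 1000 ends up with twice the share of the one with weight 500; in the second the cap of 40 wins over the weight and the uncapped domain picks up the remainder.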

On an Exalogic the weight is automatically assigned to a vServer and all vServers are given the same weight; the Cap is configurable in Exalogic Control.  It is set at the vDC level and the value is embedded in the vServer configuration file, so changing the value in the vDC will only impact vServers created after the change.  Example output of the xm sched-credit command on a couple of compute nodes is shown below.

[root@el01cn01 root]# xm sched-credit
Name                                ID Weight  Cap
0004fb000006000071d34e42e43bd82b    12    256    0
0004fb0000060000bb90063bf8efe7d7    13    256    0
Domain-0                             0    256    0

[root@el01cn01 root]# ssh root@el01cn07 xm sched-credit
Name                                ID Weight  Cap
0004fb000006000082dbcec62a976907    15  27500   90
Domain-0                             0    256    0

Here we can see first the output from compute node 01, which shows two of the control stack vServers with a weight of 256, and then the output from compute node 07, where a customer vServer has a weight of 27500.  Dom0 gets a weight of 256, the same as the control stack vServers.  For the customer vServer the vDC had been changed to make the Cap 90%, so it will never use more than 90% of a vCPU.
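
Because vServers created at different times can carry different caps, it is useful to be able to list the capped ones quickly.  The sketch below parses text in the four-column layout shown above; the function name is hypothetical and it has only been written against this sample layout.

```python
from typing import Dict


def parse_sched_credit(text: str) -> Dict[str, dict]:
    """Turn `xm sched-credit` output into {name: {id, weight, cap}}."""
    result: Dict[str, dict] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[0] == "Name":
            continue  # skip the header, prompts and blank lines
        name, dom_id, weight, cap = fields
        result[name] = {"id": int(dom_id), "weight": int(weight), "cap": int(cap)}
    return result


if __name__ == "__main__":
    sample = """\
Name                                ID Weight  Cap
0004fb000006000082dbcec62a976907    15  27500   90
Domain-0                             0    256    0
"""
    for name, info in parse_sched_credit(sample).items():
        if info["cap"]:
            print(f"{name} is capped at {info['cap']}%")
```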

The Weight is automatically set by Exalogic Control and the Cap is set to whatever the vDC value is at the time that the vServer is created.  These values are written into the vm.cfg file that holds the configuration of the vServer.

[root@el01cn01 0004fb000006000082dbcec62a976907]# cat vm.cfg
kernel = '/usr/lib/xen/boot/hvmloader'
vif = []
OVM_simple_name = 'test-vserver'

name = '0004fb000006000082dbcec62a976907'
vncpasswd = ''
cpu_weight = 27500
pae = 1
memory = 4096
cpu_cap = 90
OVM_high_availability = True
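
The entries in vm.cfg look like simple Python-style assignments, so the scheduling values can be pulled out without running the file.  This sketch assumes every value is a plain literal, as in the extract above; the function name is my own and anything more exotic in a real file would need extra handling.

```python
import ast
from typing import Any, Dict


def read_vm_cfg(text: str) -> Dict[str, Any]:
    """Read simple `name = literal` assignments from vm.cfg text."""
    settings: Dict[str, Any] = {}
    for node in ast.parse(text).body:
        if (isinstance(node, ast.Assign) and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)):
            # literal_eval only accepts literals, so nothing is executed.
            settings[node.targets[0].id] = ast.literal_eval(node.value)
    return settings


if __name__ == "__main__":
    sample = """\
OVM_simple_name = 'test-vserver'
cpu_weight = 27500
memory = 4096
cpu_cap = 90
OVM_high_availability = True
"""
    cfg = read_vm_cfg(sample)
    print(cfg["OVM_simple_name"], cfg["cpu_weight"], cfg["cpu_cap"])
```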

Managing this on the Exalogic

The key points to understand are:
  1. The vCPU:pCPU ratio only impacts the vServer placement algorithm and enables Exalogic Control to place vServers onto the rack such that there can be more vServers demanding vCPU than there are available pCPUs.
    1. This impacts all vServers, those previously created and those still to be created.
    2. Changing this will not reduce the CPU available to a given vServer until the system comes under CPU contention!
    3. The recommendation is not to make this ratio any larger than about 1:4, because when a small vServer (with just 1 vCPU) comes under contention the lack of compute power can lead to instability and timeouts.
  2. The CPU Cap is useful for the situation where you wish to use CPU oversubscription but want to have deterministic access to CPU.  Effectively this will reduce the power of your vServers but delay the point at which the Xen scheduler has to control vServer access to physical CPU.  (Arguably, rather than using the Cap it would be possible to simply reduce the number of CPUs allocated to each vServer and gain the same density/performance.)
    1. The Cap is set at the vDC level but changing it will not affect previously created vServers.  Thus it is possible to have a virtual deployment with different vServers having different caps.
