XEN CORE-SCHEDULING BENCHMARK
=============================

Code:
- Juergen's v1 patches, rebased
  git: https://github.com/jgross1/xen/tree/sched-v1-rebase

Host:
- CPU: 1 socket, 4 cores, 2 threads per core
- RAM: 32 GB
- distro: openSUSE Tumbleweed
- Kernel: 5.1.7
- HW Bugs Mitigations: fully disabled (both in Xen and dom0)
- Filesys: XFS

VMs:
- vCPUs: either 8 or 4
- distro: openSUSE Tumbleweed
- Kernel: 5.1.7
- HW Bugs Mitigations: fully disabled
- Filesys: XFS

Configurations:
- BM-HT: baremetal, Hyperthreading enabled (8 CPUs)
- BM-noHT: baremetal, Hyperthreading disabled (4 CPUs)
- v-XEN-HT: Xen staging (no coresched patches applied), Xen dom0, 8 vCPUs,
  Hyperthreading enabled (8 CPUs)
- v-XEN-noHT: Xen staging (no coresched patches applied), Xen dom0, 4 vCPUs,
  Hyperthreading disabled (4 CPUs)
- XEN-HT: Xen with coresched patches applied, Xen dom0, 8 vCPUs,
  Hyperthreading enabled (8 CPUs), sched-gran=cpu (default)
- XEN-noHT: Xen with coresched patches applied, Xen dom0, 4 vCPUs,
  Hyperthreading disabled (4 CPUs), sched-gran=cpu (default)
- XEN-csc: Xen with coresched patches applied, Xen dom0, 8 vCPUs,
  Hyperthreading enabled (8 CPUs), sched-gran=core
- v-HVM-v8-HT: Xen staging (no coresched patches applied), HVM guest, 8 vCPUs,
  Hyperthreading enabled (8 CPUs)
- v-HVM-v8-noHT: Xen staging (no coresched patches applied), HVM guest, 8 vCPUs,
  Hyperthreading disabled (4 CPUs)
- HVM-v8-HT: Xen with coresched patches applied, HVM guest, 8 vCPUs,
  Hyperthreading enabled (8 CPUs), sched-gran=cpu (default)
- HVM-v8-noHT: Xen with coresched patches applied, HVM guest, 8 vCPUs,
  Hyperthreading disabled (4 CPUs), sched-gran=cpu (default)
- HVM-v8-csc: Xen with coresched patches applied, HVM guest, 8 vCPUs,
  Hyperthreading enabled (8 CPUs), sched-gran=core
- v-HVM-v4-HT: Xen staging (no coresched patches applied), HVM guest, 4 vCPUs,
  Hyperthreading enabled (8 CPUs)
- v-HVM-v4-noHT: Xen staging (no coresched patches applied), HVM guest, 4 vCPUs,
  Hyperthreading disabled (4 CPUs)
- HVM-v4-HT: Xen with coresched patches applied, HVM guest, 4 vCPUs,
  Hyperthreading enabled (8 CPUs), sched-gran=cpu (default)
- HVM-v4-noHT: Xen with coresched patches applied, HVM guest, 4 vCPUs,
  Hyperthreading disabled (4 CPUs), sched-gran=cpu (default)
- HVM-v4-csc: Xen with coresched patches applied, HVM guest, 4 vCPUs,
  Hyperthreading enabled (8 CPUs), sched-gran=core
- v-HVMx2-v8-HT: Xen staging (no coresched patches applied), 2x HVM guests,
  8 vCPUs each, Hyperthreading enabled (8 CPUs)
- v-HVMx2-v8-noHT: Xen staging (no coresched patches applied), 2x HVM guests,
  8 vCPUs each, Hyperthreading disabled (4 CPUs)
- HVMx2-v8-HT: Xen with coresched patches applied, 2x HVM guests, 8 vCPUs each,
  Hyperthreading enabled (8 CPUs), sched-gran=cpu (default)
- HVMx2-v8-noHT: Xen with coresched patches applied, 2x HVM guests, 8 vCPUs each,
  Hyperthreading disabled (4 CPUs), sched-gran=cpu (default)
- HVMx2-v8-csc: Xen with coresched patches applied, 2x HVM guests, 8 vCPUs each,
  Hyperthreading enabled (8 CPUs), sched-gran=core
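For reference, this is roughly how the configurations above map to boot
parameters. The sched-gran values are the knob added by the series under
test; how SMT was actually toggled for these runs isn't recorded here, so
the smt=/sysfs lines below are just one plausible way of doing it:

  # Xen command line, csc configurations (knob from the series):
  #   sched-gran=core
  # Xen command line, cpu-granularity (default) configurations:
  #   sched-gran=cpu
  # Disabling Hyperthreading for the Xen *-noHT runs, e.g. via the
  # 'smt' Xen boot option (assuming a recent enough staging tree):
  #   smt=0
  # On baremetal (Linux 5.1), SMT can also be turned off at runtime:
  echo off | sudo tee /sys/devices/system/cpu/smt/control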
Benchmarks:
- Suite: MMTests, https://github.com/gormanm/mmtests
  (e.g., see: https://lkml.org/lkml/2019/4/25/588)
  * Stream: pure memory benchmark (various kinds of mem-ops done in
    parallel). Parallelism is NR_CPUS/2 tasks (i.e., 4 tasks with HT
    enabled, 2 with it disabled). Ideally, we wouldn't see any difference
    between HT, noHT and csc;
  * Kernbench: builds a kernel, with varying number of compile jobs. HT
    is, in general, known to help, at least a little, as it lets us do
    more parallel builds;
  * Hackbench: communication (via pipes, in this case) between groups of
    processes. As we deal with _groups_ of tasks, we're already in
    saturation with 1 group, hence we expect noHT to suffer;
  * mutilate: load generator for memcached, with high request rate;
  * netperf-{udp,tcp,unix}: two communicating tasks. Without any pinning
    (either at hypervisor or at dom0/guest level), we expect HT to play a
    role, as performance may vary depending on where the two tasks are
    scheduled (i.e., whether or not on two threads of the same core);
  * pgioperfbench: Postgres microbenchmark (I wanted to use pgbench, but
    had issues);
  * iozone: I have the numbers, but they're not shown here (needs time
    for looking at them and properly presenting them).

Raw results:
- On my test box. I'll put them somewhere soon... :-)

Result comparisons:
1) BM-HT vs. BM-noHT: check whether, on baremetal, for the given
   benchmark, HT has a positive or negative impact on performance
2) v-XEN-HT vs. XEN-HT: check the overhead introduced by having the
   coresched patches applied, but not in use, in case HT is enabled.
   This tells us how expensive it is to have these patches in, for
   someone that does not need or want to use them, and keeps HT enabled;
3) v-XEN-noHT vs. XEN-noHT: check the overhead introduced by having the
   coresched patches applied, but not in use, in case HT is disabled.
   This tells us how expensive it is to have these patches in, for
   someone that does not need or want them, and keeps HT disabled;
4) XEN-HT vs. XEN-noHT vs. XEN-csc: with coresched patches applied,
   compare sched-gran=cpu with HT enabled (the current default) against
   turning HT off and against sched-gran=core (with HT enabled, of
   course). This tells us how much perf we lose (or gain!) when using
   core-scheduling, as compared to both the default and to disabling
   Hyperthreading. Ideally, we'd end up a bit slower than the default,
   but not as slow as with HT disabled.
5) HVM-v8-HT vs. HVM-v8-noHT vs. HVM-v8-csc: same as above, but the
   benchmarks run in an HVM guest with 8 vCPUs.
6) HVM-v4-HT vs. HVM-v4-noHT vs. HVM-v4-csc: same as 5), but the HVM VM
   now has only 4 vCPUs.
7) HVMx2-v8-HT vs. HVMx2-v8-noHT vs. HVMx2-v8-csc: now there are 2 HVM
   guests, both with 8 vCPUs. One runs synthetic workloads (CPU and
   memory) producing ~600% CPU load. The other runs our benchmarks. So,
   the host is overloaded by a 2x factor, in terms of number of vCPUs
   vs. number of pCPUs (not counting dom0!). In terms of CPU load, it's
   almost the same, but harder to tell exactly, as the synthetic load is
   ~600% (and not 800%) and the various benchmarks have variable CPU
   load.
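All the tables below come out of the MMTests comparison scripts. As a
rough sketch of the workflow (config file name and run ids are
illustrative, not necessarily the exact ones used here):

  # run one benchmark configuration and tag the results:
  git clone https://github.com/gormanm/mmtests
  cd mmtests
  ./run-mmtests.sh --config configs/config-network-netperf-unbound BM-HT
  # produce a comparison table like the ones below:
  ./bin/compare-mmtests.pl --directory work/log \
        --benchmark netperf-udp --names BM-HT,BM-noHT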
STREAM
======

1)                      BM-HT               BM-noHT
MB/sec copy      32623.20 (  0.00%)  33906.18 (  3.93%)
MB/sec scale     22056.18 (  0.00%)  23005.44 (  4.30%)
MB/sec add       25827.70 (  0.00%)  26707.06 (  3.40%)
MB/sec triad     25771.70 (  0.00%)  26705.98 (  3.63%)

I was expecting nearly identical performance between 'HT' and 'noHT'.
For sure, I wasn't expecting 'noHT' to regress. It's actually going a
little faster, which does indeed make sense (as we don't use any task
pinning).

2)                     v-XEN-HT              XEN-HT
MB/sec copy      33076.00 (  0.00%)  33112.76 (  0.11%)
MB/sec scale     24170.98 (  0.00%)  24225.04 (  0.22%)
MB/sec add       27060.58 (  0.00%)  27195.52 (  0.50%)
MB/sec triad     27172.34 (  0.00%)  27296.42 (  0.46%)

3)                    v-XEN-noHT            XEN-noHT
MB/sec copy      33054.78 (  0.00%)  33070.62 (  0.05%)
MB/sec scale     24198.12 (  0.00%)  24335.76 (  0.57%)
MB/sec add       27084.58 (  0.00%)  27360.86 (  1.02%)
MB/sec triad     27199.12 (  0.00%)  27468.54 (  0.99%)

4)                      XEN-HT             XEN-noHT              XEN-csc
MB/sec copy      33112.76 (  0.00%)  33070.62 ( -0.13%)  31811.06 ( -3.93%)
MB/sec scale     24225.04 (  0.00%)  24335.76 (  0.46%)  22745.18 ( -6.11%)
MB/sec add       27195.52 (  0.00%)  27360.86 (  0.61%)  25088.74 ( -7.75%)
MB/sec triad     27296.42 (  0.00%)  27468.54 (  0.63%)  25170.02 ( -7.79%)

I wasn't expecting a degradation for 'csc' (as, in fact, we don't see
any in 'noHT'). Perhaps I should re-run the 'csc' case with 2 threads,
and see if it makes any difference.

5)                    HVM-v8-HT           HVM-v8-noHT           HVM-v8-csc
MB/sec copy      33338.56 (  0.00%)  34168.32 (  2.49%)  29568.34 ( -11.31%)
MB/sec scale     22001.94 (  0.00%)  23023.06 (  4.64%)  18635.30 ( -15.30%)
MB/sec add       25442.48 (  0.00%)  26676.96 (  4.85%)  21632.14 ( -14.98%)
MB/sec triad     26376.48 (  0.00%)  26542.96 (  0.63%)  21751.54 ( -17.53%)

(Unexpected) Impact is way more than just "noticeable". Furthermore,
'HT' and 'noHT' results are only a little worse than the PV (well,
dom0) case, while 'csc' is much worse. I suspect this may be caused by
the benchmark's tasks running on siblings, even if there are fully idle
cores, but this needs more investigation.

6)                    HVM-v4-HT           HVM-v4-noHT           HVM-v4-csc
MB/sec copy      31917.34 (  0.00%)  33820.44 (  5.96%)  31186.82 (  -2.29%)
MB/sec scale     20741.28 (  0.00%)  22320.82 (  7.62%)  19471.54 (  -6.12%)
MB/sec add       25120.08 (  0.00%)  26553.58 (  5.71%)  22348.62 ( -11.03%)
MB/sec triad     24979.40 (  0.00%)  26075.40 (  4.39%)  21988.46 ( -11.97%)

Not as bad as the 8 vCPU case, but still not great.

7)                   HVMx2-v8-HT         HVMx2-v8-noHT         HVMx2-v8-csc
MB/sec copy      28821.94 (  0.00%)  22508.84 ( -21.90%)  27223.16 (  -5.55%)
MB/sec scale     20544.64 (  0.00%)  15269.06 ( -25.68%)  17813.02 ( -13.30%)
MB/sec add       22774.16 (  0.00%)  16733.48 ( -26.52%)  19659.98 ( -13.67%)
MB/sec triad     22163.52 (  0.00%)  16508.20 ( -25.52%)  19722.54 ( -11.01%)

Under load, core-scheduling does better than 'noHT'. Good! :-D

KERNBENCH
=========

Runs are done with 2, 4, 8 or 16 threads. So, with HT on, we saturate
at 8 threads; with HT off, at 4.

1)                      BM-HT               BM-noHT
Amean elsp-2       199.46 (  0.00%)    196.25 *   1.61%*
Amean elsp-4       114.96 (  0.00%)    108.73 *   5.42%*
Amean elsp-8        83.65 (  0.00%)    109.77 * -31.22%*
Amean elsp-16       84.46 (  0.00%)    112.61 * -33.33%*
Stddev elsp-2        0.09 (  0.00%)      0.05 (  46.12%)
Stddev elsp-4        0.08 (  0.00%)      0.19 (-126.04%)
Stddev elsp-8        0.15 (  0.00%)      0.15 (  -2.80%)
Stddev elsp-16       0.03 (  0.00%)      0.50 (-1328.97%)

Things are (slightly) better on 'noHT' before saturation, but get
substantially worse after that point.
2)                     v-XEN-HT              XEN-HT
Amean elsp-2       234.13 (  0.00%)    233.63 (   0.21%)
Amean elsp-4       139.89 (  0.00%)    140.21 *  -0.23%*
Amean elsp-8        98.15 (  0.00%)     98.22 (  -0.08%)
Amean elsp-16       99.05 (  0.00%)     99.65 *  -0.61%*
Stddev elsp-2        0.40 (  0.00%)      0.14 (  64.63%)
Stddev elsp-4        0.11 (  0.00%)      0.14 ( -33.77%)
Stddev elsp-8        0.23 (  0.00%)      0.40 ( -68.71%)
Stddev elsp-16       0.05 (  0.00%)      0.47 (-855.23%)

3)                    v-XEN-noHT            XEN-noHT
Amean elsp-2       233.23 (  0.00%)    233.16 (   0.03%)
Amean elsp-4       130.00 (  0.00%)    130.43 (  -0.33%)
Amean elsp-8       132.86 (  0.00%)    132.34 *   0.39%*
Amean elsp-16      135.61 (  0.00%)    135.51 (   0.07%)
Stddev elsp-2        0.10 (  0.00%)      0.06 (  39.87%)
Stddev elsp-4        0.04 (  0.00%)      0.67 (-1795.94%)
Stddev elsp-8        0.37 (  0.00%)      0.04 (  89.24%)
Stddev elsp-16       0.20 (  0.00%)      0.34 ( -69.82%)

4)                      XEN-HT             XEN-noHT              XEN-csc
Amean elsp-2       233.63 (  0.00%)    233.16 *   0.20%*    248.12 *  -6.20%*
Amean elsp-4       140.21 (  0.00%)    130.43 *   6.98%*    145.65 *  -3.88%*
Amean elsp-8        98.22 (  0.00%)    132.34 * -34.73%*     98.15 (   0.07%)
Amean elsp-16       99.65 (  0.00%)    135.51 * -35.98%*     99.58 (   0.07%)
Stddev elsp-2        0.14 (  0.00%)      0.06 (  56.88%)      0.19 ( -36.79%)
Stddev elsp-4        0.14 (  0.00%)      0.67 (-369.64%)      0.57 (-305.25%)
Stddev elsp-8        0.40 (  0.00%)      0.04 (  89.89%)      0.03 (  91.88%)
Stddev elsp-16       0.47 (  0.00%)      0.34 (  27.80%)      0.47 (   0.53%)

'csc' proves to be quite effective for this workload. It does cause
some small regressions at low job counts, but it lets us get back all
the perf we lost when disabling Hyperthreading.

5)                    HVM-v8-HT           HVM-v8-noHT           HVM-v8-csc
Amean elsp-2       202.45 (  0.00%)    205.87 *  -1.69%*    218.87 *  -8.11%*
Amean elsp-4       121.90 (  0.00%)    115.41 *   5.32%*    128.78 *  -5.64%*
Amean elsp-8        85.94 (  0.00%)    125.71 * -46.28%*     87.26 *  -1.54%*
Amean elsp-16       87.37 (  0.00%)    128.52 * -47.09%*     88.03 *  -0.76%*
Stddev elsp-2        0.09 (  0.00%)      0.34 (-299.25%)      1.30 (-1433.24%)
Stddev elsp-4        0.44 (  0.00%)      0.27 (  38.66%)      0.60 ( -36.58%)
Stddev elsp-8        0.14 (  0.00%)      0.22 ( -63.81%)      0.35 (-153.50%)
Stddev elsp-16       0.02 (  0.00%)      0.40 (-1627.17%)      0.32 (-1300.67%)

The trend is pretty much the same as in the dom0 case.

6)                    HVM-v4-HT           HVM-v4-noHT           HVM-v4-csc
Amean elsp-2       206.05 (  0.00%)    208.49 *  -1.18%*    237.58 * -15.30%*
Amean elsp-4       114.53 (  0.00%)    115.46 *  -0.82%*    162.61 * -41.98%*
Amean elsp-8       127.22 (  0.00%)    117.63 *   7.54%*    166.22 * -30.66%*
Amean elsp-16      133.70 (  0.00%)    120.53 *   9.85%*    170.05 * -27.19%*
Stddev elsp-2        0.26 (  0.00%)      0.24 (   7.38%)      0.34 ( -30.97%)
Stddev elsp-4        0.29 (  0.00%)      0.23 (  20.88%)      0.24 (  17.16%)
Stddev elsp-8        2.21 (  0.00%)      0.12 (  94.48%)      0.24 (  89.01%)
Stddev elsp-16       2.00 (  0.00%)      0.57 (  71.34%)      0.34 (  82.87%)

Quite bad, this time. Basically, with only 4 vCPUs in the VM, we never
saturate the machine, even in the 'noHT' case. In fact, 'noHT' performs
similarly to 'HT' (even a little better, actually), as could have been
expected. OTOH, 'csc' regresses badly. I did some more investigation,
and found out that, with core-scheduling, we no longer spread the work
among cores. E.g., with 4 cores and 4 busy vCPUs, in the 'HT' case, we
run (most of the time) one vCPU on each core. In the 'csc' case, they
often run on only 2 cores, leaving the other 2 cores idle. This means
lower, but also more deterministic, performance. Whether or not this is
something we like needs to be discussed.
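For the record, the placement just described can be observed from dom0
with something like this (the domain name is just an example):

  # the 'CPU' column shows which pCPU each vCPU runs on; on this host,
  # pCPU pairs (0,1), (2,3), (4,5), (6,7) are core siblings:
  watch -n1 xl vcpu-list vm1
  # the core/thread topology itself can be checked with:
  xl info -n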
7)                   HVMx2-v8-HT         HVMx2-v8-noHT         HVMx2-v8-csc
Amean elsp-2       288.82 (  0.00%)    437.79 * -51.58%*    339.86 * -17.67%*
Amean elsp-4       196.18 (  0.00%)    294.95 * -50.35%*    242.91 * -23.82%*
Amean elsp-8       153.27 (  0.00%)    229.25 * -49.57%*    182.50 * -19.07%*
Amean elsp-16      156.59 (  0.00%)    235.63 * -50.48%*    185.82 * -18.67%*
Stddev elsp-2        0.20 (  0.00%)      0.34 ( -67.23%)      2.56 (-1175.73%)
Stddev elsp-4        0.48 (  0.00%)      0.07 (  86.25%)      0.41 (  15.93%)
Stddev elsp-8        0.39 (  0.00%)      0.81 (-109.26%)      1.30 (-235.77%)
Stddev elsp-16       0.88 (  0.00%)      0.87 (   1.97%)      1.75 ( -98.28%)

In overload, 'csc' regresses noticeably, as compared to 'HT'. But it
improves things a lot, as compared to 'noHT'.

HACKBENCH-PROCESS-PIPE
======================

1)                      BM-HT               BM-noHT
Amean 1            0.3838 (  0.00%)    0.6752 * -75.92%*
Amean 3            1.1718 (  0.00%)    1.9740 * -68.46%*
Amean 5            1.9334 (  0.00%)    2.5968 * -34.31%*
Amean 7            2.6592 (  0.00%)    2.6656 (  -0.24%)
Amean 12           3.0228 (  0.00%)    3.2494 (  -7.50%)
Amean 18           3.4848 (  0.00%)    3.7932 *  -8.85%*
Amean 24           4.1308 (  0.00%)    5.2044 * -25.99%*
Amean 30           4.7290 (  0.00%)    5.8558 * -23.83%*
Amean 32           4.9450 (  0.00%)    6.2686 * -26.77%*
Stddev 1           0.0391 (  0.00%)    0.0308 (  21.27%)
Stddev 3           0.0349 (  0.00%)    0.1551 (-344.67%)
Stddev 5           0.1620 (  0.00%)    0.1590 (   1.90%)
Stddev 7           0.1166 (  0.00%)    0.3178 (-172.48%)
Stddev 12          0.3413 (  0.00%)    0.4880 ( -43.00%)
Stddev 18          0.2202 (  0.00%)    0.0784 (  64.40%)
Stddev 24          0.4053 (  0.00%)    0.3042 (  24.95%)
Stddev 30          0.1251 (  0.00%)    0.1655 ( -32.33%)
Stddev 32          0.0344 (  0.00%)    0.2019 (-486.80%)

'noHT' hurts (as could have been anticipated), even quite significantly
in some cases (actually, I'm not sure why 7, 12 and 18 groups did so
much better than the others...)

2)                     v-XEN-HT              XEN-HT
Amean 1            0.7356 (  0.00%)    0.7470 (  -1.55%)
Amean 3            2.3506 (  0.00%)    2.4632 (  -4.79%)
Amean 5            4.1084 (  0.00%)    4.0418 (   1.62%)
Amean 7            5.8314 (  0.00%)    5.9784 (  -2.52%)
Amean 12           9.9694 (  0.00%)   10.4374 *  -4.69%*
Amean 18          14.2844 (  0.00%)   14.1366 (   1.03%)
Amean 24          19.0304 (  0.00%)   17.6314 (   7.35%)
Amean 30          21.8334 (  0.00%)   21.9030 (  -0.32%)
Amean 32          22.8946 (  0.00%)   22.0216 *   3.81%*
Stddev 1           0.0213 (  0.00%)    0.0327 ( -53.61%)
Stddev 3           0.1634 (  0.00%)    0.0952 (  41.73%)
Stddev 5           0.3934 (  0.00%)    0.1954 (  50.33%)
Stddev 7           0.1796 (  0.00%)    0.2809 ( -56.40%)
Stddev 12          0.1695 (  0.00%)    0.3725 (-119.74%)
Stddev 18          0.1392 (  0.00%)    0.4003 (-187.67%)
Stddev 24          1.5323 (  0.00%)    0.7330 (  52.16%)
Stddev 30          0.4214 (  0.00%)    1.0668 (-153.14%)
Stddev 32          0.8221 (  0.00%)    0.3849 (  53.18%)

With the core-scheduling patches applied, even if 'csc' is not used,
there were runs where the overhead is noticeable (but also other runs
where patched Xen was faster, so...)
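The leftmost column in these tables is the number of hackbench groups;
with the usual hackbench defaults, each group is 20 sender/receiver
pairs, so even a single group (40 tasks) oversubscribes this box. A
sketch of one such data point, using the classic hackbench interface
(the copy MMTests ships may take slightly different options):

  # 1 group of processes communicating over pipes, 10000 loops:
  hackbench -pipe 1 process 10000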
3)                    v-XEN-noHT            XEN-noHT
Amean 1            1.2154 (  0.00%)    1.2200 (  -0.38%)
Amean 3            3.9274 (  0.00%)    3.9302 (  -0.07%)
Amean 5            6.6430 (  0.00%)    6.6584 (  -0.23%)
Amean 7            9.2772 (  0.00%)    9.8144 (  -5.79%)
Amean 12          14.8220 (  0.00%)   14.4464 (   2.53%)
Amean 18          21.6266 (  0.00%)   20.2166 *   6.52%*
Amean 24          27.2974 (  0.00%)   26.1082 *   4.36%*
Amean 30          32.2406 (  0.00%)   31.1702 *   3.32%*
Amean 32          34.8984 (  0.00%)   33.8204 *   3.09%*
Stddev 1           0.0727 (  0.00%)    0.0560 (  22.95%)
Stddev 3           0.2564 (  0.00%)    0.1332 (  48.04%)
Stddev 5           0.6043 (  0.00%)    0.5454 (   9.75%)
Stddev 7           0.6785 (  0.00%)    0.4508 (  33.56%)
Stddev 12          0.5074 (  0.00%)    0.8115 ( -59.92%)
Stddev 18          0.9622 (  0.00%)    0.8868 (   7.84%)
Stddev 24          0.4904 (  0.00%)    0.8453 ( -72.38%)
Stddev 30          0.5542 (  0.00%)    0.6499 ( -17.27%)
Stddev 32          0.4986 (  0.00%)    0.7965 ( -59.77%)

4)                      XEN-HT             XEN-noHT              XEN-csc
Amean 1            0.7470 (  0.00%)    1.2200 * -63.32%*    0.7160 *   4.15%*
Amean 3            2.4632 (  0.00%)    3.9302 * -59.56%*    2.2284 *   9.53%*
Amean 5            4.0418 (  0.00%)    6.6584 * -64.74%*    4.0492 (  -0.18%)
Amean 7            5.9784 (  0.00%)    9.8144 * -64.16%*    5.8718 (   1.78%)
Amean 12          10.4374 (  0.00%)   14.4464 * -38.41%*   10.1488 (   2.77%)
Amean 18          14.1366 (  0.00%)   20.2166 * -43.01%*   14.7450 *  -4.30%*
Amean 24          17.6314 (  0.00%)   26.1082 * -48.08%*   18.0072 (  -2.13%)
Amean 30          21.9030 (  0.00%)   31.1702 * -42.31%*   21.1058 (   3.64%)
Amean 32          22.0216 (  0.00%)   33.8204 * -53.58%*   22.7300 (  -3.22%)
Stddev 1           0.0327 (  0.00%)    0.0560 ( -71.05%)    0.0168 (  48.71%)
Stddev 3           0.0952 (  0.00%)    0.1332 ( -39.92%)    0.0944 (   0.88%)
Stddev 5           0.1954 (  0.00%)    0.5454 (-179.08%)    0.3064 ( -56.79%)
Stddev 7           0.2809 (  0.00%)    0.4508 ( -60.47%)    0.5762 (-105.11%)
Stddev 12          0.3725 (  0.00%)    0.8115 (-117.83%)    0.3594 (   3.53%)
Stddev 18          0.4003 (  0.00%)    0.8868 (-121.54%)    0.2580 (  35.56%)
Stddev 24          0.7330 (  0.00%)    0.8453 ( -15.32%)    0.8976 ( -22.45%)
Stddev 30          1.0668 (  0.00%)    0.6499 (  39.08%)    0.8666 (  18.76%)
Stddev 32          0.3849 (  0.00%)    0.7965 (-106.96%)    1.4428 (-274.86%)

This looks similar to 'kernbench'. 'noHT' hurts the workload badly, but
'csc' seems to come quite a bit to the rescue!

5)                    HVM-v8-HT           HVM-v8-noHT           HVM-v8-csc
Amean 1            0.8992 (  0.00%)    1.1826 * -31.52%*    0.9462 *  -5.23%*
Amean 3            2.4524 (  0.00%)    2.4990 (  -1.90%)    2.5514 *  -4.04%*
Amean 5            3.6710 (  0.00%)    3.5258 (   3.96%)    3.9270 *  -6.97%*
Amean 7            4.6588 (  0.00%)    4.6430 (   0.34%)    5.0420 (  -8.23%)
Amean 12           7.7872 (  0.00%)    7.6310 (   2.01%)    7.6498 (   1.76%)
Amean 18          10.9800 (  0.00%)   12.2772 ( -11.81%)   11.1940 (  -1.95%)
Amean 24          16.5552 (  0.00%)   18.1368 (  -9.55%)   15.1922 (   8.23%)
Amean 30          17.1778 (  0.00%)   24.0352 * -39.92%*   17.7776 (  -3.49%)
Amean 32          19.1966 (  0.00%)   25.3104 * -31.85%*   19.8716 (  -3.52%)
Stddev 1           0.0306 (  0.00%)    0.0440 ( -44.08%)    0.0419 ( -36.94%)
Stddev 3           0.0681 (  0.00%)    0.1289 ( -89.35%)    0.0975 ( -43.30%)
Stddev 5           0.1214 (  0.00%)    0.2722 (-124.14%)    0.2080 ( -71.33%)
Stddev 7           0.5810 (  0.00%)    0.2687 (  53.75%)    0.3038 (  47.71%)
Stddev 12          1.4867 (  0.00%)    1.3196 (  11.24%)    1.5775 (  -6.11%)
Stddev 18          1.9110 (  0.00%)    1.8377 (   3.83%)    1.3230 (  30.77%)
Stddev 24          1.4038 (  0.00%)    1.2938 (   7.84%)    1.2301 (  12.37%)
Stddev 30          2.0035 (  0.00%)    1.3529 (  32.47%)    2.4480 ( -22.19%)
Stddev 32          1.8277 (  0.00%)    1.1786 (  35.52%)    2.4069 ( -31.69%)

Similar to the dom0 case, but it's rather weird how the 'noHT' case is,
in some cases, quite good. Even better than 'HT'. :-O
6)                    HVM-v4-HT           HVM-v4-noHT           HVM-v4-csc
Amean 1            1.4962 (  0.00%)    1.4424 (   3.60%)    1.6824 * -12.44%*
Amean 3            2.5826 (  0.00%)    2.9692 ( -14.97%)    2.9984 ( -16.10%)
Amean 5            3.4330 (  0.00%)    3.5064 (  -2.14%)    4.7144 * -37.33%*
Amean 7            5.3896 (  0.00%)    4.5122 (  16.28%)    5.7828 (  -7.30%)
Amean 12           7.6862 (  0.00%)    8.1164 (  -5.60%)   10.3868 * -35.14%*
Amean 18          10.8628 (  0.00%)   10.8600 (   0.03%)   14.5226 * -33.69%*
Amean 24          16.1058 (  0.00%)   13.9238 (  13.55%)   17.3746 (  -7.88%)
Amean 30          17.9462 (  0.00%)   20.1588 ( -12.33%)   22.2760 * -24.13%*
Amean 32          20.8922 (  0.00%)   15.2004 *  27.24%*   22.5168 (  -7.78%)
Stddev 1           0.0353 (  0.00%)    0.0996 (-181.94%)    0.0573 ( -62.13%)
Stddev 3           0.4556 (  0.00%)    0.6097 ( -33.81%)    0.4727 (  -3.73%)
Stddev 5           0.7936 (  0.00%)    0.6409 (  19.25%)    1.0294 ( -29.71%)
Stddev 7           1.3387 (  0.00%)    0.9335 (  30.27%)    1.5652 ( -16.92%)
Stddev 12          1.1304 (  0.00%)    1.8932 ( -67.48%)    1.6302 ( -44.21%)
Stddev 18          2.6187 (  0.00%)    2.2309 (  14.81%)    0.9304 (  64.47%)
Stddev 24          2.2138 (  0.00%)    1.6892 (  23.70%)    1.4706 (  33.57%)
Stddev 30          1.6264 (  0.00%)    2.1632 ( -33.01%)    1.6574 (  -1.91%)
Stddev 32          3.5272 (  0.00%)    2.5166 (  28.65%)    2.0956 (  40.59%)

7)                   HVMx2-v8-HT         HVMx2-v8-noHT         HVMx2-v8-csc
Amean 1            1.4736 (  0.00%)    1.2224 *  17.05%*    1.6544 * -12.27%*
Amean 3            3.1764 (  0.00%)    2.4984 *  21.34%*    2.9720 *   6.43%*
Amean 5            4.7694 (  0.00%)    3.4870 *  26.89%*    4.4406 *   6.89%*
Amean 7            6.3958 (  0.00%)    4.6320 *  27.58%*    5.9150 *   7.52%*
Amean 12          11.0484 (  0.00%)    7.6534 *  30.73%*    9.7064 (  12.15%)
Amean 18          16.8658 (  0.00%)   12.8028 *  24.09%*   17.0654 (  -1.18%)
Amean 24          23.6798 (  0.00%)   18.8692 *  20.32%*   24.1742 (  -2.09%)
Amean 30          29.6202 (  0.00%)   25.4126 *  14.21%*   30.5034 (  -2.98%)
Amean 32          31.7582 (  0.00%)   28.5946 *   9.96%*   33.0222 *  -3.98%*
Stddev 1           0.0309 (  0.00%)    0.0294 (   5.09%)    0.0526 ( -69.92%)
Stddev 3           0.1102 (  0.00%)    0.0670 (  39.25%)    0.0956 (  13.30%)
Stddev 5           0.2023 (  0.00%)    0.1089 (  46.16%)    0.0694 (  65.68%)
Stddev 7           0.3783 (  0.00%)    0.1408 (  62.79%)    0.2421 (  36.01%)
Stddev 12          1.5160 (  0.00%)    0.3483 (  77.02%)    1.3210 (  12.87%)
Stddev 18          1.3000 (  0.00%)    1.2858 (   1.09%)    0.8736 (  32.80%)
Stddev 24          0.8749 (  0.00%)    0.8663 (   0.98%)    0.8197 (   6.31%)
Stddev 30          1.3074 (  0.00%)    1.4297 (  -9.36%)    1.2882 (   1.46%)
Stddev 32          1.2521 (  0.00%)    1.0791 (  13.81%)    0.7319 (  41.54%)

'noHT' becomes _super_fast_ (as compared to 'HT') when the system is
overloaded. And that is (and continues to be) quite weird. As for
'csc', it's not too bad, as it performs in line with 'HT'.

MUTILATE
========

1)                      BM-HT               BM-noHT
Hmean 1          60932.67 (  0.00%)   60817.19 (  -0.19%)
Hmean 3         150762.70 (  0.00%)  123570.43 * -18.04%*
Hmean 5         197564.84 (  0.00%)  176716.37 * -10.55%*
Hmean 7         178089.27 (  0.00%)  170859.21 *  -4.06%*
Hmean 8         151372.16 (  0.00%)  160742.09 *   6.19%*
Stddev 1           591.29 (  0.00%)     513.06 (  13.23%)
Stddev 3          1358.38 (  0.00%)    1783.77 ( -31.32%)
Stddev 5          3477.60 (  0.00%)    5380.83 ( -54.73%)
Stddev 7          2676.96 (  0.00%)    5119.18 ( -91.23%)
Stddev 8           696.32 (  0.00%)    2888.47 (-314.82%)

Here, it's less clear-cut whether or not HT helps...
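The leftmost column is the concurrency level (presumably the number of
mutilate worker threads, scaled up to NR_CPUS). A sketch of a single
run against a local memcached, using flags from upstream mutilate
(https://github.com/leverich/mutilate); the exact options MMTests
passes may differ:

  # 8 request-generating threads, 30s run, against localhost:
  mutilate -s 127.0.0.1 -T 8 -t 30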
2)                     v-XEN-HT              XEN-HT
Hmean 1          31121.40 (  0.00%)   31063.01 (  -0.19%)
Hmean 3          75847.21 (  0.00%)   75373.02 (  -0.63%)
Hmean 5         113823.13 (  0.00%)  113127.84 (  -0.61%)
Hmean 7         112553.93 (  0.00%)  112467.16 (  -0.08%)
Hmean 8         103472.02 (  0.00%)  103246.75 (  -0.22%)
Stddev 1           147.14 (  0.00%)     175.64 ( -19.37%)
Stddev 3           920.80 (  0.00%)    1017.40 ( -10.49%)
Stddev 5           339.23 (  0.00%)     781.99 (-130.52%)
Stddev 7           248.83 (  0.00%)     404.74 ( -62.66%)
Stddev 8           132.22 (  0.00%)     274.16 (-107.35%)

3)                    v-XEN-noHT            XEN-noHT
Hmean 1          31182.90 (  0.00%)   31107.06 (  -0.24%)
Hmean 3          83473.18 (  0.00%)   84387.92 *   1.10%*
Hmean 5          90322.68 (  0.00%)   91396.40 (   1.19%)
Hmean 7          91082.97 (  0.00%)   91296.28 (   0.23%)
Hmean 8          89529.42 (  0.00%)   89442.61 (  -0.10%)
Stddev 1           369.37 (  0.00%)     336.37 (   8.93%)
Stddev 3           300.44 (  0.00%)     333.46 ( -10.99%)
Stddev 5           842.22 (  0.00%)     816.23 (   3.08%)
Stddev 7           907.80 (  0.00%)    1084.18 ( -19.43%)
Stddev 8           362.90 (  0.00%)     890.80 (-145.47%)

4)                      XEN-HT             XEN-noHT              XEN-csc
Hmean 1          31063.01 (  0.00%)   31107.06 (   0.14%)   29609.98 *  -4.68%*
Hmean 3          75373.02 (  0.00%)   84387.92 *  11.96%*   73419.60 *  -2.59%*
Hmean 5         113127.84 (  0.00%)   91396.40 * -19.21%*  100843.37 * -10.86%*
Hmean 7         112467.16 (  0.00%)   91296.28 * -18.82%*  108439.16 *  -3.58%*
Hmean 8         103246.75 (  0.00%)   89442.61 * -13.37%*  101164.18 *  -2.02%*
Stddev 1           175.64 (  0.00%)     336.37 ( -91.51%)     271.28 ( -54.46%)
Stddev 3          1017.40 (  0.00%)     333.46 (  67.22%)     367.95 (  63.83%)
Stddev 5           781.99 (  0.00%)     816.23 (  -4.38%)     549.77 (  29.70%)
Stddev 7           404.74 (  0.00%)    1084.18 (-167.87%)     376.92 (   6.87%)
Stddev 8           274.16 (  0.00%)     890.80 (-224.93%)     987.59 (-260.23%)

It's a mixed bag even here, but it clearly looks like 'csc' helps
(i.e., it lets us achieve better perf than 'noHT' at and above
saturation).

5)                    HVM-v8-HT           HVM-v8-noHT           HVM-v8-csc
Hmean 1          39583.53 (  0.00%)   38828.65 *  -1.91%*   35103.76 * -11.32%*
Hmean 3          87210.00 (  0.00%)   57874.31 * -33.64%*   47160.22 * -45.92%*
Hmean 5         130676.61 (  0.00%)   42883.21 * -67.18%*   85504.60 * -34.57%*
Hmean 7         157586.93 (  0.00%)   59146.02 * -62.47%*  136717.21 * -13.24%*
Hmean 8         174113.49 (  0.00%)   75503.09 * -56.64%*  159666.07 *  -8.30%*
Stddev 1           160.68 (  0.00%)     232.01 ( -44.39%)     547.91 (-240.99%)
Stddev 3          1158.62 (  0.00%)     654.05 (  43.55%)   18841.25 (-1526.19%)
Stddev 5           298.35 (  0.00%)     542.51 ( -81.84%)    1274.93 (-327.32%)
Stddev 7          2569.12 (  0.00%)    1034.22 (  59.74%)    4163.80 ( -62.07%)
Stddev 8           982.48 (  0.00%)     692.10 (  29.56%)    1324.30 ( -34.79%)

'csc' helps, and it lets us achieve better performance than 'noHT',
overall. Absolute performance looks worse than in dom0, but the 'noHT'
case seems to be affected even more.
6)                    HVM-v4-HT           HVM-v4-noHT           HVM-v4-csc
Hmean 1          37156.88 (  0.00%)   36938.92 (  -0.59%)   34869.19 *  -6.16%*
Hmean 3          91170.71 (  0.00%)   91469.09 (   0.33%)   76214.51 * -16.40%*
Hmean 5         119020.55 (  0.00%)  126949.11 *   6.66%*  109062.82 *  -8.37%*
Hmean 7         119924.66 (  0.00%)  133372.57 *  11.21%*  110466.18 *  -7.89%*
Hmean 8         121329.34 (  0.00%)  136100.78 (  12.17%)  114071.36 *  -5.98%*
Stddev 1           214.02 (  0.00%)     362.58 ( -69.42%)     516.53 (-141.35%)
Stddev 3          2554.47 (  0.00%)    1549.81 (  39.33%)     748.72 (  70.69%)
Stddev 5          1061.32 (  0.00%)    2250.11 (-112.01%)    2377.12 (-123.98%)
Stddev 7          5486.68 (  0.00%)    3214.60 (  41.41%)    4147.45 (  24.41%)
Stddev 8          1857.60 (  0.00%)   12837.92 (-591.10%)    3400.78 ( -83.07%)

7)                   HVMx2-v8-HT         HVMx2-v8-noHT         HVMx2-v8-csc
Hmean 1          30194.08 (  0.00%)    2457.66 * -91.86%*    4940.93 * -83.64%*
Hmean 3          33860.02 (  0.00%)   11209.64 * -66.89%*   23274.26 * -31.26%*
Hmean 5          45606.53 (  0.00%)   15984.89 * -64.95%*   38599.75 ( -15.36%)
Hmean 7          70754.93 (  0.00%)   23045.58 * -67.43%*   55799.74 * -21.14%*
Hmean 8          88529.15 (  0.00%)   36532.64 * -58.73%*   70367.33 * -20.52%*
Stddev 1           214.55 (  0.00%)       2.87 (  98.66%)   13159.62 (-6033.62%)
Stddev 3           440.61 (  0.00%)     179.15 (  59.34%)    4463.14 (-912.95%)
Stddev 5           900.27 (  0.00%)     161.20 (  82.09%)    6286.95 (-598.34%)
Stddev 7           706.22 (  0.00%)     708.81 (  -0.37%)    3845.49 (-444.52%)
Stddev 8          3197.46 (  0.00%)     818.97 (  74.39%)    4477.51 ( -40.03%)

Similar to kernbench: 'csc' does a lot better than 'noHT' in overload.

PGIOPERFBENCH
=============

1)                      BM-HT               BM-noHT
Amean commit         7.72 (  0.00%)      9.73 * -26.14%*
Amean read           6.99 (  0.00%)      8.15 * -16.66%*
Amean wal            0.02 (  0.00%)      0.02 (   3.63%)
Stddev commit        1.35 (  0.00%)      5.48 (-304.84%)
Stddev read          0.61 (  0.00%)      3.82 (-521.73%)
Stddev wal           0.04 (  0.00%)      0.04 (   1.36%)

Commit and read runs are suffering without HT.

2)                     v-XEN-HT              XEN-HT
Amean commit         7.88 (  0.00%)      8.78 * -11.43%*
Amean read           6.85 (  0.00%)      7.71 * -12.54%*
Amean wal            0.07 (  0.00%)      0.07 (   1.15%)
Stddev commit        1.86 (  0.00%)      3.59 ( -92.56%)
Stddev read          0.53 (  0.00%)      2.89 (-450.88%)
Stddev wal           0.05 (  0.00%)      0.05 (  -0.86%)

Significant overhead is introduced by the core scheduler patches, with
sched-gran=cpu and HT enabled.

3)                    v-XEN-noHT            XEN-noHT
Amean commit         8.53 (  0.00%)      7.64 *  10.47%*
Amean read           7.30 (  0.00%)      6.86 *   6.02%*
Amean wal            0.08 (  0.00%)      0.08 (   2.61%)
Stddev commit        4.27 (  0.00%)      1.35 (  68.46%)
Stddev read          3.12 (  0.00%)      0.39 (  87.45%)
Stddev wal           0.04 (  0.00%)      0.04 (   6.28%)

While there is not so much overhead (actually, things are faster) with
sched-gran=cpu and HT disabled.

4)                      XEN-HT             XEN-noHT              XEN-csc
Amean commit         8.78 (  0.00%)      7.64 *  13.03%*      8.20 (   6.63%)
Amean read           7.71 (  0.00%)      6.86 *  10.95%*      7.21 *   6.43%*
Amean wal            0.07 (  0.00%)      0.08 * -12.33%*      0.08 (  -7.96%)
Stddev commit        3.59 (  0.00%)      1.35 (  62.41%)      3.50 (   2.37%)
Stddev read          2.89 (  0.00%)      0.39 (  86.46%)      2.50 (  13.52%)
Stddev wal           0.05 (  0.00%)      0.04 (  11.34%)      0.04 (   6.71%)

Interestingly, when running under Xen and without HT, 'commit' and
'read' runs are not suffering any longer (they're actually faster!);
'wal' is. With core scheduling, 'wal' still slows down, but a little
bit less, and 'commit' and 'read' are still fine (although they speed
up less).

5)                    HVM-v8-HT           HVM-v8-noHT           HVM-v8-csc
Amean commit         7.83 (  0.00%)      7.74 (   1.16%)      7.68 (   1.87%)
Amean read           6.97 (  0.00%)      6.97 (  -0.03%)      6.92 *   0.62%*
Amean wal            0.04 (  0.00%)      0.01 *  79.85%*      0.07 * -84.69%*
Stddev commit        1.60 (  0.00%)      1.54 (   3.79%)      1.59 (   0.35%)
Stddev read          0.53 (  0.00%)      0.55 (  -3.61%)      0.68 ( -27.35%)
Stddev wal           0.05 (  0.00%)      0.03 (  45.88%)      0.05 (   2.11%)

'commit' and 'read' look similar to the dom0 case.
'wal' is weird, as it did very well in the 'noHT' case... I'm not sure
how significant this benchmark is, to be honest.

6)                    HVM-v4-HT           HVM-v4-noHT           HVM-v4-csc
Stddev commit        1.62 (  0.00%)      1.51 (   7.24%)      1.65 (  -1.80%)
Stddev read          0.55 (  0.00%)      0.66 ( -19.98%)      0.57 (  -2.02%)
Stddev wal           0.05 (  0.00%)      0.05 (  -6.25%)      0.05 (  -5.64%)

7)                   HVMx2-v8-HT         HVMx2-v8-noHT         HVMx2-v8-csc
Amean commit         7.94 (  0.00%)      8.31 *  -4.57%*     10.28 * -29.45%*
Amean read           6.98 (  0.00%)      7.91 * -13.20%*      9.47 * -35.58%*
Amean wal            0.01 (  0.00%)      0.07 *-415.01%*      0.24 *-1752.46%*
Stddev commit        1.75 (  0.00%)      1.30 (  26.10%)      2.17 ( -23.63%)
Stddev read          0.67 (  0.00%)      0.79 ( -17.48%)      2.21 (-228.37%)
Stddev wal           0.03 (  0.00%)      0.09 (-157.29%)      0.39 (-1057.77%)

Here, on the other hand, 'csc' regresses against 'noHT'. We've seen
'wal' fluctuating a bit; but in this case, even if we ignore it, we see
that 'commit' and 'read' also slow down significantly.

NETPERF-UDP
===========

1)                      BM-HT               BM-noHT
Hmean send-64      395.57 (  0.00%)    402.32 *   1.70%*
Hmean send-256    1474.96 (  0.00%)   1515.87 *   2.77%*
Hmean send-2048  10592.86 (  0.00%)  11137.39 *   5.14%*
Hmean send-8192  27861.86 (  0.00%)  29965.72 *   7.55%*
Hmean recv-64      395.51 (  0.00%)    402.30 *   1.72%*
Hmean recv-256    1474.96 (  0.00%)   1515.61 *   2.76%*
Hmean recv-2048  10591.01 (  0.00%)  11135.47 *   5.14%*
Hmean recv-8192  27858.04 (  0.00%)  29956.95 *   7.53%*
Stddev send-64       4.61 (  0.00%)      3.01 (  34.70%)
Stddev send-256     28.10 (  0.00%)     11.44 (  59.29%)
Stddev send-2048   164.91 (  0.00%)    112.24 (  31.94%)
Stddev send-8192   930.19 (  0.00%)    416.78 (  55.19%)
Stddev recv-64       4.73 (  0.00%)      3.00 (  36.49%)
Stddev recv-256     28.10 (  0.00%)     11.86 (  57.79%)
Stddev recv-2048   163.52 (  0.00%)    111.83 (  31.61%)
Stddev recv-8192   930.48 (  0.00%)    415.22 (  55.38%)

'noHT' being consistently --although marginally, in most cases-- better
seems to reveal non-ideality in how SMT is handled in the Linux
scheduler.
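The send-*/recv-* rows are message sizes in bytes, as seen from the
sending and the receiving side. Roughly, each data point boils down to
something like this (a sketch; the actual MMTests driver options may
differ):

  # start the netperf daemon, then stream UDP over loopback for 60s
  # with 64-byte send/receive messages:
  netserver
  netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -m 64 -M 64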
2)                     v-XEN-HT              XEN-HT
Hmean send-64      223.35 (  0.00%)    220.23 *  -1.39%*
Hmean send-256     876.26 (  0.00%)    863.81 *  -1.42%*
Hmean send-2048   6575.95 (  0.00%)   6542.13 (  -0.51%)
Hmean send-8192  20882.75 (  0.00%)  20999.88 (   0.56%)
Hmean recv-64      223.35 (  0.00%)    220.23 *  -1.39%*
Hmean recv-256     876.22 (  0.00%)    863.81 *  -1.42%*
Hmean recv-2048   6575.00 (  0.00%)   6541.22 (  -0.51%)
Hmean recv-8192  20881.63 (  0.00%)  20996.13 (   0.55%)
Stddev send-64       1.16 (  0.00%)      0.94 (  19.15%)
Stddev send-256      7.50 (  0.00%)      7.71 (  -2.79%)
Stddev send-2048    29.58 (  0.00%)     57.84 ( -95.56%)
Stddev send-8192   106.01 (  0.00%)    172.69 ( -62.90%)
Stddev recv-64       1.16 (  0.00%)      0.94 (  19.15%)
Stddev recv-256      7.50 (  0.00%)      7.70 (  -2.70%)
Stddev recv-2048    28.84 (  0.00%)     56.58 ( -96.20%)
Stddev recv-8192   105.96 (  0.00%)    167.07 ( -57.68%)

3)                    v-XEN-noHT            XEN-noHT
Hmean send-64      222.53 (  0.00%)    220.12 *  -1.08%*
Hmean send-256     874.65 (  0.00%)    866.59 (  -0.92%)
Hmean send-2048   6595.97 (  0.00%)   6568.38 (  -0.42%)
Hmean send-8192  21537.45 (  0.00%)  21163.69 *  -1.74%*
Hmean recv-64      222.52 (  0.00%)    220.08 *  -1.10%*
Hmean recv-256     874.55 (  0.00%)    866.47 (  -0.92%)
Hmean recv-2048   6595.64 (  0.00%)   6566.75 (  -0.44%)
Hmean recv-8192  21534.73 (  0.00%)  21157.86 *  -1.75%*
Stddev send-64       1.45 (  0.00%)      1.64 ( -13.12%)
Stddev send-256      4.96 (  0.00%)      8.92 ( -79.63%)
Stddev send-2048    35.06 (  0.00%)     64.88 ( -85.05%)
Stddev send-8192   199.63 (  0.00%)    161.66 (  19.02%)
Stddev recv-64       1.44 (  0.00%)      1.62 ( -12.50%)
Stddev recv-256      4.94 (  0.00%)      9.03 ( -82.65%)
Stddev recv-2048    34.99 (  0.00%)     65.80 ( -88.02%)
Stddev recv-8192   201.10 (  0.00%)    162.03 (  19.43%)

4)                      XEN-HT             XEN-noHT              XEN-csc
Hmean send-64      220.23 (  0.00%)    220.12 (  -0.05%)    199.70 *  -9.32%*
Hmean send-256     863.81 (  0.00%)    866.59 (   0.32%)    791.34 *  -8.39%*
Hmean send-2048   6542.13 (  0.00%)   6568.38 (   0.40%)   5758.09 * -11.98%*
Hmean send-8192  20999.88 (  0.00%)  21163.69 (   0.78%)  19706.48 *  -6.16%*
Hmean recv-64      220.23 (  0.00%)    220.08 (  -0.07%)    199.70 *  -9.32%*
Hmean recv-256     863.81 (  0.00%)    866.47 (   0.31%)    791.09 *  -8.42%*
Hmean recv-2048   6541.22 (  0.00%)   6566.75 (   0.39%)   5757.50 * -11.98%*
Hmean recv-8192  20996.13 (  0.00%)  21157.86 (   0.77%)  19703.65 *  -6.16%*
Stddev send-64       0.94 (  0.00%)      1.64 ( -74.89%)     21.30 (-2173.95%)
Stddev send-256      7.71 (  0.00%)      8.92 ( -15.74%)     63.89 (-729.17%)
Stddev send-2048    57.84 (  0.00%)     64.88 ( -12.18%)    112.05 ( -93.72%)
Stddev send-8192   172.69 (  0.00%)    161.66 (   6.39%)    276.33 ( -60.02%)
Stddev recv-64       0.94 (  0.00%)      1.62 ( -72.94%)     21.30 (-2173.74%)
Stddev recv-256      7.70 (  0.00%)      9.03 ( -17.23%)     64.08 (-732.24%)
Stddev recv-2048    56.58 (  0.00%)     65.80 ( -16.28%)    112.00 ( -97.94%)
Stddev recv-8192   167.07 (  0.00%)    162.03 (   3.01%)    276.84 ( -65.70%)

In this case, 'csc' is performing worse than 'noHT', pretty
consistently. This, again, may be due to sub-optimal placement of the
vCPUs with core-scheduling enabled.
5)                    HVM-v8-HT           HVM-v8-noHT           HVM-v8-csc
Hmean send-64      518.85 (  0.00%)    487.85 *  -5.97%*    505.10 (  -2.65%)
Hmean send-256    1955.77 (  0.00%)   1784.00 *  -8.78%*   1931.18 (  -1.26%)
Hmean send-2048  12849.78 (  0.00%)  13027.81 (   1.39%)  13485.59 (   4.95%)
Hmean send-8192  37659.20 (  0.00%)  35773.79 *  -5.01%*  35040.88 *  -6.95%*
Hmean recv-64      518.32 (  0.00%)    487.76 *  -5.89%*    504.86 (  -2.60%)
Hmean recv-256    1954.42 (  0.00%)   1782.90 *  -8.78%*   1930.33 (  -1.23%)
Hmean recv-2048  12830.91 (  0.00%)  13018.47 (   1.46%)  13475.39 (   5.02%)
Hmean recv-8192  37619.84 (  0.00%)  35768.42 *  -4.92%*  35035.40 *  -6.87%*
Stddev send-64      29.69 (  0.00%)      6.43 (  78.34%)     17.36 (  41.53%)
Stddev send-256     64.17 (  0.00%)     26.82 (  58.21%)     33.37 (  47.99%)
Stddev send-2048  1342.98 (  0.00%)    246.21 (  81.67%)    353.97 (  73.64%)
Stddev send-8192   224.20 (  0.00%)    467.40 (-108.48%)    928.70 (-314.24%)
Stddev recv-64      29.96 (  0.00%)      6.54 (  78.16%)     17.64 (  41.11%)
Stddev recv-256     64.29 (  0.00%)     27.28 (  57.56%)     34.17 (  46.85%)
Stddev recv-2048  1353.73 (  0.00%)    251.38 (  81.43%)    368.14 (  72.81%)
Stddev recv-8192   206.98 (  0.00%)    467.59 (-125.91%)    928.73 (-348.71%)

This looks different from the dom0 case. Basically, here, 'noHT' and
'csc' look on par (maybe 'csc' is even a little bit better). Another
difference is that 'noHT' slows down more, as compared to 'HT'.

6)                    HVM-v4-HT           HVM-v4-noHT           HVM-v4-csc
Hmean send-64      513.95 (  0.00%)    511.96 (  -0.39%)    458.46 * -10.80%*
Hmean send-256    1928.30 (  0.00%)   1930.57 (   0.12%)   1831.70 *  -5.01%*
Hmean send-2048  13509.70 (  0.00%)  13802.90 *   2.17%*  12739.81 *  -5.70%*
Hmean send-8192  35966.73 (  0.00%)  36086.45 (   0.33%)  35293.89 (  -1.87%)
Hmean recv-64      513.90 (  0.00%)    511.91 (  -0.39%)    458.34 * -10.81%*
Hmean recv-256    1927.94 (  0.00%)   1930.48 (   0.13%)   1831.50 *  -5.00%*
Hmean recv-2048  13506.19 (  0.00%)  13799.61 *   2.17%*  12734.53 *  -5.71%*
Hmean recv-8192  35936.42 (  0.00%)  36069.50 (   0.37%)  35266.17 (  -1.87%)
Stddev send-64      12.88 (  0.00%)      3.89 (  69.78%)     16.11 ( -25.10%)
Stddev send-256     37.13 (  0.00%)     19.62 (  47.16%)     30.54 (  17.75%)
Stddev send-2048    86.00 (  0.00%)    100.64 ( -17.03%)    311.99 (-262.80%)
Stddev send-8192   926.89 (  0.00%)   1088.50 ( -17.44%)   2801.65 (-202.26%)
Stddev recv-64      12.93 (  0.00%)      3.96 (  69.36%)     16.15 ( -24.90%)
Stddev recv-256     37.76 (  0.00%)     19.75 (  47.68%)     30.68 (  18.73%)
Stddev recv-2048    88.84 (  0.00%)    100.51 ( -13.13%)    313.79 (-253.20%)

7)                   HVMx2-v8-HT         HVMx2-v8-noHT         HVMx2-v8-csc
Hmean send-64      359.65 (  0.00%)    251.84 * -29.98%*    348.95 (  -2.97%)
Hmean send-256    1376.16 (  0.00%)    969.58 * -29.54%*   1293.19 *  -6.03%*
Hmean send-2048  10212.34 (  0.00%)   6893.41 * -32.50%*   8933.39 * -12.52%*
Hmean send-8192  30442.79 (  0.00%)  18372.42 * -39.65%*  25486.82 * -16.28%*
Hmean recv-64      359.01 (  0.00%)    240.67 * -32.96%*    319.95 * -10.88%*
Hmean recv-256    1372.71 (  0.00%)    923.23 * -32.74%*   1206.20 * -12.13%*
Hmean recv-2048  10185.94 (  0.00%)   6258.59 * -38.56%*   8371.03 * -17.82%*
Hmean recv-8192  30272.47 (  0.00%)  16888.08 * -44.21%*  22719.70 * -24.95%*
Stddev send-64       6.49 (  0.00%)      7.35 ( -13.35%)     18.06 (-178.42%)
Stddev send-256     29.41 (  0.00%)     15.84 (  46.15%)     25.77 (  12.38%)
Stddev send-2048    97.12 (  0.00%)    238.24 (-145.32%)    276.14 (-184.34%)
Stddev send-8192   680.48 (  0.00%)    336.65 (  50.53%)    482.57 (  29.08%)
Stddev recv-64       6.80 (  0.00%)      3.56 (  47.64%)      6.62 (   2.60%)
Stddev recv-256     31.00 (  0.00%)     10.11 (  67.39%)     14.94 (  51.80%)
Stddev recv-2048    99.42 (  0.00%)     96.49 (   2.94%)     42.24 (  57.51%)
Stddev recv-8192   759.75 (  0.00%)    108.46 (  85.72%)    745.22 (   1.91%)

And even for this benchmark, for which 'csc' does not perform very well
in general, it does well vs. 'noHT' under load.
NETPERF-TCP
===========

1)                      BM-HT               BM-noHT
Hmean 64          2495.25 (  0.00%)   2509.94 *   0.59%*
Hmean 256         7331.05 (  0.00%)   7350.96 (   0.27%)
Hmean 2048       32003.90 (  0.00%)  31975.79 (  -0.09%)
Hmean 8192       50482.11 (  0.00%)  51254.29 (   1.53%)
Stddev 64           12.54 (  0.00%)      9.28 (  26.01%)
Stddev 256          20.90 (  0.00%)     35.47 ( -69.76%)
Stddev 2048        160.50 (  0.00%)    186.36 ( -16.12%)
Stddev 8192        944.33 (  0.00%)    365.62 (  61.28%)

Basically on par.

2)                     v-XEN-HT              XEN-HT
Hmean 64           674.18 (  0.00%)    677.61 (   0.51%)
Hmean 256         2424.16 (  0.00%)   2448.40 *   1.00%*
Hmean 2048       15114.37 (  0.00%)  15160.59 (   0.31%)
Hmean 8192       32626.11 (  0.00%)  32056.40 (  -1.75%)
Stddev 64            5.99 (  0.00%)      5.93 (   0.98%)
Stddev 256          12.07 (  0.00%)     15.11 ( -25.17%)
Stddev 2048        106.56 (  0.00%)    180.05 ( -68.97%)
Stddev 8192        649.77 (  0.00%)    329.45 (  49.30%)

3)                    v-XEN-noHT            XEN-noHT
Hmean 64           674.89 (  0.00%)    679.64 (   0.70%)
Hmean 256         2426.14 (  0.00%)   2443.24 *   0.70%*
Hmean 2048       15273.97 (  0.00%)  15372.31 *   0.64%*
Hmean 8192       32895.01 (  0.00%)  32473.01 *  -1.28%*
Stddev 64            3.90 (  0.00%)      6.01 ( -54.19%)
Stddev 256           8.14 (  0.00%)     18.76 (-130.35%)
Stddev 2048         35.02 (  0.00%)     86.00 (-145.62%)
Stddev 8192        123.38 (  0.00%)    448.67 (-263.64%)

4)                      XEN-HT             XEN-noHT              XEN-csc
Hmean 64           677.61 (  0.00%)    679.64 (   0.30%)    660.08 (  -2.59%)
Hmean 256         2448.40 (  0.00%)   2443.24 (  -0.21%)   2402.64 *  -1.87%*
Hmean 2048       15160.59 (  0.00%)  15372.31 *   1.40%*  14591.20 *  -3.76%*
Hmean 8192       32056.40 (  0.00%)  32473.01 (   1.30%)  32024.74 (  -0.10%)
Stddev 64            5.93 (  0.00%)      6.01 (  -1.39%)     31.04 (-423.45%)
Stddev 256          15.11 (  0.00%)     18.76 ( -24.13%)     46.90 (-210.39%)
Stddev 2048        180.05 (  0.00%)     86.00 (  52.23%)    339.19 ( -88.39%)
Stddev 8192        329.45 (  0.00%)    448.67 ( -36.18%)   1120.36 (-240.07%)

Also basically on par. Well, 'csc' is a little worse, but not as much
as the UDP case was. This suggests that looking at the differences in
how the two protocols are handled, from a scheduling point of view,
could help understand what is happening.

5)                    HVM-v8-HT           HVM-v8-noHT           HVM-v8-csc
Hmean 64          1756.72 (  0.00%)   1687.44 *  -3.94%*   1644.27 *  -6.40%*
Hmean 256         6335.31 (  0.00%)   6142.90 *  -3.04%*   5850.62 *  -7.65%*
Hmean 2048       29738.54 (  0.00%)  28871.57 *  -2.92%*  27437.72 *  -7.74%*
Hmean 8192       46926.01 (  0.00%)  46011.63 *  -1.95%*  42627.47 *  -9.16%*
Stddev 64           35.93 (  0.00%)     13.81 (  61.56%)     14.56 (  59.48%)
Stddev 256         110.09 (  0.00%)     53.63 (  51.28%)    142.11 ( -29.08%)
Stddev 2048        172.37 (  0.00%)    189.21 (  -9.77%)   1004.81 (-482.94%)
Stddev 8192        453.49 (  0.00%)    239.72 (  47.14%)   1262.32 (-178.36%)

'csc' is not vastly, but consistently, worse than 'noHT'.
6)                    HVM-v4-HT           HVM-v4-noHT           HVM-v4-csc
Stddev 64           20.10 (  0.00%)     10.70 (  46.75%)     45.83 (-128.07%)
Stddev 256          47.00 (  0.00%)     32.37 (  31.14%)     45.15 (   3.95%)
Stddev 2048        205.29 (  0.00%)    148.42 (  27.70%)   1950.42 (-850.08%)
Stddev 8192        515.92 (  0.00%)     62.10 (  87.96%)  11909.19 (-2208.33%)
CoeffVar 64          1.19 (  0.00%)      0.64 (  46.37%)      2.79 (-135.02%)
CoeffVar 256         0.77 (  0.00%)      0.53 (  31.25%)      0.76 (   1.48%)
CoeffVar 2048        0.71 (  0.00%)      0.51 (  27.61%)      7.00 (-887.77%)
CoeffVar 8192        1.13 (  0.00%)      0.13 (  88.04%)     24.76 (-2096.20%)

7)                   HVMx2-v8-HT         HVMx2-v8-noHT         HVMx2-v8-csc
Hmean 64          1272.66 (  0.00%)    996.12 * -21.73%*    968.44 * -23.90%*
Hmean 256         4376.58 (  0.00%)   3565.25 * -18.54%*   4088.77 *  -6.58%*
Hmean 2048       24077.77 (  0.00%)  17496.21 * -27.33%*  21709.20 *  -9.84%*
Hmean 8192       43332.46 (  0.00%)  32506.09 * -24.98%*  36700.18 * -15.31%*
Stddev 64           13.14 (  0.00%)      3.25 (  75.23%)    221.08 (-1582.74%)
Stddev 256          38.64 (  0.00%)     43.54 ( -12.66%)    316.66 (-719.41%)
Stddev 2048        202.53 (  0.00%)    117.33 (  42.07%)    208.33 (  -2.86%)
Stddev 8192        274.71 (  0.00%)    299.39 (  -8.98%)    412.99 ( -50.33%)

Mixed. Some transfer sizes are better with 'csc' than with 'noHT', some
are worse.

NETPERF-UNIX
============

1)                      BM-HT               BM-noHT
Hmean 64           998.53 (  0.00%)   1000.92 (   0.24%)
Hmean 256         3672.82 (  0.00%)   3822.29 *   4.07%*
Hmean 2048        7951.98 (  0.00%)   7987.46 (   0.45%)
Hmean 8192        8218.97 (  0.00%)   8329.87 (   1.35%)
Stddev 64           38.66 (  0.00%)     15.94 (  58.78%)
Stddev 256          99.05 (  0.00%)     93.25 (   5.86%)
Stddev 2048         87.53 (  0.00%)     40.49 (  53.75%)
Stddev 8192        133.15 (  0.00%)     74.00 (  44.42%)

Substantially on par.

2)                     v-XEN-HT              XEN-HT
Hmean 64           499.62 (  0.00%)    492.89 (  -1.35%)
Hmean 256         1900.97 (  0.00%)   1852.77 (  -2.54%)
Hmean 2048        2892.22 (  0.00%)   2765.65 *  -4.38%*
Hmean 8192        3105.85 (  0.00%)   2996.64 *  -3.52%*
Stddev 64            4.49 (  0.00%)     10.64 (-137.15%)
Stddev 256          87.54 (  0.00%)     54.63 (  37.59%)
Stddev 2048         35.26 (  0.00%)     18.20 (  48.39%)
Stddev 8192         21.32 (  0.00%)     15.74 (  26.15%)

3)                    v-XEN-noHT            XEN-noHT
Hmean 64           509.03 (  0.00%)    485.34 *  -4.65%*
Hmean 256         1894.53 (  0.00%)   1835.34 *  -3.12%*
Hmean 2048        2906.15 (  0.00%)   2747.32 *  -5.47%*
Hmean 8192        3158.33 (  0.00%)   2968.85 *  -6.00%*
Stddev 64            3.39 (  0.00%)     20.01 (-489.71%)
Stddev 256          51.32 (  0.00%)     15.13 (  70.52%)
Stddev 2048         33.35 (  0.00%)      9.37 (  71.89%)
Stddev 8192         32.53 (  0.00%)     11.30 (  65.26%)

In both the 'HT' and 'noHT' cases, the overhead of just having the
patches applied is not too big, but it's there. It may be worth
studying this workload, trying to figure out which paths introduce such
overhead (with the goal of optimizing them).

4)                      XEN-HT             XEN-noHT              XEN-csc
Hmean 64           492.89 (  0.00%)    485.34 (  -1.53%)    451.13 *  -8.47%*
Hmean 256         1852.77 (  0.00%)   1835.34 (  -0.94%)   1272.07 * -31.34%*
Hmean 2048        2765.65 (  0.00%)   2747.32 *  -0.66%*   3086.19 (  11.59%)
Hmean 8192        2996.64 (  0.00%)   2968.85 *  -0.93%*   2731.31 *  -8.85%*
Stddev 64           10.64 (  0.00%)     20.01 ( -88.04%)     25.28 (-137.53%)
Stddev 256          54.63 (  0.00%)     15.13 (  72.31%)    393.61 (-620.45%)
Stddev 2048         18.20 (  0.00%)      9.37 (  48.48%)    946.82 (-5103.09%)
Stddev 8192         15.74 (  0.00%)     11.30 (  28.21%)    104.54 (-564.07%)

A mixed bag, with higher (both positive and negative, mostly negative)
peaks.
5)                    HVM-v8-HT           HVM-v8-noHT           HVM-v8-csc
Hmean 64           275.22 (  0.00%)    288.06 (   4.66%)    277.23 (   0.73%)
Hmean 256          977.34 (  0.00%)    877.63 ( -10.20%)    761.25 * -22.11%*
Hmean 2048        4702.72 (  0.00%)   5819.12 (  23.74%)   5531.96 (  17.63%)
Hmean 8192        5962.89 (  0.00%)   6462.60 (   8.38%)   5387.48 (  -9.65%)
Stddev 64           48.37 (  0.00%)     46.27 (   4.34%)     59.79 ( -23.61%)
Stddev 256         123.98 (  0.00%)     48.93 (  60.54%)    163.00 ( -31.47%)
Stddev 2048       1181.41 (  0.00%)    761.99 (  35.50%)    827.15 (  29.99%)
Stddev 8192       1708.12 (  0.00%)   1247.83 (  26.95%)   1065.82 (  37.60%)

What we see here is that 'noHT' deviates more from 'HT', and, for 3
message sizes out of 4, for the better (i.e., 'noHT' is faster than
'HT'). 'csc' is, again, generally worse than 'noHT'.

6)                    HVM-v4-HT           HVM-v4-noHT           HVM-v4-csc
Hmean 64           225.32 (  0.00%)    239.36 (   6.23%)    203.05 (  -9.88%)
Hmean 256          707.19 (  0.00%)    782.19 (  10.60%)    751.77 (   6.30%)
Hmean 2048        2719.26 (  0.00%)   4156.29 *  52.85%*   3624.27 (  33.28%)
Hmean 8192        4263.18 (  0.00%)   5724.08 *  34.27%*   3579.16 ( -16.04%)
Stddev 64           46.49 (  0.00%)     31.12 (  33.07%)     11.06 (  76.21%)
Stddev 256         119.32 (  0.00%)    130.87 (  -9.67%)     87.78 (  26.44%)
Stddev 2048       1030.34 (  0.00%)    907.53 (  11.92%)   1715.05 ( -66.46%)
Stddev 8192        821.15 (  0.00%)    275.68 (  66.43%)   1131.40 ( -37.78%)

7)                   HVMx2-v8-HT         HVMx2-v8-noHT         HVMx2-v8-csc
Hmean 64           243.29 (  0.00%)    165.06 * -32.16%*    188.01 * -22.72%*
Hmean 256          711.91 (  0.00%)    472.92 * -33.57%*    504.46 * -29.14%*
Hmean 2048        3163.60 (  0.00%)   1955.93 * -38.17%*   2880.11 (  -8.96%)
Hmean 8192        4277.51 (  0.00%)   2377.98 * -44.41%*   3086.59 * -27.84%*
Stddev 64           21.43 (  0.00%)     19.08 (  10.96%)     29.45 ( -37.42%)
Stddev 256          86.25 (  0.00%)     50.56 (  41.38%)    112.78 ( -30.76%)
Stddev 2048        311.30 (  0.00%)    317.03 (  -1.84%)    610.38 ( -96.07%)
Stddev 8192        208.86 (  0.00%)    314.18 ( -50.42%)    580.38 (-177.88%)