Wilson HPC Computing Facility

Intel Knights Landing on the Wilson Cluster

Contents

  1. Cluster Layout
  2. Introduction to KNL
  3. Accessing KNL Host
  4. Building Code for the KNL
  5. Code Execution on KNL
  6. Python on KNL
  7. Documentation

1. Cluster Layout

2. Introduction to KNL (Intel Knights Landing)

The Knights Landing host attached to the Wilson cluster is a Ninja Developer Platform configured by Colfax (http://dap.xeonphi.com/). It comes fully configured with memory, local storage, CentOS 7.2, and Intel tools. The package includes a one-year license for Intel Parallel Studio XE Cluster Edition. The hardware configuration of this machine is as follows:
  • PROCESSOR: Developer Edition of Intel Xeon Phi Processor: single-socket Intel Knights Landing, 1.30 GHz, 64 cores
  • MEMORY: 96 GB (6 x 16 GB DIMMs) 2133 MHz DDR4 main memory, 16 GB MCDRAM (a.k.a. High Bandwidth Memory)
  • LAN: 2x Intel® i350 Gigabit Ethernet
  • OS: CentOS 7.2
  • Compiler: Intel Parallel Studio XE Cluster Edition

Photo: Ninja Developer Platform: Knights Landing host

The following graphic provides an abstraction of the hierarchical topology of the CPU installed in the KNL host. The graphic was generated using the Portable Hardware Locality (hwloc) software package. It lists NUMA memory nodes, sockets, shared caches and cores. A textual version of this graphic is available here.


Photo: Hierarchical topology of KNL CPU.

On the Wilson cluster we have two different models of the Intel Xeon Phi processor: the first generation, code-named Knights Corner (KNC, these are the phi1-phi4 nodes), and the second generation, code-named Knights Landing (KNL, this is the knl1 node). KNC cores are very similar to the original Pentium cores from circa 1993, except:

  • They are 64 bit (x86_64).
  • They execute instructions strictly in order (no out-of-order execution).
  • They have 512-bit-wide vector floating point units, instead of the 128-bit-wide (SSE) or 256-bit-wide (AVX) units on Xeon and AMD Opteron processors.
Each of the 5110P Phi cards has 60 of these cores. The latency of the floating point units is at least 4 cycles, depending on the instruction, and each core maintains 4 hardware threads of execution. To achieve good performance, the code should therefore keep as many as 4 threads busy on each core. Ideally, each of the 60 cores, properly fed, will retire one 512-bit-wide floating point instruction per cycle. A 512-bit vector holds 8 double-precision values, and a fused multiply-add counts as 2 floating point operations, so this works out to 60 x 8 x 2 x 1052 MHz = 1.0 TFlop/sec (double precision) per Phi card.
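
The peak-rate arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the calculation as stated in the text; the function name peak_dp_flops is hypothetical, and it assumes one fused multiply-add per core per cycle.

```python
def peak_dp_flops(cores, vector_bits, clock_hz):
    """Theoretical double-precision peak in flop/s.

    Assumes each core retires one fused multiply-add vector
    instruction per cycle, as described in the text above.
    """
    lanes = vector_bits // 64   # number of 64-bit doubles per vector
    flops_per_fma = 2           # one multiply plus one add per lane
    return cores * lanes * flops_per_fma * clock_hz


# KNC 5110P figures quoted in the text: 60 cores, 512-bit vectors, 1052 MHz.
knc_peak = peak_dp_flops(cores=60, vector_bits=512, clock_hz=1052e6)
print("KNC 5110P peak: %.2f TFlop/s" % (knc_peak / 1e12))
```

Real applications rarely reach this figure, since it ignores memory bandwidth and instruction mix.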

The Linux kernel running on the KNC cards sees each of the hardware threads as a separate core. So, the kernel reports a total of 240 cores. The cores share the 8 GB of system memory installed on the KNC card.

The second-generation Knights Landing (KNL) processor introduces several improvements over the KNC coprocessors:

  • This is a stand-alone processor, i.e., the KNL node does not have a separate "host" processor.
  • It introduces a new memory architecture utilizing two types of memory: 'close-to-metal' configurable MCDRAM and conventional DDR.
  • It introduces new AVX-512 vector instructions, which are architecturally consistent with the AVX2 instructions supported by Haswell and Broadwell CPUs.
  • It is binary compatible with prior Intel processors.
Combined with a new fabrication process, these changes let the KNL triple both scalar and vector performance compared with KNC, for up to 3.0 TFlop/sec (double precision) per processor.

The KNL design has a new compute unit, the tile, which consists of two two-wide, out-of-order cores, each supporting four hardware threads. The 64-core version of the KNL chip thus comprises 32 tiles, which communicate via an on-die, cache-coherent, two-dimensional mesh interconnect. The mesh supports three modes:

  1. all-to-all
  2. quadrant (set by default)
  3. sub-NUMA clustering.
The all-to-all mode might be required when memory capacity is not uniform across all channels, while sub-NUMA clustering is preferable for applications that use more than one MPI rank per processor.

The KNL memory architecture has two types of memory: MCDRAM, the high-bandwidth memory integrated on the package, with a capacity of up to 16 GB and a peak bandwidth of over 450 GB/s; and external DDR, with a capacity of up to 384 GB (64 GB per channel) and a peak bandwidth of around 90 GB/s. The memory can be configured at boot time in one of the following three modes:
  1. cache (MCDRAM is treated as cache for DDR)
  2. flat (MCDRAM is treated as standard memory in the same address space as DDR)
  3. hybrid (a portion of MCDRAM is cache and remaining is flat).
To change the default mesh or memory mode settings, please email us at tev-admin@fnal.gov.

The following graphic summarizes the various High Bandwidth Memory (HBM) modes:

Photo: High Bandwidth Memory Modes. Image courtesy of anandtech.com

3. Accessing KNL Host

4. Building Code for the KNL

Intel Parallel Studio XE Cluster Edition version 16.0.3 is available on the KNL host. To set up your environment (PATH, LD_LIBRARY_PATH, etc.), run the following after submitting an "interactive" job to the KNL host as instructed in section 3.2 above.
$> source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh intel64
To build for the KNL processor, use the "-xMIC-AVX512" architecture flag on your compile and link lines. To build binaries that also run on other processors supporting AVX-512, use "-xCOMMON-AVX512" instead.

5. Code Execution on KNL

KNL code example: The directory /phihome/djholm/test contains the standard "Hello, world" C-language example, compiled for Phi (hello.knl). The following session runs it directly on the KNL host:

[tev.fnal.gov:]$> qsub -q knl -l nodes=1:knl -A phiadmin -I
qsub: waiting for job 115338.tev.fnal.gov to start
qsub: job 115338.tev.fnal.gov ready

PBS prologue

knl1:~$ cd /phihome/djholm/test

knl1:/phihome/djholm/test$ ./hello.knl 
  Hello, world
To run a multithreaded application, set the thread affinity as follows (note that the value must not contain spaces):
$> export KMP_AFFINITY=compact,granularity=thread
For example, to execute the OpenMP version of "Hello, world", omphelloknl, with 64 threads, set the following before running it:
$> export KMP_PLACE_THREADS=1s,64c,1t
With this setting, omphelloknl runs on the KNL processor using one thread per core. To run it with four threads per core (256 threads in total), set
$> export KMP_PLACE_THREADS=1s,64c,4t
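
As a quick check of how many threads a given setting requests, the value can be decoded as sockets x cores x threads per core. The sketch below handles only the three-field form used above (for example 1s,64c,4t); the function names are hypothetical and the optional offset syntax of KMP_PLACE_THREADS is deliberately not covered.

```python
import re


def parse_place_threads(spec):
    """Parse a value such as '1s,64c,4t' into {'s': 1, 'c': 64, 't': 4}."""
    counts = {}
    for field in spec.split(","):
        match = re.fullmatch(r"(\d+)([sct])", field.strip())
        if match is None:
            raise ValueError("unsupported field: %r" % field)
        counts[match.group(2)] = int(match.group(1))
    if set(counts) != {"s", "c", "t"}:
        raise ValueError("expected sockets, cores and threads: %r" % spec)
    return counts


def total_threads(spec):
    """Total threads requested: sockets x cores x threads per core."""
    counts = parse_place_threads(spec)
    return counts["s"] * counts["c"] * counts["t"]


print(total_threads("1s,64c,1t"))  # one thread per core
print(total_threads("1s,64c,4t"))  # four threads per core
```

For the two settings shown above, this gives 64 and 256 threads respectively.
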
Notes on KNL NUMA management: In flat mode, all of the MCDRAM can be used as program-allocatable memory. The entirety of the DDR space and the MCDRAM space is visible to the operating system and to applications, and the two are exposed as separate NUMA nodes. To list the NUMA nodes, use the numactl utility with the "-H" switch. For example, in the quadrant or all-to-all cluster mode with flat memory mode, numactl -H shows two NUMA nodes, with MCDRAM corresponding to node 1. To run an application with all allocations going to MCDRAM, prefix its command line with:

numactl -m 1 
If the memory mode is set to cache, there is only one NUMA memory node, numbered zero, which corresponds to DDR.

In the most complicated case, sub-NUMA clustering combined with flat memory mode, NUMA nodes are assigned according to the division of the KNL tiles into NUMA sub-domains. The DDR partitions are listed first and the MCDRAM partitions last.
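
The node numbering described in these notes can be summarized in a short Python sketch. The function names and mode strings are hypothetical, and the number of sub-NUMA domains (snc_domains, 4 by default) is an assumption that is not stated above; always confirm the actual layout with numactl -H on the host.

```python
def numa_layout(cluster_mode, memory_mode, snc_domains=4):
    """Return the memory kind of each NUMA node, indexed by node number.

    cluster_mode: "all-to-all", "quadrant" or "snc" (hypothetical labels).
    memory_mode:  "flat" or "cache"; hybrid mode is not modeled here.
    snc_domains:  assumed number of sub-NUMA domains (not from the text).
    """
    domains = snc_domains if cluster_mode == "snc" else 1
    ddr_nodes = ["DDR"] * domains
    if memory_mode == "cache":
        return ddr_nodes                 # MCDRAM is hidden behind DDR
    if memory_mode == "flat":
        # DDR partitions are listed first, MCDRAM partitions last.
        return ddr_nodes + ["MCDRAM"] * domains
    raise ValueError("unsupported memory mode: %r" % memory_mode)


def membind_args(layout):
    """Build a numactl prefix binding allocations to the MCDRAM nodes."""
    nodes = [str(i) for i, kind in enumerate(layout) if kind == "MCDRAM"]
    return ["numactl", "-m", ",".join(nodes)] if nodes else []


print(membind_args(numa_layout("quadrant", "flat")))
print(membind_args(numa_layout("quadrant", "cache")))
```

In cache mode the prefix is empty, because there is no separate MCDRAM node to bind to.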

6. Python on KNL

Intel Python versions 2.7 and 3.5 are installed under /opt/intel/intelpython27/ and /opt/intel/intelpython35/. Machine learning frameworks and modules such as Caffe, TensorFlow, Lasagne and Theano are also installed under those areas.
[@knl ~]$ export PATH=/opt/intel/intelpython27/bin/:$PATH
[@knl ~]$ python -V
Python 2.7.12 :: Intel Corporation

[@knl ~]$ python
Python 2.7.12 |Intel Corporation| (default, Aug 15 2016, 04:18:18) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
>>> import lasagne; print lasagne.__version__
0.1
>>> import theano; print theano.__version__
0.8.2.dev-901275534cbfe3fbbe290ce85d1abf8bb9a5b203
>>> import tensorflow; print tensorflow.__version__
0.10.0rc0

[@knl ~]$ which caffe
/opt/intel/intelpython27/bin/caffe
[@knl ~]$ caffe --version
caffe version 1.0.0-rc3
To use Python 3.5 instead, replace intelpython27 with intelpython35 in the above example.
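
The interactive version checks above use Python 2 print statements, which do not work under Python 3.5. A minimal sketch that runs unchanged under both interpreters is shown below; the helper name module_version is hypothetical, and a module that is not installed is reported rather than raising an error.

```python
import importlib


def module_version(name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Some modules do not define __version__ at all.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    for name in ("lasagne", "theano", "tensorflow"):
        version = module_version(name)
        print("%s: %s" % (name, version if version else "not installed"))
```

Run it with whichever interpreter is first on PATH to see what that installation provides.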

7. Documentation

A presentation on using the Intel VTune Amplifier XE to tune software on the Intel Xeon Phi processor code-named Knights Landing (KNL) is available here. A more recent copy of this document will also be available from Intel's site here.

Contact: Amitoj Singh
Last modified: Nov 6, 2017