Wilson HPC Computing Facility

Hardware Information for CPU-based worker nodes


There are two types of CPU-based worker nodes on the Wilson cluster, "intel12" and "amd32".

intel12:

Component Type
CPU Dual-socket six-core (12 cores total) Intel Xeon X5650 "Westmere", 64-bit, 2.67 GHz
Chipset Intel X5520 "Tylersburg"
Memory 12GB DDR3 1333MHz ECC
Motherboard Supermicro X8DT3
Network Two Gigabit Ethernet and one Single Data Rate (10 Gbps) Infiniband interface

These nodes are named tev0101 through tev0213. The network interfaces are configured as follows:

Component  Network     Description
eth0       IPMI        Access to the worker node BMC (ipmitevXXYY)
eth1       Service     Node address on the service network (tevXXYY)
ib0        Infiniband  NFS mounts from this network (ibtevXXYY)
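For illustration, the three interfaces of a single intel12 node can be reached through the hostnames above. This is only a sketch: tev0101 is an arbitrary example node, some of these addresses may only resolve from the head node, and the IPMI network is normally restricted to administrators.

$> ssh tev0101             # log in over the service network (eth1)
$> ping -c 1 ibtev0101     # reach the same node over the IP-over-Infiniband network (ib0)
$> ping -c 1 ipmitev0101   # BMC address used for IPMI management (eth0, administrators only)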

Each intel12 node has a single 250 GB SATA disk, partitioned as follows:

Partition Size Description
sda1 100 MB DOS
sda2 1 GB boot
sda3 14 GB root
sda5 4 GB swap
sda6 212 GB scratch space

The following graphic provides an abstraction of the hierarchical topology of the CPUs installed in the intel12 worker nodes. The graphic was generated using The Portable Hardware Locality (hwloc) software package. The graphic lists NUMA memory nodes, sockets, shared caches and cores.
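The topology graphics for the worker nodes can be regenerated with hwloc's lstopo tool, assuming hwloc is installed on the node; the output file name below is arbitrary.

$> lstopo                          # text or graphical summary of NUMA nodes, sockets, caches and cores
$> lstopo intel12-topology.png     # write the topology diagram to a PNG file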

amd32:

Component Type
CPU AMD Opteron 6128 HE quad-socket eight-core (32 cores total), 2.0 GHz
Memory 64GB DDR2 400 SDRAM
Motherboard Supermicro H8QGL
Network Two Gigabit Ethernet and one Double Data Rate (20 Gbps) Infiniband interface

These nodes are named tev0301 through tev0510: 12 nodes each in racks 3 and 4, and 10 nodes in rack 5. The Ethernet interfaces are configured as follows:

Component  Network     Description
eth0       IPMI        Access to the worker node BMC (ipmitev0301)
eth1       Service     Node address on the service network (tev0301)
ib0        Infiniband  NFS mounts from this network (ibtev0301)

Each amd32 node has a single 2 TB SATA disk, partitioned as follows:

Partition Size Description
sda1 100 MB DOS
sda2 1.9 GB boot
sda3 97 GB root
sda5 4 GB swap
sda6 1.7 TB scratch

The following graphic provides an abstraction of the hierarchical topology of the CPUs installed in the amd32 worker nodes. The graphic was generated using The Portable Hardware Locality (hwloc) software package. The graphic lists NUMA memory nodes, sockets, shared caches and cores.


The intel12 and amd32 worker nodes NFS-mount disks from the head node (tev) and the file server (tevnfs) using IP over Infiniband (IPoIB) as follows:

Mount Description
/usr/local NFS mounted from head node (ibtev). Common user applications, compilers and system tools.
  Backup: YES, daily incremental backups to TiBs.
/home NFS mounted from head node (ibtev). User home area with a 6GB quota limit per user.
  Backup: YES, daily incremental backups to TiBs.
/data NFS mounted from file server (ibtevnfs). User application data area with a 30GB quota limit per user.
  Backup: NO
/fast NFS mounted from file server (ibtev0213). User application high-throughput scratch space area with a 30GB quota limit per project.
  Backup: NO
/fnal/ups NFS mounted from head node (ibtev). Shared Fermilab UPS products area.
  Backup: YES, daily incremental backups to TiBs.
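From any worker node, the state of these NFS mounts can be checked with standard tools; a quick sketch (output omitted):

$> df -h /usr/local /home /data /fast /fnal/ups   # size, usage and NFS source of each mount
$> mount | grep ibtev                             # confirm the mounts come in over the IPoIB interfaces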

Hardware Information for GPU-based worker nodes

Contents

  1. Cluster layout
  2. Accessing GPU hosts
  3. Building Code

Disclaimer

This is not documentation about NVIDIA GPU programming or performance, but rather a set of notes for using the Fermilab GPU cluster.

1. Cluster Layout

The GPU cluster consists of five GPU host servers (gpu1, gpu2, gpu3, gpu4, and ibmpower9).

gpu1: Dual 6-core Intel 2.0GHz E5-2620 "Sandy Bridge" CPUs, 32GB of memory, dual NVIDIA Tesla Kepler K20m GPUs with 5GB of memory each (after ECC overhead)

gpu1 Hardware Topology

gpu1 GPU communication matrix

gpu2: Dual 8-core Intel 2.6GHz E5-2650 "Ivy Bridge" CPUs, 32GB of memory, dual NVIDIA Tesla Kepler K40m GPUs with 12GB of memory each (after ECC overhead)

gpu2 Hardware Topology

gpu2 GPU communication matrix

gpu3: Dual 14-core Intel 2.4GHz E5-2680 "Broadwell" CPUs, 128GB of memory, dual NVIDIA Tesla Pascal P100 GPUs with 17GB of memory each (after ECC overhead), NVLink connection between the GPUs

gpu3 Hardware Topology

gpu3 GPU communication matrix

gpu4: Dual 8-core Intel 1.7GHz E5-2609v4 "Broadwell" CPUs, 768GB of memory, eight NVIDIA Tesla Pascal P100 GPUs with 17GB of memory each (after ECC overhead)

gpu4 Hardware Topology

gpu4 GPU communication matrix

ibmpower9: Dual 16-core IBM Power9 2.6/3.0GHz CPUs, 1TB of memory, four NVIDIA Volta V100 GPUs with 17GB of memory each (after ECC overhead), NVLink connection between the GPUs

ibmpower9 Hardware Topology

ibmpower9 GPU communication matrix
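The "Hardware Topology" and "GPU communication matrix" listings referred to above can be regenerated on any of the GPU hosts with standard tools; a minimal sketch, assuming a reasonably recent NVIDIA driver and the hwloc package are installed:

$> nvidia-smi              # list the installed GPUs, driver version and current utilization
$> nvidia-smi topo -m      # print the GPU/NIC communication matrix (PCIe switch, NVLink, ...)
$> lstopo                  # hwloc view of the host CPU, memory and PCIe topology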

  • Inter-host Networking: QDR Infiniband, connecting only the hosts
  • All hosts above mount /home from the Wilson cluster head node tev.fnal.gov, which is backed up
  • All hosts above NFS-mount /data, which is NOT backed up

2. Accessing GPU Hosts

3. Building Code

For the NVIDIA GPUs, CUDA, a parallel computing platform and programming model invented by NVIDIA, is available on each GPU host under /usr/local/cuda. On a GPU server host, run /usr/local/cuda/bin/nvcc -V to query the CUDA version and cat /proc/driver/nvidia/version to query the NVIDIA driver version.
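As a sketch, a CUDA source file (hello.cu is a hypothetical name here) can be compiled with nvcc; the -arch value depends on the host's GPUs (Kepler K20m/K40m: sm_35, Pascal P100: sm_60, Volta V100: sm_70).

$> /usr/local/cuda/bin/nvcc -V                              # check the installed CUDA toolkit version first
$> /usr/local/cuda/bin/nvcc -arch=sm_60 -o hello hello.cu   # e.g. on gpu3/gpu4 (P100)
$> ./hello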

Hardware Information for Intel Knight's Landing on the Wilson Cluster

Contents

  1. Cluster Layout
  2. Introduction to KNL
  3. Accessing KNL Host
  4. Building Code for the KNL
  5. Code Execution on KNL
  6. Python on KNL
  7. Documentation

1. Cluster Layout

2. Introduction to KNL (Intel Knight's Landing)

The Knight's Landing host attached to the Wilson cluster is a Ninja Developer Platform configured by Colfax (http://dap.xeonphi.com/). It comes fully configured with memory, local storage, CentOS 7.2, and Intel tools. Included in the package is a one-year license for Intel Parallel Studio XE Cluster Edition. The hardware configuration of this machine is as follows:

  • PROCESSOR: Developer Edition of the Intel Xeon Phi processor: single-socket Intel Knight's Landing, 1.30 GHz, 64 cores
  • MEMORY: 96GB (6x 16GB DIMMs) 2133MHz DDR4 main memory, 16GB MCDRAM memory (a.k.a. High Bandwidth Memory)
  • LAN: 2x Intel® i350 Gigabit Ethernet
  • OS: CentOS 7.2
  • Compiler: Intel Parallel Studio XE Cluster Edition

Photo: Ninja Developer Platform (Knight's Landing host)

The following graphic provides an abstraction of the hierarchical topology of the CPU installed in the KNL host. The graphic was generated using the Portable Hardware Locality (hwloc) software package. It lists NUMA memory nodes, sockets, shared caches and cores. A textual version of this graphic is also available.

Photo: Hierarchical topology of the KNL CPU.

On the Wilson cluster we have two different models of the Intel Phi processor: the first generation, codenamed Knights Corner (KNC, the phi1-phi4 nodes), and the second generation, known as Knight's Landing (KNL, the knl1 node). The KNC cores are very similar to the original Pentium cores from circa 1993, except:

• They are 64-bit (x86_64).
• They execute code strictly in order, so there is no branch prediction or speculative execution.
• They have 512-bit-wide vector floating point units, instead of the 128-bit-wide (SSE) or 256-bit-wide (AVX) units on Xeon and AMD Opteron processors.

Each of the 5110P Phi cards has 60 of these cores. The latency of the floating point units is at least 4 cycles depending on the instruction, and the cores maintain 4 hardware threads of execution. So to achieve good performance, the code should keep as many as 4 of these threads busy on each core. Ideally, each of the 60 cores, properly fed, will retire a 512-bit-wide floating point instruction per cycle. Since fused multiply-adds are supported, this works out to 60 x 8 x 2 x 1052 MHz = 1.0 TFlop/sec (double precision) per Phi card.

The Linux kernel running on the KNC cards sees each of the hardware threads as a separate core, so the kernel reports a total of 240 cores. The cores share the 8 GB of system memory installed on the KNC card.

The second-generation Knight's Landing (KNL) introduced many improvements over the KNC coprocessors:

• It is a stand-alone processor, i.e., the KNL node does not have a separate "host" processor.
• It introduces a new memory architecture utilizing two types of memory: 'close-to-metal' configurable MCDRAM and conventional DDR.
• It introduces new AVX-512 vector instructions, which are architecturally consistent with the AVX2 instructions supported by Haswell and Broadwell CPUs.
• It is binary compatible with prior Intel processors.

Combined with a new fabrication process, KNL triples both scalar and vector performance compared with KNC and therefore offers up to 3.0 TFlop/sec (double precision) per processor.

The KNL design has a new compute unit, the tile, which consists of two two-wide, out-of-order cores, each supporting four hyper-threads. The 64-core version of the KNL chip thus comprises 32 tiles in total, communicating via an on-die, cache-coherent, two-dimensional mesh interconnect. The mesh supports three modes:

1. all-to-all
2. quadrant (the default)
3. sub-NUMA clustering

The all-to-all mode might be required when memory capacity is not uniform across all channels, while sub-NUMA mode is preferable for applications which use more than one MPI rank per processor.

The KNL memory architecture has two types of memory: the high-bandwidth memory integrated on package (MCDRAM), with a capacity of up to 16 GB and peak bandwidth over 450 GB/s, and external DDR, with a capacity of up to 384 GB (64 GB per channel) and peak bandwidth around 90 GB/s. The memory can be configured at boot time in one of three modes:

1. cache (MCDRAM is treated as a cache for DDR)
2. flat (MCDRAM is treated as standard memory in the same address space as DDR)
3. hybrid (a portion of MCDRAM is cache and the remainder is flat)

To change the default mesh or memory mode settings, please email us at tev-admin@fnal.gov.

The following graphic summarizes the various High Bandwidth Memory (HBM) modes:

Photo: High Bandwidth Memory modes. Image courtesy of anandtech.com.

3. Accessing KNL Host

4. Building Code for the KNL

Intel Parallel Studio XE Cluster Edition version 16.0.3 is available on the KNL host. To set up your environment (PATH, LD_LIBRARY_PATH, etc.), do the following AFTER submitting an "interactive" job to the knl host as instructed in section 3.2 above:
$> source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh intel64
To build for the KNL processors, use the "-xMIC-AVX512" architecture flag on your compile and link lines. For binary compatibility with other AVX-512-capable Intel processors, use "-xCOMMON-AVX512".
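For example, an OpenMP program could be built as in the sketch below. Here omphello.c is a hypothetical source file; icc and the flags above are provided by Intel Parallel Studio.

$> source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh intel64
$> icc -O3 -qopenmp -xMIC-AVX512 -o omphelloknl omphello.c          # KNL-only AVX-512 binary
$> icc -O3 -qopenmp -xCOMMON-AVX512 -o omphello.common omphello.c   # also runs on other AVX-512 CPUs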

5. Code Execution on KNL

KNL code example: in /phihome/djholm/test there is the standard "Hello, world" C-language example, compiled for the Phi (hello.knl). Below is an example of running it directly:

    [tev.fnal.gov:]$> qsub -q knl -l nodes=1:knl -A phiadmin -I
    qsub: waiting for job 115338.tev.fnal.gov to start
    qsub: job 115338.tev.fnal.gov ready
    
    PBS prologue
    
    knl1:~$ cd /phihome/djholm/test
    
    knl1:/phihome/djholm/test$ ./hello.knl 
      Hello, world
    
To run a multithreaded application, set up the environment variables as follows:
$> export KMP_AFFINITY=compact,granularity=thread
For example, to execute the OpenMP version of "Hello World", omphelloknl, with 64 threads, do:
$> export KMP_PLACE_THREADS=1s,64c,1t
This will launch omphelloknl on the KNL processor using one thread per core. To run it with 4 threads per core instead, set:
$> export KMP_PLACE_THREADS=1s,64c,4t
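Putting these settings together, a complete interactive run might look like the sketch below, reusing the omphelloknl example; the binary path is illustrative only.

knl1:~$ source /opt/intel/parallel_studio_xe_2016.3.067/psxevars.sh intel64
knl1:~$ export KMP_AFFINITY=compact,granularity=thread
knl1:~$ export KMP_PLACE_THREADS=1s,64c,1t     # 64 OpenMP threads, one per core
knl1:~$ ./omphelloknl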
Notes on KNL NUMA management: all of the MCDRAM can be used as program-allocatable memory (flat mode). In this mode, the entirety of the DDR space and the MCDRAM space is visible to the operating system and applications, exposed as separate NUMA nodes. To find out the NUMA node distribution, use the numactl utility with the "-H" switch. For example, in the quadrant or all-to-all cluster mode with flat memory mode, numactl -H will show 2 NUMA nodes, with MCDRAM corresponding to node 1. To run an application with all allocations going to MCDRAM, prefix it with:

numactl -m 1
If the memory mode is set to cache, there is only one NUMA memory node, numbered zero, which corresponds to DDR.

In the most complicated case of sub-NUMA cluster mode with flat memory mode, the number of NUMA nodes is assigned according to the division of the KNL tile cluster into NUMA sub-domains. The DDR partitioning is listed first and the MCDRAM partitioning last.
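As a sketch, assuming flat memory mode in quadrant cluster mode, and reusing the hypothetical omphelloknl binary from above:

knl1:~$ numactl -H                    # list NUMA nodes; in flat mode MCDRAM appears as node 1
knl1:~$ numactl -m 1 ./omphelloknl    # bind all memory allocations to MCDRAM (node 1)
knl1:~$ numactl -p 1 ./omphelloknl    # prefer MCDRAM but fall back to DDR when it fills up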

6. Python on KNL

Intel Python versions 2.7 and 3.5 are installed under /opt/intel/intelpython27/ and /opt/intel/intelpython35/. Machine-learning frameworks and modules such as Caffe, TensorFlow, Lasagne and Theano are also installed under those areas.
    [@knl ~]$ export PATH=/opt/intel/intelpython27/bin/:$PATH
    [@knl ~]$ python -V
    Python 2.7.12 :: Intel Corporation
    
    [@knl ~]$ python
    Python 2.7.12 |Intel Corporation| (default, Aug 15 2016, 04:18:18) 
    [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Intel(R) Distribution for Python is brought to you by Intel Corporation.
    Please check out: https://software.intel.com/en-us/python-distribution
    >>> import lasagne; print lasagne.__version__
    0.1
    >>> import theano; print theano.__version__
    0.8.2.dev-901275534cbfe3fbbe290ce85d1abf8bb9a5b203
    >>> import tensorflow; print tensorflow.__version__
    0.10.0rc0
    
    [@knl ~]$ which caffe
    /opt/intel/intelpython27/bin/caffe
    [@knl ~]$ caffe --version
    caffe version 1.0.0-rc3
    
To use Python 3.5 instead, replace intelpython27 with intelpython35 in the example above.
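For instance (a sketch; only the path changes):

[@knl ~]$ export PATH=/opt/intel/intelpython35/bin/:$PATH
[@knl ~]$ python -V    # should now report the Intel distribution of Python 3.5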

7. Documentation

A presentation on using the Intel VTune Amplifier XE to tune software on the Intel Xeon Phi code-named Knights Landing (KNL) is available here. A more recent copy of this document will also be available from Intel's site here.

Contact: Webadmin
Last modified: Jan 17, 2020