Getting Started with Kokkos-MOOSE

Introduction

Kokkos-MOOSE is the part of the MOOSE framework that utilizes Kokkos to port the foundational components of MOOSE to GPUs while retaining the original design as much as possible. Kokkos provides a unified programming model for different GPU architectures by abstracting hardware-specific details, enabling developers to write a single codebase that can be compiled and executed on various platforms without extensive code modification. Maintaining portability across architectures is challenging in today's rapidly evolving hardware landscape, and this portability makes Kokkos well suited as a programming model for a framework like MOOSE, which serves a diverse user base. While Kokkos-MOOSE has only been tested with NVIDIA GPUs thus far, support for AMD and Intel GPUs will be added in the future as we diversify our testing hardware environment at Idaho National Laboratory (INL).

Installing Kokkos-MOOSE

Installation instructions for Kokkos-MOOSE can be found on this page.

Developing with Kokkos-MOOSE

Programming for GPUs differs fundamentally from ordinary CPU programming in many respects. Here we describe programming practices for writing efficient and robust GPU code in Kokkos-MOOSE.

Separate Memory Space

Except for a few rare architectures that provide a physically unified memory space for the CPU and GPU (such as the AMD MI300A), most GPUs have memory spaces separate from their counterpart CPUs. Therefore, you need to take special care to identify which data are accessible on the CPU or GPU and whether the data on the CPU and GPU are properly synchronized. Standard containers such as std::vector, std::set, and std::map are not usable on GPU, and managed pointers are also inaccessible on GPU. In general, it is safe to assume that any dynamically-allocated CPU data cannot be accessed on GPU. Therefore, we provide alternative data containers to be used on GPU: Moose::Kokkos::Array and Moose::Kokkos::Map.

Moose::Kokkos::Array is a template class designed to hold data of arbitrary type. It takes two template arguments: the data type and the dimension. It supports multi-dimensional indexing, with up to five dimensions. The dimension can be specified either through the second template argument, which defaults to one, or using type aliases: for instance, a three-dimensional array of type double can be declared as either Array<double, 3> or Array3D<double>. The entries of an array can be accessed with either operator() using multi-dimensional indices or operator[] using a dimensionless (1D) index, and they automatically return either CPU or GPU data depending on where they are accessed. Arrays can be allocated through the following APIs: create(), createHost(), and createDevice(). create() allocates memory on both CPU and GPU, while createHost() and createDevice() allocate memory only on the CPU or GPU, respectively. It is important to note that if the creation APIs are called on an already initialized array, the original array will be destroyed and a new array will be created. Namely, you cannot allocate data on both CPU and GPU by calling createHost() and createDevice() separately. Calling the creation APIs on a shallow copy of an array will disassociate it from the original array.
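As a minimal sketch of the basic workflow (the sizes and values here are illustrative, not from any real input):

```cpp
// A two-dimensional array of Real; equivalently, the Array2D<Real> alias could be used
Moose::Kokkos::Array<Real, 2> data;

// Allocate a 3 x 4 array on both CPU and GPU
data.create({3, 4});

// Populate on CPU with multi-dimensional indexing via operator()
for (dof_id_type j = 0; j < 4; ++j)
  for (dof_id_type i = 0; i < 3; ++i)
    data(i, j) = i + j;

// operator[] uses a dimensionless (1D) index into the same data
data[0] = 1;

// Synchronize the GPU data with the populated CPU data
data.copyToDevice();
```

Because the creation call allocated both sides, the single copyToDevice() at the end is all that is needed before the array can be read inside a device kernel.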

Instead of allocating memory, an array can also wrap an external pointer. Instead of calling create(), you can call init(), which only initializes dimensional information without allocating data, and then call aliasHost() or aliasDevice() to let the CPU or GPU data pointer wrap an external pointer. You can also combine createHost() or createDevice() with aliasHost() or aliasDevice() to allocate data on one side and alias data on the other. In this case, createHost() or createDevice() should be called first. The following example wraps a CPU PETSc vector and creates its GPU clone with an array:


PetscScalar * petsc_ptr;
PetscInt petsc_size;
VecGetArray(petsc_vector, &petsc_ptr);
VecGetLocalSize(petsc_vector, &petsc_size);

Array<PetscScalar> vector;

vector.createDevice(petsc_size);
vector.aliasHost(petsc_ptr);
vector.copyToDevice();
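Conversely, once device kernels have updated the GPU clone, the results can be written back into the PETSc vector through the aliased CPU pointer (a sketch continuing the example above; remember to return the raw array to PETSc when finished):

```cpp
// ... device kernels update the GPU data of 'vector' ...

// Copies the GPU data back to the CPU side, which aliases petsc_ptr,
// so this writes directly into the PETSc vector
vector.copyToHost();

// Return the raw array to PETSc
VecRestoreArray(petsc_vector, &petsc_ptr);
```

Because the host side is an alias rather than an allocation, destroying the array afterwards does not free the PETSc storage.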

If the data type is not default-constructible, create() will only allocate a raw block of uninitialized memory using malloc(). It is your responsibility to loop over the array and perform placement new to properly construct each entry. For example:


Array<NotDefaultConstructible> data;

data.create(n);

for (auto & datum : data)
  new (&datum) NotDefaultConstructible(...);
Note: The GPU memory is always uninitialized upon allocation.

Because the CPU and GPU memory spaces are separate, there is no automatic synchronization of data. Therefore, data should be explicitly copied between them whenever one side is updated. Arrays provide three APIs for this purpose: copyToHost(), copyToDevice(), and copyToDeviceNested(). copyToHost() and copyToDevice() copy data from GPU to CPU and from CPU to GPU, respectively, as their names imply. copyToDeviceNested() is a shortcut for copying data from CPU to GPU for a nested array like Array<Array<...>>, and it calls copyToDevice() for all the inner arrays and the array itself in the correct order. When creating a nested array, it is important to note that the arrays should be copied from the innermost to the outermost in order:


Array<Array<Array<Real>>> data;

data.create(n1);

for (unsigned int i = 0; i < n1; ++i)
{
  data[i].create(n2);

  for (unsigned int j = 0; j < n2; ++j)
  {
    data[i][j].create(n3);
    ... // Populating data
    data[i][j].copyToDevice();
  }

  data[i].copyToDevice();
}

data.copyToDevice();

The reason copyToDevice() should also be called for the outer arrays is that the pointers held by the inner arrays change when their memory is allocated. However, these nested calls can be replaced by a single call to copyToDeviceNested() on the outermost array:


Array<Array<Array<Real>>> data;

data.create(n1);

for (unsigned int i = 0; i < n1; ++i)
{
  data[i].create(n2);

  for (unsigned int j = 0; j < n2; ++j)
  {
    data[i][j].create(n3);
    ... // Populating data
  }
}

data.copyToDeviceNested();
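Copying in the opposite direction works the same way with copyToHost(). A convenient property, visible in the source below, is that copyToHost() allocates host memory on demand if the array is not shared and was created on the device only. A sketch (the array name and size are illustrative):

```cpp
Array<Real> result;

// Device-only allocation
result.createDevice(n);

// ... device kernels populate 'result' ...

// Allocates host memory on demand and copies the GPU data to the CPU
result.copyToHost();

// On the host, operator[] now returns the copied CPU data
Real sum = 0;
for (dof_id_type i = 0; i < n; ++i)
  sum += result[i];
```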

Listing 1: The Moose::Kokkos::Array source code.


#pragma once

#ifdef MOOSE_KOKKOS_SCOPE
#include "KokkosHeader.h"
#endif

#include "Conversion.h"
#include "DataIO.h"

#define usingKokkosArrayBaseMembers(T, dimension)                                                  \
private:                                                                                           \
  using ArrayBase<T, dimension>::_n;                                                               \
  using ArrayBase<T, dimension>::_s;                                                               \
  using ArrayBase<T, dimension>::_d;                                                               \
                                                                                                   \
public:                                                                                            \
  using ArrayBase<T, dimension>::create;                                                           \
  using ArrayBase<T, dimension>::createHost;                                                       \
  using ArrayBase<T, dimension>::createDevice;                                                     \
  using ArrayBase<T, dimension>::offset;                                                           \
  using ArrayBase<T, dimension>::operator=

namespace Moose
{
namespace Kokkos
{

// This function simply calls ::Kokkos::kokkos_free, but it is separately defined in KokkosArray.K
// because the Kokkos function cannot be directly seen by the host compiler
void free(void * ptr);

/**
 * The enumerator that dictates the memory copy direction
 */
enum class MemcpyKind
{
  HOST_TO_HOST,
  HOST_TO_DEVICE,
  DEVICE_TO_HOST,
  DEVICE_TO_DEVICE
};

/**
 * The Kokkos array class
 */
template <typename T, unsigned int dimension = 1>
class Array;

/**
 * The type trait that determines the default behavior of copy constructor and deepCopy()
 * If this type trait is set to true, the copy constructor will call deepCopy(),
 * and the deepCopy() method will copy-construct each entry.
 * If this type trait is set to false, the copy constructor will call shallowCopy(),
 * and the deepCopy() method will do a memory copy.
 */
///@{
template <typename T>
struct ArrayDeepCopy
{
  static const bool value = false;
};

template <typename T, unsigned int dimension>
struct ArrayDeepCopy<Array<T, dimension>>
{
  static const bool value = ArrayDeepCopy<T>::value;
};
///@}

/**
 * The base class for Kokkos arrays
 */
template <typename T, unsigned int dimension>
class ArrayBase
{
public:
  /**
   * Default constructor
   */
  ArrayBase() = default;

  /**
   * Copy constructor
   */
  ArrayBase(const ArrayBase<T, dimension> & array)
  {
#ifndef MOOSE_KOKKOS_SCOPE
    static_assert(!ArrayDeepCopy<T>::value,
                  "Kokkos array cannot be deep copied outside the Kokkos compilation scope");
#endif

    if constexpr (ArrayDeepCopy<T>::value)
      deepCopy(array);
    else
      shallowCopy(array);
  }

  /**
   * Destructor
   */
  ~ArrayBase() { destroy(); }

  /**
   * Free all data and reset
   */
  void destroy();

  /**
   * Shallow copy another Kokkos array
   * @param array The Kokkos array to be shallow copied
   */
  void shallowCopy(const ArrayBase<T, dimension> & array);

  /**
   * Get the reference count
   * @returns The reference count
   */
  unsigned int useCount() const { return _counter.use_count(); }

#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * Get whether the array was allocated either on host or device
   * @returns Whether the array was allocated either on host or device
   */
  KOKKOS_FUNCTION bool isAlloc() const { return _is_host_alloc || _is_device_alloc; }
  /**
   * Get whether the array was allocated on host
   * @returns Whether the array was allocated on host
   */
  KOKKOS_FUNCTION bool isHostAlloc() const { return _is_host_alloc; }
  /**
   * Get whether the array was allocated on device
   * @returns Whether the array was allocated on device
   */
  KOKKOS_FUNCTION bool isDeviceAlloc() const { return _is_device_alloc; }
  /**
   * Get whether the host array was aliased
   * @returns Whether the host array was aliased
   */
  KOKKOS_FUNCTION bool isHostAlias() const { return _is_host_alias; }
  /**
   * Get whether the device array was aliased
   * @returns Whether the device array was aliased
   */
  KOKKOS_FUNCTION bool isDeviceAlias() const { return _is_device_alias; }
  /**
   * Get the total array size
   * @returns The total array size
   */
  KOKKOS_FUNCTION dof_id_type size() const { return _size; }
  /**
   * Get the size of a dimension
   * @param dim The dimension index
   * @returns The array size of the dimension
   */
  KOKKOS_FUNCTION dof_id_type n(unsigned int dim) const { return _n[dim]; }
  /**
   * Get the data pointer
   * @returns The pointer to the underlying data depending on the architecture this function is
   * being called on
   */
  KOKKOS_FUNCTION T * data() const
  {
    KOKKOS_IF_ON_HOST(return _host_data;)

    return _device_data;
  }
  /**
   * Get the first element
   * @returns The reference of the first element depending on the architecture this function is
   * being called on
   */
  KOKKOS_FUNCTION T & first() const
  {
    KOKKOS_IF_ON_HOST(return _host_data[0];)

    return _device_data[0];
  }
  /**
   * Get the last element
   * @returns The reference of the last element depending on the architecture this function is being
   * called on
   */
  KOKKOS_FUNCTION T & last() const
  {
    KOKKOS_IF_ON_HOST(return _host_data[_size - 1];)

    return _device_data[_size - 1];
  }
  /**
   * Get an array entry
   * @param i The dimensionless index
   * @returns The reference of the entry depending on the architecture this function is being called
   * on
   */
  KOKKOS_FUNCTION T & operator[](dof_id_type i) const
  {
    KOKKOS_ASSERT(i < _size);

    KOKKOS_IF_ON_HOST(return _host_data[i];)

    return _device_data[i];
  }

  /**
   * Get the host data pointer
   * @returns The pointer to the underlying host data
   */
  T * hostData() const { return _host_data; }
  /**
   * Get the device data pointer
   * @returns The pointer to the underlying device data
   */
  T * deviceData() const { return _device_data; }
  /**
   * Allocate array on host and device
   * @param n The vector containing the size of each dimension
   */
  void create(const std::vector<dof_id_type> & n) { createInternal<true, true>(n); }
  /**
   * Allocate array on host only
   * @param n The vector containing the size of each dimension
   */
  void createHost(const std::vector<dof_id_type> & n) { createInternal<true, false>(n); }
  /**
   * Allocate array on device only
   * @param n The vector containing the size of each dimension
   */
  void createDevice(const std::vector<dof_id_type> & n) { createInternal<false, true>(n); }
  /**
   * Point the host data to an external data instead of allocating it
   * @param ptr The pointer to the external host data
   */
  void aliasHost(T * ptr);
  /**
   * Point the device data to an external data instead of allocating it
   * @param ptr The pointer to the external device data
   */
  void aliasDevice(T * ptr);
  /**
   * Apply starting index offsets to each dimension
   * @param d The vector containing the offset of each dimension
   */
  void offset(const std::vector<dof_id_signed_type> & d);
  /**
   * Copy data from host to device
   */
  void copyToDevice();
  /**
   * Copy data from device to host
   */
  void copyToHost();
  /**
   * Copy data from an external data to this array
   * @param ptr The pointer to the external data
   * @param dir The copy direction
   * @param n The number of entries to copy
   * @param offset The starting offset of this array
   */
  void copyIn(const T * ptr, MemcpyKind dir, dof_id_type n, dof_id_type offset = 0);
  /**
   * Copy data to an external data from this array
   * @param ptr The pointer to the external data
   * @param dir The copy direction
   * @param n The number of entries to copy
   * @param offset The starting offset of this array
   */
  void copyOut(T * ptr, MemcpyKind dir, dof_id_type n, dof_id_type offset = 0);
  /**
   * Copy all the nested Kokkos arrays including self from host to device
   */
  void copyToDeviceNested();
  /**
   * Deep copy another Kokkos array
   * If ArrayDeepCopy<T>::value is true, it will copy-construct each entry
   * If ArrayDeepCopy<T>::value is false, it will do a memory copy
   * @param array The Kokkos array to be deep copied
   */
  void deepCopy(const ArrayBase<T, dimension> & array);
  /**
   * Swap with another Kokkos array
   * @param array The Kokkos array to be swapped
   */
  void swap(ArrayBase<T, dimension> & array);

  /**
   * Assign a scalar value uniformly
   * @param scalar The scalar value to be assigned
   */
  auto & operator=(const T & scalar);

  /**
   * Array iterator
   */
  class iterator
  {
  public:
    KOKKOS_FUNCTION iterator(T * it) : it(it) {}
    KOKKOS_FUNCTION bool operator==(const iterator & other) const { return it == other.it; }
    KOKKOS_FUNCTION bool operator!=(const iterator & other) const { return it != other.it; }
    KOKKOS_FUNCTION T & operator*() const { return *it; }
    KOKKOS_FUNCTION T * operator&() const { return it; }
    KOKKOS_FUNCTION iterator & operator++()
    {
      ++it;
      return *this;
    }
    KOKKOS_FUNCTION iterator operator++(int)
    {
      iterator pre = *this;
      ++it;
      return pre;
    }

  private:
    T * it;
  };

  /**
   * Get the beginning iterator
   * @returns The beginning iterator
   */
  KOKKOS_FUNCTION iterator begin() const
  {
    KOKKOS_IF_ON_HOST(return iterator(_host_data);)

    return iterator(_device_data);
  }
  /**
   * Get the end iterator
   * @returns The end iterator
   */
  KOKKOS_FUNCTION iterator end() const
  {
    KOKKOS_IF_ON_HOST(return iterator(_host_data + _size);)

    return iterator(_device_data + _size);
  }
#endif

protected:
  /**
   * Size of each dimension
   */
  dof_id_type _n[dimension] = {0};
  /**
   * Stride of each dimension
   */
  dof_id_type _s[dimension] = {0};
  /**
   * Offset of each dimension
   */
  dof_id_signed_type _d[dimension] = {0};

#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * The internal method to initialize and allocate this array
   * @tparam host Whether host data will be allocated
   * @tparam device Whether device data will be allocated
   * @param n The vector containing the size of each dimension
   */
  template <bool host, bool device>
  void createInternal(const std::vector<dof_id_type> & n);
  /**
   * The internal method to initialize and allocate this array
   * @param n The vector containing the size of each dimension
   * @param host The flag whether host data will be allocated
   * @param device The flag whether device data will be allocated
   */
  void createInternal(const std::vector<dof_id_type> & n, bool host, bool device);
  /**
   * The internal method to perform a memory copy
   * @tparam TargetSpace The Kokkos memory space of target data
 * @tparam SourceSpace The Kokkos memory space of source data
   * @param target The pointer to the target data
   * @param source The pointer to the source data
   * @param n The number of entries to copy
   */
  template <typename TargetSpace, typename SourceSpace>
  void copyInternal(T * target, const T * source, dof_id_type n);
#endif

private:
#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * Allocate host data for an initialized array that has not allocated host data
   */
  void allocHost();
  /**
   * Allocate device data for an initialized array that has not allocated device data
   */
  void allocDevice();
#endif

  /**
   * Reference counter
   */
  std::shared_ptr<unsigned int> _counter;
  /**
   * Flag whether array was initialized
   */
  bool _is_init = false;
  /**
   * Flag whether host data was allocated
   */
  bool _is_host_alloc = false;
  /**
   * Flag whether device data was allocated
   */
  bool _is_device_alloc = false;
  /**
   * Flag whether the host data points to an external data
   */
  bool _is_host_alias = false;
  /**
   * Flag whether the device data points to an external data
   */
  bool _is_device_alias = false;
  /**
   * Host data
   */
  T * _host_data = nullptr;
  /**
   * Device data
   */
  T * _device_data = nullptr;
  /**
   * Total size
   */
  dof_id_type _size = 0;
};

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::destroy()
{
  if (!_counter)
    return;

  if (_counter.use_count() > 1)
  {
    _host_data = nullptr;
    _device_data = nullptr;
  }
  else if (_counter.use_count() == 1)
  {
    if (_is_host_alloc && !_is_host_alias)
    {
      if constexpr (std::is_default_constructible<T>::value)
        // Allocated by new
        delete[] _host_data;
      else
      {
        // Allocated by malloc
        for (dof_id_type i = 0; i < _size; ++i)
          _host_data[i].~T();

        std::free(_host_data);
      }
    }

    if (_is_device_alloc && !_is_device_alias)
      Moose::Kokkos::free(_device_data);
  }

  _size = 0;

  for (unsigned int i = 0; i < dimension; ++i)
  {
    _n[i] = 0;
    _s[i] = 0;
    _d[i] = 0;
  }

  _is_init = false;
  _is_host_alloc = false;
  _is_device_alloc = false;
  _is_host_alias = false;
  _is_device_alias = false;

  _counter.reset();
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::shallowCopy(const ArrayBase<T, dimension> & array)
{
  destroy();

  _counter = array._counter;

  _size = array._size;

  for (unsigned int i = 0; i < dimension; ++i)
  {
    _n[i] = array._n[i];
    _s[i] = array._s[i];
    _d[i] = array._d[i];
  }

  _is_init = array._is_init;
  _is_host_alloc = array._is_host_alloc;
  _is_device_alloc = array._is_device_alloc;
  _is_host_alias = array._is_host_alias;
  _is_device_alias = array._is_device_alias;

  _host_data = array._host_data;
  _device_data = array._device_data;
}

#ifdef MOOSE_KOKKOS_SCOPE
template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::aliasHost(T * ptr)
{
  if (!_is_init)
    mooseError("Kokkos array error: attempted to alias host data before array initialization.");

  if (_is_host_alloc && !_is_host_alias)
    mooseError("Kokkos array error: cannot alias host data because host data was already allocated.");

  _host_data = ptr;
  _is_host_alloc = true;
  _is_host_alias = true;
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::aliasDevice(T * ptr)
{
  if (!_is_init)
    mooseError("Kokkos array error: attempted to alias device data before array initialization.");

  if (_is_device_alloc && !_is_device_alias)
    mooseError(
        "Kokkos array error: cannot alias device data because device data was already allocated.");

  _device_data = ptr;
  _is_device_alloc = true;
  _is_device_alias = true;
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::allocHost()
{
  if (_is_host_alloc)
    return;

  if constexpr (std::is_default_constructible<T>::value)
    _host_data = new T[_size];
  else
    _host_data = static_cast<T *>(std::malloc(_size * sizeof(T)));

  _is_host_alloc = true;
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::allocDevice()
{
  if (_is_device_alloc)
    return;

  _device_data =
      static_cast<T *>(::Kokkos::kokkos_malloc<ExecSpace::memory_space>(_size * sizeof(T)));

  _is_device_alloc = true;
}

template <typename T, unsigned int dimension>
template <bool host, bool device>
void
ArrayBase<T, dimension>::createInternal(const std::vector<dof_id_type> & n)
{
  if (n.size() != dimension)
    mooseError("Kokkos array error: the number of dimensions provided (",
               n.size(),
               ") must match the array dimension (",
               dimension,
               ").");

  if (_counter)
    destroy();

  _counter = std::make_shared<unsigned int>();

  _size = 1;
  _s[0] = 1;

  for (const auto i : make_range(dimension))
  {
    _n[i] = n[i];
    _size *= n[i];

    if (i)
      _s[i] = _s[i - 1] * _n[i - 1];
  }

  if constexpr (host)
    allocHost();

  if constexpr (device)
    allocDevice();

  _is_init = true;
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::createInternal(const std::vector<dof_id_type> & n, bool host, bool device)
{
  if (host && device)
    createInternal<true, true>(n);
  else if (host && !device)
    createInternal<true, false>(n);
  else if (!host && device)
    createInternal<false, true>(n);
  else
    createInternal<false, false>(n);
}

template <typename T, unsigned int dimension>
template <typename TargetSpace, typename SourceSpace>
void
ArrayBase<T, dimension>::copyInternal(T * target, const T * source, dof_id_type n)
{
  ::Kokkos::Impl::DeepCopy<TargetSpace, SourceSpace>(target, source, n * sizeof(T));
  ::Kokkos::fence();
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::offset(const std::vector<dof_id_signed_type> & d)
{
  if (d.size() > dimension)
    mooseError("Kokkos array error: the number of offsets provided (",
               d.size(),
               ") cannot be larger than the array dimension (",
               dimension,
               ").");

  for (const auto i : index_range(d))
    _d[i] = d[i];
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::copyToDevice()
{
  // If host side memory is not allocated, do nothing
  if (!_is_host_alloc)
    return;

  // If device side memory is not allocated,
  if (!_is_device_alloc)
  {
    if (_counter.use_count() == 1)
      // allocate memory if this array is not shared with other arrays
      allocDevice();
    else
      // print error if this array is shared with other arrays
      mooseError("Kokkos array error: cannot copy from host to device because device side memory "
                 "was not allocated and array is being shared with other arrays.");
  }

  // Copy from host to device
  copyInternal<MemSpace, ::Kokkos::HostSpace>(_device_data, _host_data, _size);
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::copyToHost()
{
  // If device side memory is not allocated, do nothing
  if (!_is_device_alloc)
    return;

  // If host side memory is not allocated,
  if (!_is_host_alloc)
  {
    if (_counter.use_count() == 1)
      // allocate memory if this array is not shared with other arrays
      allocHost();
    else
      // print error if this array is shared with other arrays
      mooseError("Kokkos array error: cannot copy from device to host because host side memory "
                 "was not allocated and array is being shared with other arrays.");
  }

  // Copy from device to host
  copyInternal<::Kokkos::HostSpace, MemSpace>(_host_data, _device_data, _size);
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::copyIn(const T * ptr, MemcpyKind dir, dof_id_type n, dof_id_type offset)
{
  if (n > _size)
    mooseError("Kokkos array error: cannot copy in data larger than the array size.");

  if (offset > _size)
    mooseError("Kokkos array error: offset cannot be larger than the array size.");

  if (dir == MemcpyKind::HOST_TO_HOST)
  {
    // If host side memory is not allocated, do nothing
    if (!_is_host_alloc)
      return;

    // Copy from host to host
    copyInternal<::Kokkos::HostSpace, ::Kokkos::HostSpace>(_host_data + offset, ptr, n);
  }
  else if (dir == MemcpyKind::HOST_TO_DEVICE)
  {
    // If device side memory is not allocated, do nothing
    if (!_is_device_alloc)
      return;

    // Copy from host to device
    copyInternal<MemSpace, ::Kokkos::HostSpace>(_device_data + offset, ptr, n);
  }
  else if (dir == MemcpyKind::DEVICE_TO_HOST)
  {
    // If host side memory is not allocated, do nothing
    if (!_is_host_alloc)
      return;

    // Copy from device to host
    copyInternal<::Kokkos::HostSpace, MemSpace>(_host_data + offset, ptr, n);
  }
  else if (dir == MemcpyKind::DEVICE_TO_DEVICE)
  {
    // If device side memory is not allocated, do nothing
    if (!_is_device_alloc)
      return;

    // Copy from device to device
    copyInternal<MemSpace, MemSpace>(_device_data + offset, ptr, n);
  }
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::copyOut(T * ptr, MemcpyKind dir, dof_id_type n, dof_id_type offset)
{
  if (n > _size)
    mooseError("Kokkos array error: cannot copy out data larger than the array size.");

  if (offset > _size)
    mooseError("Kokkos array error: offset cannot be larger than the array size.");

  if (dir == MemcpyKind::HOST_TO_HOST)
  {
    // If host side memory is not allocated, do nothing
    if (!_is_host_alloc)
      return;

    // Copy from host to host
    copyInternal<::Kokkos::HostSpace, ::Kokkos::HostSpace>(ptr, _host_data + offset, n);
  }
  else if (dir == MemcpyKind::HOST_TO_DEVICE)
  {
    // If host side memory is not allocated, do nothing
    if (!_is_host_alloc)
      return;

    // Copy from host to device
    copyInternal<MemSpace, ::Kokkos::HostSpace>(ptr, _host_data + offset, n);
  }
  else if (dir == MemcpyKind::DEVICE_TO_HOST)
  {
    // If device side memory is not allocated, do nothing
    if (!_is_device_alloc)
      return;

    // Copy from device to host
    copyInternal<::Kokkos::HostSpace, MemSpace>(ptr, _device_data + offset, n);
  }
  else if (dir == MemcpyKind::DEVICE_TO_DEVICE)
  {
    // If device side memory is not allocated, do nothing
    if (!_is_device_alloc)
      return;

    // Copy from device to device
    copyInternal<MemSpace, MemSpace>(ptr, _device_data + offset, n);
  }
}

template <typename T>
void
copyToDeviceInner(T & /* data */)
{
}

template <typename T, unsigned int dimension>
void
copyToDeviceInner(Array<T, dimension> & data)
{
  data.copyToDeviceNested();
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::copyToDeviceNested()
{
  for (dof_id_type i = 0; i < _size; ++i)
    copyToDeviceInner(_host_data[i]);

  copyToDevice();
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::deepCopy(const ArrayBase<T, dimension> & array)
{
  if (ArrayDeepCopy<T>::value && !array._is_host_alloc)
    mooseError(
        "Kokkos array error: cannot deep copy using constructor from array without host data.");

  std::vector<dof_id_type> n(std::begin(array._n), std::end(array._n));

  createInternal(n, array._is_host_alloc, array._is_device_alloc);

  if constexpr (ArrayDeepCopy<T>::value)
  {
    for (dof_id_type i = 0; i < _size; ++i)
      new (_host_data + i) T(array._host_data[i]);

    copyToDevice();
  }
  else
  {
    if (_is_host_alloc)
      std::memcpy(_host_data, array._host_data, _size * sizeof(T));

    if (_is_device_alloc)
      copyInternal<MemSpace, MemSpace>(_device_data, array._device_data, _size);
  }

  for (unsigned int i = 0; i < dimension; ++i)
  {
    _d[i] = array._d[i];
    _s[i] = array._s[i];
  }
}

template <typename T, unsigned int dimension>
void
ArrayBase<T, dimension>::swap(ArrayBase<T, dimension> & array)
{
  ArrayBase<T, dimension> clone;

  clone.shallowCopy(*this);
  this->shallowCopy(array);
  array.shallowCopy(clone);
}

template <typename T, unsigned int dimension>
auto &
ArrayBase<T, dimension>::operator=(const T & scalar)
{
  if (_is_host_alloc)
    std::fill_n(_host_data, _size, scalar);

  if (_is_device_alloc)
  {
    ::Kokkos::View<T *, MemSpace, ::Kokkos::MemoryTraits<::Kokkos::Unmanaged>> data(_device_data,
                                                                                    _size);
    ::Kokkos::Experimental::fill_n(ExecSpace(), data, _size, scalar);
  }

  return *this;
}

template <typename T, unsigned int dimension>
void
dataStore(std::ostream & stream, Array<T, dimension> & array, void * context)
{
  using ::dataStore;

  bool is_alloc = array.isAlloc();
  dataStore(stream, is_alloc, nullptr);

  if (!is_alloc)
    return;

  std::string type = typeid(T).name();
  dataStore(stream, type, nullptr);

  unsigned int dim = dimension;
  dataStore(stream, dim, nullptr);

  for (unsigned int dim = 0; dim < dimension; ++dim)
  {
    auto n = array.n(dim);
    dataStore(stream, n, nullptr);
  }

  if (array.isDeviceAlloc())
  {
    // We use malloc/free because we just want a memory copy
    // If T is a Kokkos array and we use new/delete or vector to copy it out,
    // the arrays will be destroyed on cleanup

    T * data = static_cast<T *>(std::malloc(array.size() * sizeof(T)));

    array.copyOut(data, MemcpyKind::DEVICE_TO_HOST, array.size());

    for (dof_id_type i = 0; i < array.size(); ++i)
      dataStore(stream, data[i], context);

    std::free(data);
  }
  else
    for (auto & value : array)
      dataStore(stream, value, context);
}

template <typename T, unsigned int dimension>
void
dataLoad(std::istream & stream, Array<T, dimension> & array, void * context)
{
  using ::dataLoad;

  bool is_alloc;
  dataLoad(stream, is_alloc, nullptr);

  if (!is_alloc)
    return;

  std::string from_type_name;
  dataLoad(stream, from_type_name, nullptr);

  if (from_type_name != typeid(T).name())
    mooseError("Kokkos array error: cannot load an array because the stored array is of type '",
               MooseUtils::prettyCppType(libMesh::demangle(from_type_name.c_str())),
               "' but the loading array is of type '",
               MooseUtils::prettyCppType(libMesh::demangle(typeid(T).name())),
               "'.");

  unsigned int from_dimension;
  dataLoad(stream, from_dimension, nullptr);

  if (from_dimension != dimension)
    mooseError("Kokkos array error: cannot load an array because the stored array is ",
               from_dimension,
               "D but the loading array is ",
               dimension,
               "D.");

  std::vector<dof_id_type> from_n(dimension);
  std::vector<dof_id_type> n(dimension);

  for (unsigned int dim = 0; dim < dimension; ++dim)
  {
    dataLoad(stream, from_n[dim], nullptr);
    n[dim] = array.n(dim);
  }

  if (from_n != n)
    mooseError("Kokkos array error: cannot load an array because the stored array has dimensions (",
               Moose::stringify(from_n),
               ") but the loading array has dimensions (",
               Moose::stringify(n),
               ").");

  if (array.isHostAlloc())
  {
    for (auto & value : array)
      dataLoad(stream, value, context);

    if (array.isDeviceAlloc())
      array.copyToDevice();
  }
  else
  {
    std::vector<T> data(array.size());

    for (auto & value : data)
      dataLoad(stream, value, context);

    array.copyIn(data.data(), MemcpyKind::HOST_TO_DEVICE, array.size());
  }
}
#endif

/**
 * The specialization of the Kokkos array class for each dimension.
 * All array data that needs to be accessed on device in Kokkos objects should use this class.
 * If the array is populated on host and is to be accessed on device, make sure to call
 * copyToDevice() after populating data. For a nested Kokkos array, either copyToDeviceNested()
 * should be called for the outermost array or copyToDevice() should be called for each instance of
 * Kokkos array from the innermost to the outermost. Do not store this object as reference in your
 * Kokkos object if it is used on device, because the reference refers to a host object and
 * therefore is not accessible on device. If storing it as a reference is required, see
 * ReferenceWrapper.
 */
///@{
template <typename T>
class Array<T, 1> : public ArrayBase<T, 1>
{
#ifdef MOOSE_KOKKOS_SCOPE
  usingKokkosArrayBaseMembers(T, 1);
#endif

public:
  /**
   * Default constructor
   */
  Array() = default;
  /**
   * Copy constructor
   */
  Array(const Array<T, 1> & array) : ArrayBase<T, 1>(array) {}
  /**
   * Shallow copy another Kokkos array
   * @param array The Kokkos array to be shallow copied
   */
  auto & operator=(const Array<T, 1> & array)
  {
    this->shallowCopy(array);

    return *this;
  }

#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * Constructor
   * Initialize and allocate array with given dimensions
   * This allocates both host and device data
   * @param n0 The first dimension size
   */
  Array(dof_id_type n0) { create(n0); }
  /**
   * Constructor
   * Initialize and allocate array by copying a standard vector variable
   * This allocates and copies to both host and device data
   * @param vector The standard vector variable to copy
   */
  Array(const std::vector<T> & vector) { *this = vector; }

  /**
   * Initialize array with given dimensions but do not allocate
   * @param n0 The first dimension size
   */
  void init(dof_id_type n0) { this->template createInternal<false, false>({n0}); }
  /**
   * Initialize and allocate array with given dimensions on host and device
   * @param n0 The first dimension size
   */
  void create(dof_id_type n0) { this->template createInternal<true, true>({n0}); }
  /**
   * Initialize and allocate array with given dimensions on host only
   * @param n0 The first dimension size
   */
  void createHost(dof_id_type n0) { this->template createInternal<true, false>({n0}); }
  /**
   * Initialize and allocate array with given dimensions on device only
   * @param n0 The first dimension size
   */
  void createDevice(dof_id_type n0) { this->template createInternal<false, true>({n0}); }
  /**
   * Set starting index offsets
   * @param d0 The first dimension offset
   */
  void offset(dof_id_signed_type d0) { ArrayBase<T, 1>::offset({d0}); }

  /**
   * Copy a standard vector variable
   * This re-initializes and re-allocates array with the size of the vector
   * @tparam host Whether to allocate and copy to the host data
   * @tparam device Whether to allocate and copy to the device data
   * @param vector The standard vector variable to copy
   */
  template <bool host, bool device>
  void copyVector(const std::vector<T> & vector)
  {
    this->template createInternal<host, device>({static_cast<dof_id_type>(vector.size())});

    if (host)
      std::memcpy(this->hostData(), vector.data(), this->size() * sizeof(T));

    if (device)
      this->template copyInternal<MemSpace, ::Kokkos::HostSpace>(
          this->deviceData(), vector.data(), this->size());
  }
  /**
   * Copy a standard set variable
   * This re-initializes and re-allocates array with the size of the set
   * @tparam host Whether to allocate and copy to the host data
   * @tparam device Whether to allocate and copy to the device data
   * @param set The standard set variable to copy
   */
  template <bool host, bool device>
  void copySet(const std::set<T> & set)
  {
    std::vector<T> vector(set.begin(), set.end());

    copyVector<host, device>(vector);
  }

  /**
   * Copy a standard vector variable
   * This allocates and copies to both host and device data
   * @param vector The standard vector variable to copy
   */
  auto & operator=(const std::vector<T> & vector)
  {
    copyVector<true, true>(vector);

    return *this;
  }
  /**
   * Copy a standard set variable
   * This allocates and copies to both host and device data
   * @param set The standard set variable to copy
   */
  auto & operator=(const std::set<T> & set)
  {
    copySet<true, true>(set);

    return *this;
  }

  /**
   * Get an array entry
   * @param i0 The first dimension index
   * @returns The reference of the entry depending on the architecture this function is being
   * called on
   */
  KOKKOS_FUNCTION T & operator()(dof_id_signed_type i0) const
  {
    KOKKOS_ASSERT(i0 - _d[0] >= 0 && static_cast<dof_id_type>(i0 - _d[0]) < _n[0]);

    return this->operator[](i0 - _d[0]);
  }
#endif
};

template <typename T>
class Array<T, 2> : public ArrayBase<T, 2>
{
#ifdef MOOSE_KOKKOS_SCOPE
  usingKokkosArrayBaseMembers(T, 2);
#endif

public:
  /**
   * Default constructor
   */
  Array() = default;
  /**
   * Copy constructor
   */
  Array(const Array<T, 2> & array) : ArrayBase<T, 2>(array) {}
  /**
   * Shallow copy another Kokkos array
   * @param array The Kokkos array to be shallow copied
   */
  auto & operator=(const Array<T, 2> & array)
  {
    this->shallowCopy(array);

    return *this;
  }

#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * Constructor
   * Initialize and allocate array with given dimensions
   * This allocates both host and device data
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   */
  Array(dof_id_type n0, dof_id_type n1) { create(n0, n1); }

  /**
   * Initialize array with given dimensions but do not allocate
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   */
  void init(dof_id_type n0, dof_id_type n1)
  {
    this->template createInternal<false, false>({n0, n1});
  }
  /**
   * Initialize and allocate array with given dimensions on host and device
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   */
  void create(dof_id_type n0, dof_id_type n1)
  {
    this->template createInternal<true, true>({n0, n1});
  }
  /**
   * Initialize and allocate array with given dimensions on host only
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   */
  void createHost(dof_id_type n0, dof_id_type n1)
  {
    this->template createInternal<true, false>({n0, n1});
  }
  /**
   * Initialize and allocate array with given dimensions on device only
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   */
  void createDevice(dof_id_type n0, dof_id_type n1)
  {
    this->template createInternal<false, true>({n0, n1});
  }
  /**
   * Set starting index offsets
   * @param d0 The first dimension offset
   * @param d1 The second dimension offset
   */
  void offset(dof_id_signed_type d0, dof_id_signed_type d1) { ArrayBase<T, 2>::offset({d0, d1}); }

  /**
   * Get an array entry
   * @param i0 The first dimension index
   * @param i1 The second dimension index
   * @returns The reference of the entry depending on the architecture this function is being called
   * on
   */
  KOKKOS_FUNCTION T & operator()(dof_id_signed_type i0, dof_id_signed_type i1) const
  {
    KOKKOS_ASSERT(i0 - _d[0] >= 0 && static_cast<dof_id_type>(i0 - _d[0]) < _n[0]);
    KOKKOS_ASSERT(i1 - _d[1] >= 0 && static_cast<dof_id_type>(i1 - _d[1]) < _n[1]);

    return this->operator[](i0 - _d[0] + (i1 - _d[1]) * _s[1]);
  }
#endif
};

template <typename T>
class Array<T, 3> : public ArrayBase<T, 3>
{
#ifdef MOOSE_KOKKOS_SCOPE
  usingKokkosArrayBaseMembers(T, 3);
#endif

public:
  /**
   * Default constructor
   */
  Array() = default;
  /**
   * Copy constructor
   */
  Array(const Array<T, 3> & array) : ArrayBase<T, 3>(array) {}
  /**
   * Shallow copy another Kokkos array
   * @param array The Kokkos array to be shallow copied
   */
  auto & operator=(const Array<T, 3> & array)
  {
    this->shallowCopy(array);

    return *this;
  }

#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * Constructor
   * Initialize and allocate array with given dimensions
   * This allocates both host and device data
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   */
  Array(dof_id_type n0, dof_id_type n1, dof_id_type n2) { create(n0, n1, n2); }

  /**
   * Initialize array with given dimensions but do not allocate
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   */
  void init(dof_id_type n0, dof_id_type n1, dof_id_type n2)
  {
    this->template createInternal<false, false>({n0, n1, n2});
  }
  /**
   * Initialize and allocate array with given dimensions on host and device
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   */
  void create(dof_id_type n0, dof_id_type n1, dof_id_type n2)
  {
    this->template createInternal<true, true>({n0, n1, n2});
  }
  /**
   * Initialize and allocate array with given dimensions on host only
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   */
  void createHost(dof_id_type n0, dof_id_type n1, dof_id_type n2)
  {
    this->template createInternal<true, false>({n0, n1, n2});
  }
  /**
   * Initialize and allocate array with given dimensions on device only
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   */
  void createDevice(dof_id_type n0, dof_id_type n1, dof_id_type n2)
  {
    this->template createInternal<false, true>({n0, n1, n2});
  }
  /**
   * Set starting index offsets
   * @param d0 The first dimension offset
   * @param d1 The second dimension offset
   * @param d2 The third dimension offset
   */
  void offset(dof_id_signed_type d0, dof_id_signed_type d1, dof_id_signed_type d2)
  {
    ArrayBase<T, 3>::offset({d0, d1, d2});
  }

  /**
   * Get an array entry
   * @param i0 The first dimension index
   * @param i1 The second dimension index
   * @param i2 The third dimension index
   * @returns The reference of the entry depending on the architecture this function is being
   * called on
   */
  KOKKOS_FUNCTION T &
  operator()(dof_id_signed_type i0, dof_id_signed_type i1, dof_id_signed_type i2) const
  {
    KOKKOS_ASSERT(i0 - _d[0] >= 0 && static_cast<dof_id_type>(i0 - _d[0]) < _n[0]);
    KOKKOS_ASSERT(i1 - _d[1] >= 0 && static_cast<dof_id_type>(i1 - _d[1]) < _n[1]);
    KOKKOS_ASSERT(i2 - _d[2] >= 0 && static_cast<dof_id_type>(i2 - _d[2]) < _n[2]);

    return this->operator[](i0 - _d[0] + (i1 - _d[1]) * _s[1] + (i2 - _d[2]) * _s[2]);
  }
#endif
};

template <typename T>
class Array<T, 4> : public ArrayBase<T, 4>
{
#ifdef MOOSE_KOKKOS_SCOPE
  usingKokkosArrayBaseMembers(T, 4);
#endif

public:
  /**
   * Default constructor
   */
  Array() = default;
  /**
   * Copy constructor
   */
  Array(const Array<T, 4> & array) : ArrayBase<T, 4>(array) {}
  /**
   * Shallow copy another Kokkos array
   * @param array The Kokkos array to be shallow copied
   */
  auto & operator=(const Array<T, 4> & array)
  {
    this->shallowCopy(array);

    return *this;
  }

#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * Constructor
   * Initialize and allocate array with given dimensions
   * This allocates both host and device data
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   * @param n3 The fourth dimension size
   */
  Array(dof_id_type n0, dof_id_type n1, dof_id_type n2, dof_id_type n3) { create(n0, n1, n2, n3); }

  /**
   * Initialize array with given dimensions but do not allocate
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   * @param n3 The fourth dimension size
   */
  void init(dof_id_type n0, dof_id_type n1, dof_id_type n2, dof_id_type n3)
  {
    this->template createInternal<false, false>({n0, n1, n2, n3});
  }
  /**
   * Initialize and allocate array with given dimensions on host and device
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   * @param n3 The fourth dimension size
   */
  void create(dof_id_type n0, dof_id_type n1, dof_id_type n2, dof_id_type n3)
  {
    this->template createInternal<true, true>({n0, n1, n2, n3});
  }
  /**
   * Initialize and allocate array with given dimensions on host only
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   * @param n3 The fourth dimension size
   */
  void createHost(dof_id_type n0, dof_id_type n1, dof_id_type n2, dof_id_type n3)
  {
    this->template createInternal<true, false>({n0, n1, n2, n3});
  }
  /**
   * Initialize and allocate array with given dimensions on device only
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   * @param n3 The fourth dimension size
   */
  void createDevice(dof_id_type n0, dof_id_type n1, dof_id_type n2, dof_id_type n3)
  {
    this->template createInternal<false, true>({n0, n1, n2, n3});
  }
  /**
   * Set starting index offsets
   * @param d0 The first dimension offset
   * @param d1 The second dimension offset
   * @param d2 The third dimension offset
   * @param d3 The fourth dimension offset
   */
  void
  offset(dof_id_signed_type d0, dof_id_signed_type d1, dof_id_signed_type d2, dof_id_signed_type d3)
  {
    ArrayBase<T, 4>::offset({d0, d1, d2, d3});
  }

  /**
   * Get an array entry
   * @param i0 The first dimension index
   * @param i1 The second dimension index
   * @param i2 The third dimension index
   * @param i3 The fourth dimension index
   * @returns The reference of the entry depending on the architecture this function is being called
   * on
   */
  KOKKOS_FUNCTION T & operator()(dof_id_signed_type i0,
                                 dof_id_signed_type i1,
                                 dof_id_signed_type i2,
                                 dof_id_signed_type i3) const
  {
    KOKKOS_ASSERT(i0 - _d[0] >= 0 && static_cast<dof_id_type>(i0 - _d[0]) < _n[0]);
    KOKKOS_ASSERT(i1 - _d[1] >= 0 && static_cast<dof_id_type>(i1 - _d[1]) < _n[1]);
    KOKKOS_ASSERT(i2 - _d[2] >= 0 && static_cast<dof_id_type>(i2 - _d[2]) < _n[2]);
    KOKKOS_ASSERT(i3 - _d[3] >= 0 && static_cast<dof_id_type>(i3 - _d[3]) < _n[3]);

    return this->operator[](i0 - _d[0] + (i1 - _d[1]) * _s[1] + (i2 - _d[2]) * _s[2] +
                            (i3 - _d[3]) * _s[3]);
  }
#endif
};

template <typename T>
class Array<T, 5> : public ArrayBase<T, 5>
{
#ifdef MOOSE_KOKKOS_SCOPE
  usingKokkosArrayBaseMembers(T, 5);
#endif

public:
  /**
   * Default constructor
   */
  Array() = default;
  /**
   * Copy constructor
   */
  Array(const Array<T, 5> & array) : ArrayBase<T, 5>(array) {}
  /**
   * Shallow copy another Kokkos array
   * @param array The Kokkos array to be shallow copied
   */
  auto & operator=(const Array<T, 5> & array)
  {
    this->shallowCopy(array);

    return *this;
  }

#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * Constructor
   * Initialize and allocate array with given dimensions
   * This allocates both host and device data
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   * @param n3 The fourth dimension size
   * @param n4 The fifth dimension size
   */
  Array(dof_id_type n0, dof_id_type n1, dof_id_type n2, dof_id_type n3, dof_id_type n4)
  {
    create(n0, n1, n2, n3, n4);
  }

  /**
   * Initialize array with given dimensions but do not allocate
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   * @param n3 The fourth dimension size
   * @param n4 The fifth dimension size
   */
  void init(dof_id_type n0, dof_id_type n1, dof_id_type n2, dof_id_type n3, dof_id_type n4)
  {
    this->template createInternal<false, false>({n0, n1, n2, n3, n4});
  }
  /**
   * Initialize and allocate array with given dimensions on host and device
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   * @param n3 The fourth dimension size
   * @param n4 The fifth dimension size
   */
  void create(dof_id_type n0, dof_id_type n1, dof_id_type n2, dof_id_type n3, dof_id_type n4)
  {
    this->template createInternal<true, true>({n0, n1, n2, n3, n4});
  }
  /**
   * Initialize and allocate array with given dimensions on host only
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   * @param n3 The fourth dimension size
   * @param n4 The fifth dimension size
   */
  void createHost(dof_id_type n0, dof_id_type n1, dof_id_type n2, dof_id_type n3, dof_id_type n4)
  {
    this->template createInternal<true, false>({n0, n1, n2, n3, n4});
  }
  /**
   * Initialize and allocate array with given dimensions on device only
   * @param n0 The first dimension size
   * @param n1 The second dimension size
   * @param n2 The third dimension size
   * @param n3 The fourth dimension size
   * @param n4 The fifth dimension size
   */
  void createDevice(dof_id_type n0, dof_id_type n1, dof_id_type n2, dof_id_type n3, dof_id_type n4)
  {
    this->template createInternal<false, true>({n0, n1, n2, n3, n4});
  }
  /**
   * Set starting index offsets
   * @param d0 The first dimension offset
   * @param d1 The second dimension offset
   * @param d2 The third dimension offset
   * @param d3 The fourth dimension offset
   * @param d4 The fifth dimension offset
   */
  void offset(dof_id_signed_type d0,
              dof_id_signed_type d1,
              dof_id_signed_type d2,
              dof_id_signed_type d3,
              dof_id_signed_type d4)
  {
    ArrayBase<T, 5>::offset({d0, d1, d2, d3, d4});
  }

  /**
   * Get an array entry
   * @param i0 The first dimension index
   * @param i1 The second dimension index
   * @param i2 The third dimension index
   * @param i3 The fourth dimension index
   * @param i4 The fifth dimension index
   * @returns The reference of the entry depending on the architecture this function is being called
   * on
   */
  KOKKOS_FUNCTION T & operator()(dof_id_signed_type i0,
                                 dof_id_signed_type i1,
                                 dof_id_signed_type i2,
                                 dof_id_signed_type i3,
                                 dof_id_signed_type i4) const
  {
    KOKKOS_ASSERT(i0 - _d[0] >= 0 && static_cast<dof_id_type>(i0 - _d[0]) < _n[0]);
    KOKKOS_ASSERT(i1 - _d[1] >= 0 && static_cast<dof_id_type>(i1 - _d[1]) < _n[1]);
    KOKKOS_ASSERT(i2 - _d[2] >= 0 && static_cast<dof_id_type>(i2 - _d[2]) < _n[2]);
    KOKKOS_ASSERT(i3 - _d[3] >= 0 && static_cast<dof_id_type>(i3 - _d[3]) < _n[3]);
    KOKKOS_ASSERT(i4 - _d[4] >= 0 && static_cast<dof_id_type>(i4 - _d[4]) < _n[4]);

    return this->operator[](i0 - _d[0] + (i1 - _d[1]) * _s[1] + (i2 - _d[2]) * _s[2] +
                            (i3 - _d[3]) * _s[3] + (i4 - _d[4]) * _s[4]);
  }
#endif
};
///@}

template <typename T>
using Array1D = Array<T, 1>;
template <typename T>
using Array2D = Array<T, 2>;
template <typename T>
using Array3D = Array<T, 3>;
template <typename T>
using Array4D = Array<T, 4>;
template <typename T>
using Array5D = Array<T, 5>;

} // namespace Kokkos
} // namespace Moose
(framework/include/kokkos/base/KokkosArray.h)

Moose::Kokkos::Map is designed for data for which index-based storage with Moose::Kokkos::Array is impractical. It contains a CPU std::map and two GPU arrays storing the keys and values, along with an offset array recording the starting location of each bucket. On GPU, it implements a hash table using the 32-bit FNV-1a hash algorithm. The map can only be populated on CPU, and upon calling copyToDevice(), the corresponding hash table is created on GPU. For convenience, it also provides copyToDeviceNested() for the case where the values are of type Moose::Kokkos::Array, which calls copyToDeviceNested() for each array entry.

The key/value lookup on GPU is index-based. find() returns the index of the value corresponding to the key, or Moose::Kokkos::Map::invalid_id if the key does not exist. If the index is valid, you can retrieve the value by passing the index to value(). If you know in advance that a key always exists, you can simply call operator[] without going through those steps. However, if operator[] is called for a key that does not exist, the calculation will crash.

Listing 2: The Moose::Kokkos::Map source code.


#pragma once

#include "KokkosArray.h"

#include <memory>
#include <map>

namespace Moose
{
namespace Kokkos
{

#ifdef MOOSE_KOKKOS_SCOPE
constexpr uint32_t FNV_PRIME = 0x01000193;        // 16777619
constexpr uint32_t FNV_OFFSET_BASIS = 0x811C9DC5; // 2166136261

template <typename T>
KOKKOS_FUNCTION uint32_t
fnv1aHash(const T & key, uint32_t hash)
{
  auto bytes = reinterpret_cast<const uint8_t *>(&key);

  for (size_t i = 0; i < sizeof(T); ++i)
  {
    hash ^= bytes[i];
    hash *= FNV_PRIME;
  }

  return hash;
}

template <typename T>
KOKKOS_FUNCTION uint32_t
fnv1aHash(const T & key)
{
  return fnv1aHash(key, FNV_OFFSET_BASIS);
}

template <typename T1, typename T2>
struct Pair;

template <typename T1, typename T2>
KOKKOS_FUNCTION uint32_t
fnv1aHash(const Pair<T1, T2> & key)
{
  return fnv1aHash(key.second, fnv1aHash(key.first));
}
#endif

template <typename T1, typename T2>
class Map;

template <typename T1, typename T2>
void dataStore(std::ostream & stream, Map<T1, T2> & map, void * context);
template <typename T1, typename T2>
void dataLoad(std::istream & stream, Map<T1, T2> & map, void * context);

/**
 * The Kokkos wrapper class for standard map.
 * The map can only be populated on host.
 * Make sure to call copyToDevice() or copyToDeviceNested() after populating the map on host.
 */
template <typename T1, typename T2>
class Map
{
public:
  /**
   * Get the beginning writeable iterator of the host map
   * @returns The beginning iterator
   */
  auto begin() { return get().begin(); }
  /**
   * Get the beginning const iterator of the host map
   * @returns The beginning iterator
   */
  auto begin() const { return get().cbegin(); }
  /**
   * Get the end writeable iterator of the host map
   * @returns The end iterator
   */
  auto end() { return get().end(); }
  /**
   * Get the end const iterator of the host map
   * @returns The end iterator
   */
  auto end() const { return get().cend(); }
  /**
   * Get the underlying writeable host map
   * @returns The writeable host map
   */
  auto & get()
  {
    if (!_map_host)
      _map_host = std::make_shared<std::map<T1, T2>>();

    return *_map_host;
  }
  /**
   * Get the underlying const host map
   * @returns The const host map
   */
  const auto & get() const
  {
    if (!_map_host)
      _map_host = std::make_shared<std::map<T1, T2>>();

    return *_map_host;
  }
  /**
   * Call host map's operator[]
   * @param key The key
   * @returns The writeable reference of the value
   */
  T2 & operator[](const T1 & key) { return get()[key]; }

#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * Copy host map to device
   */
  void copyToDevice();
  /**
   * Copy host map to device, perform nested copy for Kokkos arrays
   */
  void copyToDeviceNested();
  /**
   * Swap with another Kokkos map
   * @param map The Kokkos map to be swapped
   */
  void swap(Map<T1, T2> & map);

  /**
   * Get the size of map
   * @returns The size of map
   */
  KOKKOS_FUNCTION dof_id_type size() const
  {
    KOKKOS_IF_ON_HOST(return get().size();)

    return _keys.size();
  }
  /**
   * Find the index of a key
   * @param key The key
   * @returns The index of the key, invalid_id if the key does not exist
   */
  KOKKOS_FUNCTION dof_id_type find(const T1 & key) const;
  /**
   * Get whether a key exists
   * @param key The key
   * @returns Whether the key exists
   */
  KOKKOS_FUNCTION bool exists(const T1 & key) const { return find(key) != invalid_id; }
  /**
   * Get the key of an index
   * @param idx The index returned by find()
   * @returns The const reference of the key
   */
  KOKKOS_FUNCTION const T1 & key(dof_id_type idx) const
  {
    KOKKOS_ASSERT(idx != invalid_id);

    return _keys[idx];
  }
  /**
   * Get the value of an index
   * @param idx The index returned by find()
   * @returns The const reference of the value
   */
  KOKKOS_FUNCTION const T2 & value(dof_id_type idx) const
  {
    KOKKOS_ASSERT(idx != invalid_id);

    return _values[idx];
  }
  /**
   * Get the value corresponding to a key
   * @param key The key
   * @returns The const reference of the value
   */
  ///@{
  KOKKOS_FUNCTION const T2 & operator[](const T1 & key) const
  {
    KOKKOS_IF_ON_HOST(return get().at(key);)

    auto idx = find(key);

    KOKKOS_ASSERT(idx != invalid_id);

    return _values[idx];
  }
  // Due to a stupid NVCC compiler bug, one cannot do var[i][j] for a variable whose type is
  // Array<Map<...>> (the second operator[] seems to be interpreted as if it is the operator of
  // Array), while var[i](j) works. Until we figure out what is going on, one can use the following
  // operator as a workaround.
  KOKKOS_FUNCTION const T2 & operator()(const T1 & key) const { return operator[](key); }
  ///@}
#endif

  static const dof_id_type invalid_id = libMesh::DofObject::invalid_id;

private:
#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * Internal method to copy host map to device
   */
  void copy();
#endif

  /**
   * Standard map on host
   * Stored as a shared pointer to avoid deep copy
   */
  mutable std::shared_ptr<std::map<T1, T2>> _map_host;
  /**
   * Keys on device
   */
  Array<T1> _keys;
  /**
   * Values on device
   */
  Array<T2> _values;
  /**
   * Beginning offset into device arrays of each bucket
   */
  Array<dof_id_type> _offset;

  friend void dataStore<T1, T2>(std::ostream &, Map<T1, T2> &, void *);
  friend void dataLoad<T1, T2>(std::istream &, Map<T1, T2> &, void *);
};

#ifdef MOOSE_KOKKOS_SCOPE
template <typename T1, typename T2>
void
Map<T1, T2>::copy()
{
  _keys.create(size());
  _values.create(size());
  _offset.create(size() + 1);
  _offset = 0;

  for (const auto & [key, value] : get())
  {
    auto bucket = fnv1aHash(key) % size();

    _offset[bucket]++;
  }

  std::exclusive_scan(_offset.begin(), _offset.end(), _offset.begin(), 0);

  _offset.copyToDevice();

  std::vector<dof_id_type> idx(size(), 0);

  for (const auto & [key, value] : get())
  {
    auto bucket = fnv1aHash(key) % size();

    _keys[_offset[bucket] + idx[bucket]] = key;
    _values[_offset[bucket] + idx[bucket]] = value;
    idx[bucket]++;
  }
}

template <typename T1, typename T2>
void
Map<T1, T2>::copyToDevice()
{
  copy();

  _keys.copyToDevice();
  _values.copyToDevice();
}

template <typename T1, typename T2>
void
Map<T1, T2>::copyToDeviceNested()
{
  copy();

  _keys.copyToDeviceNested();
  _values.copyToDeviceNested();
}

template <typename T1, typename T2>
void
Map<T1, T2>::swap(Map<T1, T2> & map)
{
  get().swap(map.get());
  _keys.swap(map._keys);
  _values.swap(map._values);
  _offset.swap(map._offset);
}

template <typename T1, typename T2>
KOKKOS_FUNCTION dof_id_type
Map<T1, T2>::find(const T1 & key) const
{
  if (!size())
    return invalid_id;

  auto bucket = fnv1aHash(key) % size();
  auto begin = _offset[bucket];
  auto end = _offset[bucket + 1];

  for (dof_id_type i = begin; i < end; ++i)
    if (_keys[i] == key)
      return i;

  return invalid_id;
}

template <typename T1, typename T2>
void
dataStore(std::ostream & stream, Map<T1, T2> & map, void * context)
{
  using ::dataStore;

  dataStore(stream, map.get(), context);
  dataStore(stream, map._keys, context);
  dataStore(stream, map._values, context);
  dataStore(stream, map._offset, context);
}

template <typename T1, typename T2>
void
dataLoad(std::istream & stream, Map<T1, T2> & map, void * context)
{
  using ::dataLoad;

  dataLoad(stream, map.get(), context);
  dataLoad(stream, map._keys, context);
  dataLoad(stream, map._values, context);
  dataLoad(stream, map._offset, context);
}
#endif

} // namespace Kokkos
} // namespace Moose
(framework/include/kokkos/base/KokkosMap.h)
Note: If you need to use keys of a custom data type, a specialization of the template function Moose::Kokkos::fnv1aHash should be provided for the custom type.

Note: Moose::Kokkos::Map has not been optimized for performance. Beware of the performance impact of using it.

Note: All the creation and copy APIs of Moose::Kokkos::Array and Moose::Kokkos::Map can only be called on CPU.

Separate Execution Space

Because CPU and GPU code are compiled for different architectures, a function must be marked with KOKKOS_FUNCTION to be callable on GPU; an unmarked function cannot be called from GPU code. KOKKOS_FUNCTION instructs the compiler to generate both CPU and GPU versions of the function, so it remains callable on CPU as well. If the function needs to behave differently on CPU and GPU, you can use KOKKOS_IF_ON_HOST and KOKKOS_IF_ON_DEVICE to specify the two behaviors separately within the same function. For example:


KOKKOS_FUNCTION void * MyClass::getPointer() const
{
  KOKKOS_IF_ON_HOST(return _cpu_pointer;)
  KOKKOS_IF_ON_DEVICE(return _gpu_pointer;)
}

These macros can also be leveraged to throw an error on the CPU side if a function is intended to be GPU-only.

Currently, all GPU functions must be inlineable: a callee must be defined either in the same source file as its caller or in a header file visible to the caller. The relocatable device code (RDC) feature, which would allow GPU functions to be linked across translation units, is currently disabled due to several limitations imposed by PETSc and Kokkos. In fact, inlining functions has a non-negligible performance advantage on GPU and is encouraged when possible. However, it can force an excessive number of functions into header files, which reduces code reusability, modularity, and encapsulation while increasing dependencies and compilation time. Enabling RDC in Kokkos-MOOSE is therefore a high-priority item.

Dynamic Allocation

Dynamic allocation inside a GPU function is prohibitively expensive because of the massive parallelism of GPU. In CPU programming, it is common practice to dynamically allocate small chunks of memory on the fly as needed, but doing so on GPU means hundreds of thousands of threads performing dynamic allocations simultaneously, all of which are serialized due to the sequential nature of heap memory allocation. One way to circumvent this is to preallocate a large GPU array on CPU and assign a chunk of the array to each thread, but this is very cumbersome and often impossible. Such an approach can also waste memory, as the total number of threads can far exceed the number of threads physically active at a time. Another way is to define a fixed-size local array in a GPU function. For local arrays whose sizes are small and predictable, this approach gives the best performance and is encouraged. However, if the size can vary significantly, it is difficult to choose a bound in advance, and the array will likely have to be made excessively large.

Therefore, we provide a memory pool algorithm for this purpose that can assign non-overlapping memory chunks to threads and reuse freed chunks in parallel. A memory pool can be allocated through MooseApp::allocateKokkosMemoryPool(). It requires two arguments: the total memory pool size in bytes and the number of ways. The number of ways indicates how many parallel pools will be created. The total pool size is divided equally among the pools; for instance, if the total pool size is 1 GB and there are 1,000 pools, the size of each pool will be 1 MB. Having more parallel pools reduces the chance of conflicts between threads in each pool and thereby reduces the performance penalty, but at the same time it increases the chance of pool overflow when the requested chunk sizes are not properly balanced across threads. A performance study indicated that at least several thousand parallel pools are required to minimize the performance hit.

The memory pool can be accessed on GPU by calling kokkosMemoryPool(), which is available through inheriting the Moose::Kokkos::MemoryPoolHolder interface class. You can obtain a temporary chunk object by calling allocate() on the memory pool. It requires a template argument that specifies the data type and two function arguments: an arbitrary index and the size of the chunk in number of elements. The index is used to determine the parallel pool index by performing a simple modulo operation on it. Then, you can obtain the pointer to the assigned memory by calling get() on the chunk object. The allocated chunk is freed when the temporary chunk object is destroyed; therefore, the chunk object must remain alive while the memory is in use. The following example illustrates obtaining a memory chunk for 16 int values using the thread index:


// Request a chunk of 16 int values; the thread index selects the parallel pool
auto chunk = kokkosMemoryPool().allocate<int>(tid, 16);
// Pointer to the assigned memory, valid while 'chunk' is alive
int * memory = chunk.get();
Note: For performance, try to minimize requesting chunks and reuse the same chunk as much as possible.
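The scope-bound lifetime of a chunk can be made concrete with a toy, host-only sketch. This is not the actual Kokkos-MOOSE memory pool (the real one is multi-way and safe under massive parallelism); ToyPool and its bump-allocator behavior below are purely illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy single-threaded stand-in for a memory pool: a bump allocator whose
// offset is rewound when a chunk is released. This only illustrates the
// RAII chunk lifetime used by Kokkos-MOOSE, not its parallel algorithm.
class ToyPool
{
public:
  explicit ToyPool(std::size_t bytes) : _storage(bytes), _offset(0) {}

  class Chunk
  {
  public:
    Chunk(ToyPool & pool, std::size_t bytes) : _pool(pool), _begin(pool._offset)
    {
      _pool._offset += bytes;
    }
    Chunk(const Chunk &) = delete;
    Chunk & operator=(const Chunk &) = delete;
    // The chunk is freed automatically when this object goes out of scope
    ~Chunk() { _pool._offset = _begin; }
    void * get() { return _pool._storage.data() + _begin; }

  private:
    ToyPool & _pool;
    std::size_t _begin;
  };

  template <typename T>
  Chunk allocate(std::size_t n)
  {
    return Chunk(*this, n * sizeof(T));
  }

  std::size_t used() const { return _offset; }

private:
  std::vector<char> _storage;
  std::size_t _offset;
};
```

As in the real pool, the memory must only be used while the chunk object is alive; letting the chunk go out of scope returns its memory.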

Functor and Copy Constructor

Every parallel Kokkos object is a functor, which is a class with one or more overloaded operator()s that define the parallel work. Kokkos launches a parallel calculation by dispatching the parallel work of a functor to GPU, and a single copy of the functor is created and copied to GPU during the dispatch. Namely, the copy constructor of the functor is invoked at every dispatch. You can leverage this by defining a copy constructor for your object that performs any synchronization between CPU and GPU data that needs to happen at every dispatch. Also, because there is only a single instance of the functor on GPU, the member variables of the functor are shared by all threads. Therefore, member variables should not be used to store thread-private data. In addition, the functor is dispatched as a const object, which also copies all the member variables as const. This requires the member functions of the member variables and of the functor itself to be const functions if they are to be called on GPU.
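The per-dispatch copy can be illustrated without Kokkos: passing a functor by value, as a dispatch does, invokes its copy constructor, which is the hook for synchronization. MyFunctor and fake_dispatch below are hypothetical stand-ins, not Kokkos-MOOSE APIs:

```cpp
#include <cassert>

// Illustrative functor: the copy constructor acts as a per-dispatch hook,
// mimicking how Kokkos copies a functor to device at every dispatch
struct MyFunctor
{
  int dispatch_count = 0;

  MyFunctor() = default;
  MyFunctor(const MyFunctor & other) : dispatch_count(other.dispatch_count + 1)
  {
    // A real Kokkos-MOOSE functor could synchronize CPU/GPU data here
  }

  // The parallel work; dispatched as const, so operator() must be const
  void operator()(int /*i*/) const {}
};

// Stand-in for a parallel dispatch: taking the functor by value invokes
// its copy constructor exactly once, just as a real dispatch does
template <typename Functor>
int fake_dispatch(Functor f, int n)
{
  for (int i = 0; i < n; ++i)
    f(i);
  return f.dispatch_count;
}
```

Each call to fake_dispatch copies the functor once, regardless of how many threads (loop iterations) execute the work, which is why the copy constructor is a suitable once-per-dispatch synchronization point.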

Since dispatching invokes the copy constructor of the functor, Moose::Kokkos::Array is designed to be shallow-copied by default to avoid unnecessary memory copies during the functor dispatch. Namely, its copy constructor only performs a shallow copy and does not invoke the copy constructors of its entries. If the copy constructor of each entry must be explicitly invoked for a certain data type used in an array, you should define a specialization of the Moose::Kokkos::ArrayDeepCopy type trait template as follows:


namespace Moose
{
namespace Kokkos
{
template <>
struct ArrayDeepCopy<SomeType>
{
  static const bool value = true;
};
}
}
Note: The type trait also alters the behavior of Moose::Kokkos::Map, which is likewise shallow-copied by default, because it uses Moose::Kokkos::Array internally for storing its values.

Value Binding

Typically, variables retrieved through MOOSE APIs are bound locally in MOOSE objects as references to ensure they always see up-to-date values. However, this reference binding occurs on CPU during object construction, so the reference binds a CPU variable. As a result, the reference is not accessible on GPU due to the separate memory space, and thus binding variables for GPU should not be done using references. Instead, they should be copied to concrete instances (values) so that their values are copied to the GPU clone during the functor dispatch, which can be described as value binding.

With value binding, however, maintaining synchronization between the original variables and their local bindings is difficult. Therefore, we provide a template class Moose::Kokkos::ReferenceWrapper that holds the reference of a variable on CPU and provides access to the up-to-date value of the variable on GPU. It holds the reference of the CPU variable and an instance of the variable at the same time and leverages the copy constructor, which is invoked at every functor dispatch, as the hook to synchronize the two. Namely, the copy constructor copies the reference to the instance, which guarantees that the instance has the value of the variable at the moment of functor dispatch. The wrapper automatically returns the reference on CPU and the instance on GPU, depending on where it is being accessed.

For arithmetic values, there exists Moose::Kokkos::Scalar, a class derived from Moose::Kokkos::ReferenceWrapper that provides arithmetic operators operating directly on the stored value. Moose::Kokkos::PostprocessorValue is an alias for Moose::Kokkos::Scalar<PostprocessorValue>.

Listing 3: The Moose::Kokkos::ReferenceWrapper source code.


#pragma once

#ifdef MOOSE_KOKKOS_SCOPE
#include "KokkosHeader.h"
#endif

#include "MooseTypes.h"

namespace Moose
{
namespace Kokkos
{

/**
 * The Kokkos object that can hold the reference of a variable.
 * Reference of a host variable is not accessible on device, so if there is a variable that should
 * be stored as a reference but still needs to be accessed on device, define an instance of this
 * class and construct it with the reference of the variable.
 * This class holds the device copy as well as the host reference of the variable.
 * The copy constructor of this object that copies the host reference to the device copy is invoked
 * whenever a Kokkos functor containing this object is dispatched to device, so it is guaranteed
 * that the device copy is always up-to-date with the host reference when it is used on device
 * Therefore, the variable must be copy constructible.
 */
template <typename T>
class ReferenceWrapper
{
public:
  /**
   * Constructor
   * @param reference The writeable reference of the variable to store
   */
  ReferenceWrapper(T & reference) : _reference(reference), _copy(reference) {}
  /**
   * Copy constructor
   */
  ReferenceWrapper(const ReferenceWrapper<T> & object)
    : _reference(object._reference), _copy(object._reference)
  {
  }

#ifdef MOOSE_KOKKOS_SCOPE
  /**
   * Get the const reference of the stored variable
   * @returns The const reference of the stored variable depending on the architecture this function
   * is being called on
   */
  KOKKOS_FUNCTION operator const T &() const
  {
    KOKKOS_IF_ON_HOST(return _reference;)

    return _copy;
  }
  /**
   * Get the const reference of the stored variable
   * @returns The const reference of the stored variable depending on the architecture this function
   * is being called on
   */
  KOKKOS_FUNCTION const T & operator*() const
  {
    KOKKOS_IF_ON_HOST(return _reference;)

    return _copy;
  }
  /**
   * Get the const pointer to the stored variable
   * @returns The const pointer to the stored variable depending on the architecture this function
   * is being called on
   */
  KOKKOS_FUNCTION const T * operator->() const
  {
    KOKKOS_IF_ON_HOST(return &_reference;)

    return &_copy;
  }
  /**
   * Forward arguments to the stored variable's const operator() depending on the architecture this
   * function is being called on
   * @param args The variadic arguments to be forwarded
   */
  template <typename... Args>
  KOKKOS_FUNCTION auto operator()(Args &&... args) const -> decltype(auto)
  {
    KOKKOS_IF_ON_HOST(return _reference(std::forward<Args>(args)...);)

    return _copy(std::forward<Args>(args)...);
  }
#else
  /**
   * Get the const reference of the stored host reference
   * @returns The const reference of the stored host reference
   */
  operator const T &() const { return _reference; }
  /**
   * Get the const reference of the stored host reference
   * @returns The const reference of the stored host reference
   */
  const T & operator*() const { return _reference; }
  /**
   * Get the const pointer of the stored host reference
   * @returns The const pointer to the stored host reference
   */
  const T * operator->() const { return &_reference; }
  /**
   * Forward arguments to the stored host reference's const operator()
   * @param args The variadic arguments to be forwarded
   */
  template <typename... Args>
  auto operator()(Args &&... args) const -> decltype(auto)
  {
    return _reference(std::forward<Args>(args)...);
  }
#endif
  /**
   * Get the writeable reference of the stored host reference
   * @returns The writeable reference of the stored host reference
   */
  operator T &() { return _reference; }
  /**
   * Get the writeable reference of the stored host reference
   * @returns The writeable reference of the stored host reference
   */
  T & operator*() { return _reference; }
  /**
   * Get the writeable pointer of the stored host reference
   * @returns The writeable pointer to the stored host reference
   */
  T * operator->() { return &_reference; }
  /**
   * Forward arguments to the stored host reference's operator()
   * @param args The variadic arguments to be forwarded
   */
  template <typename... Args>
  auto operator()(Args &&... args) -> decltype(auto)
  {
    return _reference(std::forward<Args>(args)...);
  }

protected:
  /**
   * Writeable host reference of the variable
   */
  T & _reference;
  /**
   * Device copy of the variable
   */
  const T _copy;
};

} // namespace Kokkos
} // namespace Moose
(framework/include/kokkos/base/KokkosReferenceWrapper.h)
Note: The data type must be copy-constructible.
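The synchronize-on-copy idiom used by ReferenceWrapper can be sketched with a simplified, host-only stand-in. MiniRef below is illustrative only; snapshot() plays the role of the device copy and live() the role of the host reference:

```cpp
#include <cassert>

// Simplified host-only sketch of the ReferenceWrapper idiom: the copy
// constructor snapshots the referenced value into the internal copy,
// which is what keeps the device copy up-to-date at each dispatch
template <typename T>
class MiniRef
{
public:
  MiniRef(T & reference) : _reference(reference), _copy(reference) {}
  // Refreshes the internal copy from the live reference, as the real
  // wrapper does when the containing functor is dispatched
  MiniRef(const MiniRef & other) : _reference(other._reference), _copy(other._reference) {}

  const T & snapshot() const { return _copy; } // plays the role of the device copy
  T & live() { return _reference; }            // plays the role of the host reference

private:
  T & _reference;
  const T _copy;
};
```

The internal copy goes stale as soon as the referenced variable changes; only copy construction (the stand-in for a dispatch) refreshes it, which mirrors the guarantee described above.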

Static Polymorphism with Curiously Recurring Template Pattern (CRTP)

The primary challenge in porting MOOSE to GPU lies in its heavy reliance on dynamic polymorphism through virtual functions. Polymorphism is the centerpiece of the template method pattern, a behavioral design pattern in object-oriented programming that establishes the skeleton of an algorithm in a base class while permitting derived classes to override certain steps without altering the algorithm's overall structure, and it is the key design pattern of MOOSE.

However, virtual functions are not well-supported by GPU backends. Virtual functions are implemented through virtual function tables (vtables), tables of function pointers implicitly generated by the compiler and looked up at runtime; each object with virtual functions stores a pointer to its class's vtable, which is set by the constructor of the object. Because of this, using virtual functions on GPU requires the object to be instantiated on GPU so that the vtable can be populated with GPU function pointers. This is cumbersome to implement within the current MOOSE system, because an object instantiated on GPU is not accessible from CPU. Furthermore, not every GPU backend supports virtual functions.

Aside from portability, using virtual functions on GPU should be avoided where possible for performance reasons, especially when the virtual functions are called in critical paths. The vtable lookup itself incurs overhead, and calling through function pointers prevents inlining. GPU compilers rely heavily on inlining to generate optimized code, and being unable to inline functions will likely lead to a performance hit.

Therefore, in order to keep the design pattern of MOOSE while enabling GPU acceleration, the CRTP was adopted: the Kokkos version of every MOOSE object is implemented using the CRTP. The CRTP is a programming idiom in which a class inherits from a template instantiation of its base class parameterized by the derived class itself, a technique used to achieve static (compile-time) polymorphism. The following pseudo-code demonstrates a typical template method pattern implemented with dynamic polymorphism and its equivalent implementation with static polymorphism using the CRTP:

  • Dynamic polymorphism


class Base
{
public:
  void compute()
  {
    implementation();
  }
  virtual void implementation() { ... }
};

class Derived : public Base
{
public:
  virtual void implementation() override { ... }
};
  • Static polymorphism using the CRTP


template <typename Derived>
class Base
{
public:
  void compute()
  {
    static_cast<Derived *>(this)->implementation();
  }
  void implementation() { ... }
};

class Derived : public Base<Derived>
{
public:
  void implementation() { ... } // Hides implementation() of Base
};

In the pseudo-code, compute is the template method that provides the overall structure of the algorithm, and implementation is the hook method that derived classes override to implement specific steps of the algorithm. The call to the derived class method is resolved at runtime in dynamic polymorphism, whereas in the CRTP the derived class type is known at compile time through the template argument, which makes it possible to explicitly cast the this pointer to the derived class type and call the derived class method. Overriding a virtual function of the base class is mimicked in the CRTP by hiding the base class function, that is, redefining it in the derived class. If the function is not hidden by the derived class, the base class function, which is available through inheritance, is simply called. The behavior of a pure virtual function can also be reproduced by not defining the default function in the base class: if you do not define it in the derived class either, the compiler will report an error at the call site in the base class. One difference is that in dynamic polymorphism the base class can be instantiated by itself unless the function is declared pure virtual, while such self-instantiation is inherently impossible in the CRTP.
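For reference, here is a compilable version of the CRTP pseudo-code, using return values so the static resolution can be checked; the hypothetical Fallback class shows the inherited default being used when the hook is not hidden:

```cpp
#include <cassert>
#include <string>

// Compilable version of the CRTP pattern; names follow the pseudo-code
template <typename DerivedType>
class Base
{
public:
  // Template method: casts 'this' to the derived type known at compile time
  std::string compute() { return static_cast<DerivedType *>(this)->implementation(); }
  // Default hook; derived classes may hide it
  std::string implementation() { return "base"; }
};

// Hides Base::implementation; resolved at compile time, no vtable involved
class Derived : public Base<Derived>
{
public:
  std::string implementation() { return "derived"; }
};

// Does not hide the hook, so Base::implementation is called through inheritance
class Fallback : public Base<Fallback>
{
};
```

Removing the default implementation() from Base would mimic a pure virtual function: compiling Fallback::compute would then fail at the call site.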

The CRTP, however, has several caveats that require extra caution:

  • Inability of Checking Erroneous Function Hiding

The validity of hiding a function that has a default definition in the base class cannot be checked at compile time. While the compiler can detect an incorrect argument type or count to some extent, a mistake such as a misspelled function name silently leaves the function unused. Unlike in dynamic polymorphism, where inconsistent overriding can be detected at compile time by explicitly adding the override keyword, this cannot be prevented by the compiler.

  • Inability of Function Privatization

In the CRTP, the base class calls the derived class method by casting the this pointer to the derived class type and explicitly calling the method. As a result, the method being called must be public, which departs from the common practice of encapsulation.

  • Complexity of Multiple Inheritance

When there are multiple levels of inheritance, all the intermediate classes must be template classes as well. Because the class type of the final derived class must be seen by the base class through the template argument, the intermediate classes must relay the class type down to the base class. If you accidentally derive a class from a non-template class, the base class will not be able to see the derived class. You are encouraged to prevent unintended inheritance by explicitly adding the final keyword to the last derived class.
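The multi-level relay described in the last caveat can be sketched as follows; Base, Intermediate, and Concrete are hypothetical names:

```cpp
#include <cassert>
#include <string>

// Base sees the final derived type through the template argument
template <typename DerivedType>
class Base
{
public:
  std::string compute() { return static_cast<DerivedType *>(this)->implementation(); }
};

// Every intermediate class must itself be a template that relays the
// final derived type down to Base
template <typename DerivedType>
class Intermediate : public Base<DerivedType>
{
};

// 'final' guards against accidentally deriving further from this
// non-template class, which would hide the new type from Base
class Concrete final : public Intermediate<Concrete>
{
public:
  std::string implementation() { return "concrete"; }
};
```

If Intermediate were a plain class inheriting Base&lt;Intermediate&gt;, Base would cast this to Intermediate and never see Concrete::implementation, which is exactly the mistake the final keyword helps prevent.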

Separate Compilation

Last but not least, all Kokkos code must reside in source files with the special *.K extension. The Kokkos-MOOSE build system separates the compilation of the main MOOSE code and the Kokkos code, avoiding the need to compile the entire MOOSE package with a GPU compiler and making the compilation process modular. All the Kokkos source files across the framework, modules, and apps are selectively compiled by the GPU compiler and linked into shared libraries by the GPU linker, while the rest of the source files are compiled by the standard CPU compiler. The Kokkos shared libraries are then linked into the final executable by the CPU linker. The procedure is schematically shown in the following figure. When the Kokkos capabilities are disabled, the Kokkos source files are simply ignored, making the build portable to non-GPU systems.

Figure 1: Schematics of the separate compilation of Kokkos source files.

When Kokkos-related code (as opposed to Kokkos code itself) must be added to non-Kokkos source files, the build system conditionally defines a global preprocessor macro MOOSE_KOKKOS_ENABLED when the Kokkos capabilities are enabled. It can be used to guard Kokkos-related code in non-Kokkos source files. The build system also defines another macro, MOOSE_KOKKOS_SCOPE, which is only defined for the Kokkos source files. It is useful for guarding Kokkos code in header files that can be included in both Kokkos and non-Kokkos source files. However, it is important to note that MOOSE_KOKKOS_SCOPE should not be used to guard member variables or virtual member functions in a class, as doing so can make the CPU and GPU compilers see inconsistent class definitions with different sizes and cause memory misalignment. It is primarily used to guard inlined GPU functions with the KOKKOS_FUNCTION specifier in header files, or non-virtual CPU functions that must not be called from non-Kokkos source files.
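A minimal sketch of the guard pattern follows. In a real build the macro comes from the build system; it is defined manually here only to keep the example self-contained, and backend() is a hypothetical function:

```cpp
#include <string>

// In Kokkos-MOOSE this macro is defined by the build system when the
// Kokkos capabilities are enabled; defined here only for illustration
#define MOOSE_KOKKOS_ENABLED

// Hypothetical function in a non-Kokkos source file
std::string backend()
{
#ifdef MOOSE_KOKKOS_ENABLED
  // Kokkos-related code goes inside this guard so the file still
  // compiles when the Kokkos capabilities are disabled
  return "kokkos";
#else
  return "cpu-only";
#endif
}
```

Note that this pattern is safe precisely because it guards executable statements; guarding member variables or virtual member functions this way would produce the inconsistent class definitions warned about above.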