# GenericIO

GenericIO is a write-optimized library for producing self-describing scientific
data files on large-scale parallel file systems.

## Reference

Habib et al., "HACC: Simulating Future Sky Surveys on State-of-the-Art
Supercomputing Architectures," New Astronomy, 2015
(http://arxiv.org/abs/1410.2805).

## Obtaining the Source Code

The most recent version of the source is available by cloning this repository:
```bash
git clone https://xgitlab.cels.anl.gov/hacc/genericio.git
```

There is also a history of code
[releases](https://xgitlab.cels.anl.gov/hacc/genericio/-/releases):
[2019-04-17](https://xgitlab.cels.anl.gov/hacc/genericio/-/releases/20190417) /
[2017-09-25](https://xgitlab.cels.anl.gov/hacc/genericio/-/releases/20170925) /
[2016-08-29](https://xgitlab.cels.anl.gov/hacc/genericio/-/releases/20160829) /
[2016-04-12](https://xgitlab.cels.anl.gov/hacc/genericio/-/releases/20160412) /
[2015-06-08](https://xgitlab.cels.anl.gov/hacc/genericio/-/releases/20150608)

-----

## Building Executables / C++ Library

The executables and ``libgenericio`` can be built either with
[CMake](https://cmake.org/) (minimum version 3.10) or with
[GNU Make](https://www.gnu.org/software/make/). The following executables will
be built:

- ``frontend/GenericIOPrint``: print data to stdout (non-MPI version)
- ``frontend/GenericIOVerify``: verify and try reading data (non-MPI version)
- ``mpi/GenericIOBenchmarkRead``: reading benchmark, works on data written with ``GenericIOBenchmarkWrite``
- ``mpi/GenericIOBenchmarkWrite``: writing benchmark
- ``mpi/GenericIOPrint``: print data to stdout
- ``mpi/GenericIORewrite``: rewrite data with a different number of ranks
- ``mpi/GenericIOVerify``: verify and try reading data

**Using CMake**

Note that the executables and libraries will be located under
``build/frontend`` or ``build/mpi``. CMake uses the compilers specified by the
``CC`` and ``CXX`` environment variables.

```bash
mkdir build && cd build
cmake ..
make -j4
```

**Using Make**

Make will create the executables / libraries under the main directory. Edit the
``CC``, ``CXX``, ``MPICC``, and ``MPICXX`` variables in the GNUmakefile to
change the compiler.

```bash
make
```

## Installing the Python Library

The `pygio` library is pip-installable and works with `mpi4py`.

**Requirements**

Currently, **CMake version >= 3.11.0** is required to fetch dependencies
during configuration. The ``pygio`` module also requires MPI libraries that can
be found by CMake's FindMPI. The compiler needs to support **C++17** (make sure
that ``CC`` and ``CXX`` point to the correct compiler).

**Install**

The Python library can be installed by running pip in the **main folder**:
```bash
pip install .
```

The build uses the compilers referred to by the ``CC`` and ``CXX`` environment
variables. If the compiler supports OpenMP, the library will be threaded. Make
sure to set ``OMP_NUM_THREADS`` to an appropriate value, in particular when
using multiple MPI ranks per node.
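
As a rough illustration of picking a value for ``OMP_NUM_THREADS``, the sketch below splits the cores of a node evenly among the MPI ranks placed on it. The helper name ``threads_per_rank`` and the even-split heuristic are assumptions made for this example, not part of GenericIO; the right value depends on your system and job layout.

```python
import os


def threads_per_rank(cores_per_node: int, ranks_per_node: int) -> int:
    """Split a node's cores evenly among its MPI ranks (at least 1 thread each)."""
    if cores_per_node < 1 or ranks_per_node < 1:
        raise ValueError("cores_per_node and ranks_per_node must be positive")
    return max(1, cores_per_node // ranks_per_node)


if __name__ == "__main__":
    # Hypothetical layout: 4 MPI ranks per node. os.cpu_count() may return None.
    cores = os.cpu_count() or 1
    n_threads = threads_per_rank(cores, ranks_per_node=4)
    # Print a line that can be pasted into a job script.
    print(f"export OMP_NUM_THREADS={n_threads}")
```

The floor division keeps the total thread count on a node at or below the core count, which avoids oversubscription when several ranks share a node.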

-----

## Output File Partitions (Subfiles)

If you're running on an IBM BG/Q supercomputer, the number of subfiles
(partitions) is chosen automatically based on the I/O nodes. Otherwise, by
default, the GenericIO library picks the number of subfiles using a
fairly naive hostname-based hashing scheme. This works reasonably well on small
clusters, but not on larger systems. On a larger system, you might want to set
these environment variables:

```bash
GENERICIO_PARTITIONS_USE_NAME=0
GENERICIO_RANK_PARTITIONS=256
```

Here, the number of partitions (256 above) determines the number of subfiles
used. If you're using a Lustre file system, for example, a good number of files
satisfies:

```
# of files * stripe count  ~ # OSTs
```

On Titan, for example, there are 1008 OSTs and a default stripe count of 4
(1008 / 4 = 252), so we use approximately 256 files.
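
The rule of thumb above can be sketched in a few lines of Python. The function names here are hypothetical helpers, not part of GenericIO, and rounding to the nearest power of two is only an assumption about how one might arrive at 256 from the Titan numbers; the text does not prescribe it.

```python
def estimate_partitions(num_osts: int, stripe_count: int) -> int:
    """Return a file count such that files * stripe_count is roughly num_osts."""
    if num_osts < 1 or stripe_count < 1:
        raise ValueError("num_osts and stripe_count must be positive")
    return max(1, round(num_osts / stripe_count))


def nearest_power_of_two(n: int) -> int:
    """Round a positive integer to the closest power of two (ties go up)."""
    if n < 1:
        raise ValueError("n must be positive")
    lower = 1 << (n.bit_length() - 1)  # largest power of two <= n
    upper = lower << 1                 # smallest power of two > n
    return lower if (n - lower) < (upper - n) else upper


def partition_env(num_partitions: int) -> dict:
    """Build the two environment settings described in this section."""
    return {
        "GENERICIO_PARTITIONS_USE_NAME": "0",
        "GENERICIO_RANK_PARTITIONS": str(num_partitions),
    }


if __name__ == "__main__":
    # Titan example from the text: 1008 OSTs, default stripe count of 4.
    files = estimate_partitions(1008, 4)
    for name, value in partition_env(nearest_power_of_two(files)).items():
        print(f"export {name}={value}")
```

The printed ``export`` lines can be placed in a job script before the MPI launch command.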