GenericIO
GenericIO is a write-optimized library for writing self-describing scientific data files on large-scale parallel file systems.
Reference
Habib, et al., HACC: Simulating Future Sky Surveys on State-of-the-Art Supercomputing Architectures, New Astronomy, 2015 (http://arxiv.org/abs/1410.2805).
Obtaining the Source Code
The most recent version of the source code is available by cloning this repository:
git clone https://xgitlab.cels.anl.gov/hacc/genericio.git
There is also a history of code releases: 2019-04-17 / 2017-09-25 / 2016-08-29 / 2016-04-12 / 2015-06-08
Building Executables / C++ Library
The executables and libgenericio can be built either with CMake (minimum version 3.10) or with GNUMake. The following executables will be built:
- frontend/GenericIOPrint: print data to stdout (non-MPI version)
- frontend/GenericIOVerify: verify and try reading data (non-MPI version)
- mpi/GenericIOBenchmarkRead: reading benchmark, works on data written with GenericIOBenchmarkWrite
- mpi/GenericIOBenchmarkWrite: writing benchmark
- mpi/GenericIOPrint: print data to stdout
- mpi/genericIORewrite: rewrite data with a different number of ranks
- mpi/genericIOVerify: verify and try reading data
Using CMake
Note that the executables / libraries will be located in build/<frontend/mpi>. CMake will use the compilers pointed to by the CC and CXX environment variables.
mkdir build && cd build
cmake ..
make -j4
Using Make
Make will create the executables / libraries under the main directory. Edit the CC, CXX, MPICC, and MPICXX variables in the GNUmakefile to change the compiler.
make
Installing the Python Library
The pygio library is pip-installable and works with mpi4py.
Requirements
Currently, a CMake version >= 3.11.0 is required to fetch dependencies during configuration. The pygio module also requires MPI libraries to be findable by CMake's FindMPI. The compiler needs to support C++17 (make sure that CC and CXX point to the correct compiler).
Install
The Python library can be installed by running pip in the main folder:
pip install .
It will use the compilers referred to by the CC and CXX environment variables. If the compiler supports OpenMP, the library will be threaded. Make sure to set OMP_NUM_THREADS to an appropriate value, in particular when using multiple MPI ranks per node.
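One simple way to pick a value is to divide the cores of a node evenly among the MPI ranks placed on it. The sketch below illustrates that arithmetic; the helper name threads_per_rank, the even-split policy, and the 64-core / 4-rank layout are assumptions for illustration and are not part of pygio. The right value depends on your machine and job layout.

```python
def threads_per_rank(cores_per_node: int, ranks_per_node: int) -> int:
    """Split the cores of one node evenly among its MPI ranks (at least 1 thread each)."""
    if cores_per_node < 1 or ranks_per_node < 1:
        raise ValueError("cores_per_node and ranks_per_node must be positive")
    return max(1, cores_per_node // ranks_per_node)


# Hypothetical layout: 64 cores per node, 4 MPI ranks per node.
n_threads = threads_per_rank(cores_per_node=64, ranks_per_node=4)

# Print a line that could be pasted into a job script before launching.
print(f"export OMP_NUM_THREADS={n_threads}")
```

With more ranks than cores on a node, the helper falls back to one thread per rank rather than returning zero.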
Output file partitions (subfiles)
If you're running on an IBM BG/Q supercomputer, the number of subfiles (partitions) is chosen automatically based on the I/O nodes. Otherwise, by default, the GenericIO library picks the number of subfiles using a fairly naive hostname-based hashing scheme. This works reasonably well on small clusters, but not on larger systems. On a larger system, you might want to set these environment variables:
GENERICIO_PARTITIONS_USE_NAME=0
GENERICIO_RANK_PARTITIONS=256
Here, the number of partitions (256 above) determines the number of subfiles used. If you're using a Lustre file system, for example, an optimal number of files satisfies:
# of files * stripe count ~ # OSTs
On Titan, for example, there are 1008 OSTs, and a default stripe count of 4, so we use approximately 256 files.
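The Titan numbers can be worked through as a short calculation. This is a minimal sketch of the rule of thumb above, not code from GenericIO: the function names are hypothetical, and rounding the estimate to the nearest power of two is only one assumed way of getting from 1008 / 4 = 252 to "approximately 256".

```python
def suggest_partitions(num_osts: int, stripe_count: int) -> int:
    """Estimate the subfile count from: # of files * stripe count ~ # OSTs."""
    if num_osts < 1 or stripe_count < 1:
        raise ValueError("num_osts and stripe_count must be positive")
    return max(1, round(num_osts / stripe_count))


def nearest_power_of_two(n: int) -> int:
    """Round n (>= 1) to the closest power of two; a convenience, not a requirement."""
    lower = 1 << (n.bit_length() - 1)
    upper = lower << 1
    return lower if (n - lower) < (upper - n) else upper


# Titan example from the text: 1008 OSTs, default stripe count of 4.
estimate = suggest_partitions(num_osts=1008, stripe_count=4)
partitions = nearest_power_of_two(estimate)

# Environment variables named in this section, with the computed value filled in.
env_settings = {
    "GENERICIO_PARTITIONS_USE_NAME": "0",
    "GENERICIO_RANK_PARTITIONS": str(partitions),
}
for name, value in env_settings.items():
    print(f"export {name}={value}")
```

The printed lines can be added to a job script. The OST count and stripe count for another file system have to be looked up there; the values here are the ones quoted for Titan.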