Description of the Ophidia big data stack
Ophidia is a research project on big data analytics for eScience. It provides a framework for parallel I/O and data analysis, an array-based storage model and a hierarchical storage organization to partition and distribute multidimensional scientific datasets ("datacubes"). It can be exploited in different scientific domains and with very heterogeneous sets of data.
In the context of the INDIGO-DataCloud project, Ophidia is being extended to support additional workflow interfaces (massive, for, parallel and new features. Ophidia is being exploited in several use cases, jointly with other components like Kepler, IAM, FutureGateway, Orchestrator.
Ophidia provides a server-side, parallel, declarative and extensible framework for eScience.
Designed for eScience. The n-dimensionality of scientific datasets requires tools that support specific data types (e.g. arrays) and primitives (e.g. slicing, dicing, pivoting, drill-down, roll-up) to properly enable data access and analysis. With regard to general-purpose analytics multidimensional systems, scientific data has a higher computing demand, which definitely leads to the need of having efficient parallel/distributed solutions to meet the (near) real-time analytics requirements. Ophidia supports different data formats (e.g. NetCDF, GeoTIFF, CSV) which allows managing data in different scientific domains.
Server-side approach. The current workflow exploited by most scientists is based on the sequence of steps “search, locate, download, and analyze”. This workflow will not be feasible at large scale and will fail for several reasons like: (i) ever-larger scientific datasets, (ii) time- and resource- consuming data downloads, and (iii) increased problem size and complexity requiring bigger computing facilities. At large scale, scientific discovery will strongly need to rely on data-intensive facilities close to data storage, parallel I/O systems and server-side analysis capabilities. Such an approach will move the analysis (and complexity) from the user’s desktop to the data centers and accordingly will change the infrastructure focus from data sharing to data analysis.
Parallel approach. The Ophidia analytics platform provides several MPI-based parallel operators to manipulate (as a whole) the entire set of fragments associated to a data cube. Some relevant examples include: datacube sub-setting (slicing and dicing), datacube aggregation, array-based primitives at the datacube level, datacube duplication, datacube pivoting, and NetCDF file import and export. To address scalability and enable parallelism, from a physical point of view, a data cube in Ophidia is then horizontally split into several chunks (called fragments) that are distributed across multiple I/O nodes. Each I/O node hosts a set of I/O servers optimized to manage n-dimensional arrays. Each I/O server, in turn, manages a set of databases consisting of one or more fragments. As it can be easily argued, tuning the levels in this hierarchy can also affect performance. For a specific data cube, the higher the product of the four levels is, the smaller the size of each fragment will be.
Declarative. In computer science, declarative programming is a programming paradigm, a style of building the structure and elements of computer programs, that expresses the logic of a computation without describing its control flow. Ophidia applies this style to minimize or eliminate side effects by describing what the system should accomplish in terms of the problem domain, rather than describing how to go about accomplishing it as a sequence of the programming language primitives (the how is left up to the platform’s implementation). Moreover Ophidia provides support for complex workflows / operational chains, with a specific syntax and a dedicated scheduler to exploit inter- and intra-task parallelism.
Extensible. Ophidia provides a large set of operators (50+) and primitives (100+) covering various types of data manipulations, such as sub-setting (slicing and dicing), data reductions, duplication, pivoting, and file import and export. However the framework is highly customizable: there is a minimal set of APIs through which it is possible to develop your own operator or primitive to implement and provide new algorithms and functionalities.