Commit 497aa9c3 authored by Carlos H. Brandt

Wrap up archive section in data store doc

parent b2d3fd6c
# Data Store
In this document we discuss the structure of the data archive and the metadata database.
Both are needed not only to keep queries and data access efficient, but also to record
the full _history_ each data product has gone through since its original (source) retrieval.
The bulk of the data stored in our archives/database is meant to support the services being
provided, _i.e._, to store the _reduced_ data.
Before that stage, _upstream_ data and intermediate states are temporary, kept only
long enough to create the _reduced_ version.
Service _products_ are kept for the long term either for reuse by our services
or for direct access and download.
### Archive
The data (files) are stored in a filesystem (or object store) according to their
processing level and their file format.
Multiple file formats can be associated with the same product (processing) level.
For example, reduced images are to be stored as GeoTIFF, ISIS cube, and JPEG.
The processing _level_ of a data product works like software versioning as it
supersedes the previous version.
In our archive, once _reduced/level-1_ data is processed and stored properly,
_upstream/level-0_ data can be deleted.
> The pre-processing logfile contains _all_ the steps and _metadata_ necessary
> to reproduce the same data product, from "upstream" to "reduced".
Each data product, at each given level, may have one or more storage formats, or
even data access services, associated with it (_e.g._, OGC WMS).
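The storage rules above can be illustrated with a minimal sketch. The directory layout (`<root>/<level>/<format>/<product_id>.<extension>`), the extension mapping, and the function names `product_path` and `drop_upstream` are all assumptions made for illustration, not the project's actual layout. The sketch only shows the idea that upstream data is deleted once every reduced format is in place.

```python
from pathlib import Path

# Hypothetical layout: <root>/<level>/<format>/<product_id>.<extension>
# The format names and extensions below are assumptions for illustration.
EXTENSIONS = {"geotiff": "tif", "isis": "cub", "jpeg": "jpg"}


def product_path(root, level, fmt, product_id):
    """Build the archive path of one file (one format) of a product."""
    return Path(root) / level / fmt / f"{product_id}.{EXTENSIONS[fmt]}"


def drop_upstream(root, product_id, upstream_file, formats=tuple(EXTENSIONS)):
    """Delete the upstream file only if every reduced format is stored."""
    reduced = [product_path(root, "reduced", f, product_id) for f in formats]
    if all(p.is_file() for p in reduced):
        Path(upstream_file).unlink(missing_ok=True)
        return True
    # At least one reduced file is missing: keep the upstream data.
    return False
```

Checking for all reduced files before deleting keeps the upstream copy available as long as the reduction is incomplete.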
Putting it in a diagram/panel for better visualization, we would have:
<th rowspan="2">level-0 (upstream)</th>
<th rowspan="3">level-1 (reduced)</th>
<th rowspan="2">level-2 (product)</th>
Such multiplicity of values (for file formats, for instance) requires the database,
from which this information is retrieved, to reflect that structure unambiguously.
Processing levels can be seen as _versions_ of the same (original) data product; they
represent the processing history of that data product over time.
Multiple file formats, in contrast, represent different views (or viewing modes) of a
given data product (level).
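One way to picture this distinction is a record that maps each level to the formats stored for it. The sketch below is an assumption about how such a record could look; the class name `ProductRecord`, its fields, and the level ordering are hypothetical and not taken from the project.

```python
from dataclasses import dataclass, field

# Assumed ordering of processing levels, from oldest to newest "version".
LEVEL_ORDER = ("upstream", "reduced", "product")


@dataclass
class ProductRecord:
    product_id: str
    # level name -> {format name -> storage path}
    levels: dict = field(default_factory=dict)

    def add_file(self, level, fmt, path):
        """Register one stored file (a 'view') for a given level."""
        self.levels.setdefault(level, {})[fmt] = path

    def current_level(self):
        """Return the highest level stored so far, or None if empty."""
        present = [lv for lv in LEVEL_ORDER if lv in self.levels]
        return present[-1] if present else None
```

Levels are keyed in the outer dictionary because they succeed one another, while formats sit side by side in the inner dictionary because they coexist for the same level.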
### DB
Databases considered to store (spatial) data: MongoDB or PostGIS.
MongoDB is a NoSQL, document-based database with support for a WGS84 spatial index.
The index can hold polygons representing product footprints, allowing _intersection_
... database it is, table fields must be pre-defined (different from document-based
databases) and groups/blocks of fields are represented by tables that relate through a
_primary_ key/field.
The database (db) is used internally as the index of data products, connecting the storage
location (path) where they are kept to the interfaces (visualization, processing tools)
that use them.
The db is modified every time a new data product is downloaded (_insert_ a new entry in the db)
or a data product is further processed to a higher level (_update_ the existing entry).
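The two modifications described above can be sketched with a small in-memory stand-in for the db. The class `ProductIndex`, its methods, and the `level` field are hypothetical names used only for illustration. With a real MongoDB collection accessed through pymongo, the corresponding calls would be `insert_one` and `update_one` (with a `$set` update).

```python
class ProductIndex:
    """In-memory stand-in for the metadata db, keyed by productId."""

    def __init__(self):
        self._entries = {}

    def insert(self, product_id, path):
        """Register a newly downloaded (upstream) product."""
        if product_id in self._entries:
            raise ValueError(f"{product_id} is already indexed")
        self._entries[product_id] = {
            "productId": product_id,
            "level": "upstream",
            "productPath": path,
        }

    def update(self, product_id, level, path):
        """Record that an existing product was processed to a higher level."""
        entry = self._entries[product_id]  # KeyError if never inserted
        entry["level"] = level
        entry["productPath"] = path

    def get(self, product_id):
        return self._entries[product_id]
```

A download triggers `insert`, and each later processing step triggers `update` on the same entry, so one entry follows a product through its levels.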
##### MongoDB
Besides providing support for a (WGS84) spatial index, MongoDB's flexibility regarding
the database schema suits a project under development like ours.
_Documents_<sup>\*\*</sup> in a MongoDB database do not (necessarily)
follow a schema, which allows a great deal of freedom in their content.
Data is organized in _collections_ (MongoDB's analogue of _tables_ in relational DBs),
collections host _documents_<sup>\*\*</sup>, and documents can be indexed if they all
have one (or more) _field(s)_ in common.
For instance, consider the following data/metadata associated to two products:

    {
      productId: "abc_01",
      geometry: {
        type: "Point",
        coordinates: [x, y]
      },
      productUrl: "http://localhost/pub/abc_01.ext"
    }

    {
      productId: "abc_99",
      geometry: {
        type: "Point",
        coordinates: [x, y]
      },
      productUrl: "http://localhost/pub/abc_99.ext",
      productPath: "/data/upstream/abc_99.ext"
    }

These two _documents_ can be inserted in a _collection_, with a spatial index created
on the `geometry` field and another (unique) index on `productId`.
Product `abc_99` has already been downloaded, which is why it has a `productPath` field,
unlike `abc_01`.
This is completely legal in MongoDB: whenever such a collection is queried for "all
products with 'productPath'", document `abc_99` will be retrieved but _not_ `abc_01`.
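The field-presence query can be mimicked in plain Python, as a sketch that needs no running database. The coordinate values and the helper `with_field` are placeholders introduced for illustration. In MongoDB itself, the equivalent filter would be `{"productPath": {"$exists": true}}`.

```python
# Two documents mirroring the example above; coordinates are placeholders.
documents = [
    {
        "productId": "abc_01",
        "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
        "productUrl": "http://localhost/pub/abc_01.ext",
    },
    {
        "productId": "abc_99",
        "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
        "productUrl": "http://localhost/pub/abc_99.ext",
        "productPath": "/data/upstream/abc_99.ext",
    },
]


def with_field(docs, name):
    """Return the documents that contain the given field."""
    return [doc for doc in docs if name in doc]


downloaded = with_field(documents, "productPath")
print([doc["productId"] for doc in downloaded])  # prints ['abc_99']
```

Because documents in the same collection need not share all fields, the mere presence of `productPath` doubles as a "downloaded" flag.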
<sup>\*\*</sup> A _document_ is a set of 'key:value' pairs, MongoDB's analogue of
_records_ in SQL DBs.
## Data and metadata
_Data_ (products) are stored in an object store or file system (the _archive_),
while _metadata_ are stored in a database.
#### Store data and metadata
Store final products in the archive and insert the respective metadata into the DB.