Commit 386e2f53 authored by Carlos H. Brandt

Get dsets py script parse regex

parent 1598873d
@@ -50,7 +50,7 @@ Sources of our data are mostly PDS ([ref_pds]) data archives. See section [PDS
* source: NASA
* volume: 10 TB 
#### TBD:
<hr/>
* dataset: PLANMAP maps
* archive: PLANMAP
@@ -313,14 +313,6 @@ The IVOA registry is an important resource for data discovery, publishing our da
**TBD: _Data products/sets naming conventions to follow_**
**TBD: _Will search keywords be provided that optimize possibilities for re-use?_**
### Version numbers
NEANIAS Planetary services will provide dynamic, on-demand data products; as such, no versioning scheme is foreseen for its products other than an _expressive data product/file naming_ specification and a detailed logfile, _i.e._, the product definition.
@@ -390,100 +382,5 @@ Compositional data from multi- and hyperspectral imagers (mostly for Mars, the M
* Rasdaman
* PostGIS
* ?
### Data Archives
* PDS
<hr/>
The decision to use standard protocols to access the data is supported by two aspects:
1. a well-defined interface covering a consistent set of use cases;
2. the availability of working implementations of clients (and servers).
**_Where will the data and associated metadata, documentation and code be deposited? Preference should be given to certified repositories which support open access where possible._**
For the duration of the project, data will be stored by MEEO[@ref-meeo].
**TODO**: We still have to decide where the data and services will live after the
NEANIAS project.
**_Have you explored appropriate arrangements with the identified
repository?_**
**TODO**
**_If there are restrictions on use, how will access be provided?_**
**N/A**
**_Is there a need for a data access committee?_**
**N/A**
**_Are there well described conditions for access (i.e. a machine readable license)?_**
**N/A**
**_How will the identity of the person accessing the data be ascertained?_**
**N/A**
**_Making data interoperable_**
The data downloaded by the user is formatted as GeoTIFF, PDS data cubes, or
GeoPackage: open standards handled by virtually all available data analysis
software.
This allows users to analyse the data and combine it with other data sources.
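As a minimal sketch of that interoperability (the file names below are hypothetical, and `rasterio`/`geopandas` are just two of the many open-source libraries able to read these formats), a downloaded product can be inspected with off-the-shelf tools:

```python
import rasterio          # reads GeoTIFF (and other GDAL-supported rasters)
import geopandas as gpd  # reads GeoPackage vector layers

# Hypothetical file names, standing in for products downloaded from the service
with rasterio.open("mola_megdr_excerpt.tif") as src:
    dem = src.read(1)                     # first band as a numpy array
    print(src.crs, src.bounds, dem.shape)

footprints = gpd.read_file("ctx_footprints.gpkg")  # GeoPackage layer as a GeoDataFrame
print(footprints.head())
```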
**_What data and metadata vocabularies, standards or methodologies will you follow to make your data interoperable?_**
**_Will you be using standard vocabularies for all data types present in your data set, to allow inter-disciplinary interoperability?_**
**_In case it is unavoidable that you use uncommon or generate project specific ontologies or vocabularies, will you provide mappings to more commonly used ontologies?_**
### Increase data re-use (through clarifying licences)
**_How will the data be licensed to permit the widest re-use possible?_**
**_When will the data be made available for re-use? If an embargo is sought to give time to publish or seek patents, specify why and how long this will apply, bearing in mind that research data should be made available as soon as possible._**
**_Are the data produced and/or used in the project useable by third parties, in particular after the end of the project? If the re-use of some data is restricted, explain why._**
**_How long is it intended that the data remains re-usable?_**
**_Are data quality assurance processes described?_**
Further to the FAIR principles, DMPs should also address:
### 3. ALLOCATION OF RESOURCES
**_What are the costs for making data FAIR in your project?_**
**_How will these be covered? Note that costs related to open access to research data are eligible as part of the Horizon 2020 grant (if compliant with the Grant Agreement conditions)._**
**_Who will be responsible for data management in your project?_**
**_Are the resources for long term preservation discussed (costs and potential value, who decides and how what data will be kept and for how long)?_**
### 4. DATA SECURITY
**_What provisions are in place for data security (including data recovery as well as secure storage and transfer of sensitive data)?_**
**_Is the data safely stored in certified repositories for long term preservation and curation?_**
### 5. ETHICAL ASPECTS
**_Are there any ethical or legal issues that can have an impact on data sharing? These can also be discussed in the context of the ethics review. If relevant, include references to ethics deliverables and ethics chapter in the Description of the Action (DoA)._**
**_Is informed consent for data sharing and long term preservation included in questionnaires dealing with personal data?_**
### 6. OTHER ISSUES
**_Do you make use of other national/funder/sectorial/departmental procedures for data management? If yes, which ones?_**
* MongoDB
# datasets_parse_config.py -- archive definitions imported by the datasets parsing script below

# Regex pieces combined below to parse '<pre>'-formatted HTML directory listings
_re = {
    'ws': r'\s*',
    'ts': r'[0-9].*[ap]m',                      # timestamp ending in am/pm
    'size': r'\s*(?P<size>[0-9].*?)\s*',        # file size (named group 'size')
    'file': r'<a href=.*>(?P<name>{match}?)',   # file name (named group 'name'); '{match}' is filled later
}

# One entry per dataset: URL 'template', filename pattern to 'match',
# template arguments ('kwargs') and, optionally, how to parse the listing ('parser')
archives = {
    'ctx': {
        'template': 'https://pds-imaging.jpl.nasa.gov/data/mro/ctx/mrox_{i:04d}/data/',
        'match': 'IMG',
        'kwargs': {
            'i': range(5)
        },
        'parser': {
            'pandas': {'id': 'indexlist'}
        }
    },
    'hrsc': {
        'template': 'https://pds-geosciences.wustl.edu/mex/mex-m-hrsc-5-refdr-dtm-v1/mexhrs_2001/data/{i:04d}/',
        'match': 'h.*img',
        'kwargs': {
            'i': range(20, 25)
        },
        # 'parser': '[0-9].*[ap]m\s*([0-9].*?)\s*<a href=.*>(.*h.*img?)',
        # TODO: autoref kwargs to compile parser during load
        'parser': {
            're': _re['ts'] + _re['size'] + _re['file'],
        }
    },
    'mola-global': {
        'template': 'ftp://pds-geosciences.wustl.edu/mgs/mgs-m-mola-5-megdr-l3-v1/mgsl_300x/meg{i:03d}/',
        'match': 'img',
        'kwargs': {
            'i': [4, 16, 32, 64, 128]
        }
    },
    'mola-polar': {
        'template': 'ftp://pds-geosciences.wustl.edu/mgs/mgs-m-mola-5-megdr-l3-v1/mgsl_300x/polar/',
        'match': 'img'
    },
    # 'hirise': {
    #     'template': 'https://hirise-pds.lpl.arizona.edu/PDS/RDR/{dl}/{orb}/{dlorb}',
    #     'kwargs': {
    #         'dl': ['ESP', 'PSP'],
    #         'orb': list_dirs('*'),
    #         'dlorb': list_dirs('*')
    #     }
    # }
}
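# Minimal sketch of how the 'hrsc' parser regex is assembled and applied
# (the listing row below is hypothetical, mimicking a '<pre>'-style PDS
# directory listing; see 'url_regex2df' in the parsing script):
#
#   import re
#   pattern = (_re['ts'] + _re['size'] + _re['file']).format(match='h.*img')
#   row = '05/12/2010 10:23 am   1234567 <a href="h1234_0000_dt4.img">h1234_0000_dt4.img</a>'
#   m = re.search(pattern, row)
#   m.group('name'), m.group('size')   # -> ('h1234_0000_dt4.img', '1234567')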
@@ -15,9 +15,12 @@ information of size, for instance, is kept. Size values are given in Megabytes.
* PDS stands for Planetary Data System
"""
import os
import re
import itertools
import requests
import pandas
# Sizes are used in units of Megabytes
_factor = {
    'Size': {
@@ -28,113 +31,276 @@ _factor = {
}
def build_urls(template, **kwargs):
    """
    Return list of URLs built from 'template' and 'kwargs' combination

    'template' is a string like:
    ```
    https://pds-imaging.jpl.nasa.gov/data/mro/ctx/mrox_{i:04d}/data/
    ```
    to which 'kwargs' would be something like:
    ```
    { 'i': [1, 2, 10, 1234] }
    ```
    """
    def product_dict(**kwargs):
        """
        Generate all permutations of 'kwargs' values and pair them back with keys
        From https://stackoverflow.com/a/5228294/687896
        """
        keys = kwargs.keys()
        vals = kwargs.values()
        for instance in itertools.product(*vals):
            yield dict(zip(keys, instance))

    urls = [template.format(**d) for d in product_dict(**kwargs)]
    # TODO:
    # - remove repeated '/' in case a kwarg is an empty string list ([""])
    return urls
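# Example (values chosen for illustration): expanding the CTX template for two volumes
# >>> build_urls('https://pds-imaging.jpl.nasa.gov/data/mro/ctx/mrox_{i:04d}/data/', i=[1, 2])
# ['https://pds-imaging.jpl.nasa.gov/data/mro/ctx/mrox_0001/data/',
#  'https://pds-imaging.jpl.nasa.gov/data/mro/ctx/mrox_0002/data/']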
def http_table2df(url, match=None, parser=None):
    """
    Return a DF from first table of 'url' with files matching 'match'.

    'parser' is actually not being used yet, may provide extra/pandas options.
    """
    def url_exists(url):
        request = requests.get(url)
        if request.status_code != 200:
            print("Web site '{}' not accessible".format(url))
        else:
            print("Web site '{}' OK".format(url))
        return request.status_code

    url_exists(url)
    try:
        tabs = pandas.read_html(url, match=match, attrs=parser)
        tab = tabs[0]
        # Remove columns and rows with invalid values
        tab = tab.dropna(axis=1, how='all').dropna()
    except Exception as e:
        print(e)
        tab = None
    return tab


# alias for deprecation
url_table2df = http_table2df
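# Example (requires network access; URL and table id taken from the 'ctx'
# entry in datasets_parse_config):
# >>> tab = http_table2df('https://pds-imaging.jpl.nasa.gov/data/mro/ctx/mrox_0001/data/',
# ...                     match='IMG', parser={'id': 'indexlist'})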
def url_regex2df(url, match, parser):
    """
    Parse a '<pre/>' defined table in an HTML document (from 'url').

    'match' is used to filter the lines. In '<pre/>' pages, line breaking
    <br> is used to break the document/string into lines.
    'parser' is expected to provide a regex with 'name' and 'size' named
    groups in it: `(?P<name>...)`, `(?P<size>...)`, ...
    You can use 'match' inside 'parser' with a '{match}' placeholder
    (filled via str.format).
    """
    def crop_html_pre(html):
        # re.DOTALL lets the '<pre>' block span multiple lines
        pattern = '<pre>(.*?)</pre>'
        return re.search(pattern, html, re.DOTALL).group(1)

    def filter_matching(pretext, pattern):
        entries = pretext.split('<br>')
        return [v for v in entries if re.search(pattern, v.lower())]

    def parse_listing(entries, pattern):
        out = []
        for row in entries:
            match = re.search(pattern, row.lower().strip())
            filename = match.group('name')
            filesize = match.group('size')
            out.append((filename, filesize))
        return out

    html = requests.get(url).text
    try:
        pre = crop_html_pre(html)
    except:
        # no '<pre>' block in the page
        return None
    fls = filter_matching(pre, match)
    try:
        files_size = parse_listing(fls, parser.format(match=match))
    except:
        # a row did not match the 'parser' regex
        return None
    tab = pandas.DataFrame(files_size, columns=('Name', 'Size'))
    return tab
def read_http_table(url, match=None, parser=None):
    """
    Return 'http_table2df'/'url_regex2df' output with "Size" column in Megabytes

    Conversion uses the module's constant '_factor'
    """
    if not parser or 'pandas' in parser:
        # guard against 'parser' being None when falling back to the pandas parser
        pandas_opts = parser['pandas'] if parser else None
        tab = url_table2df(url, match=match, parser=pandas_opts)
    else:
        tab = url_regex2df(url, match, parser=parser['re'])
    if tab is None:
        return None
    # Transform file sizes to Megabytes
    col = 'Size'
    fx = _factor[col]
    try:
        tab[col] = tab[col].apply(
            lambda s: float(s[:-1]) * fx.get(s[-1].upper(), 10**-6)
        )
    except Exception as e:
        print(tab)
        print(e)
    return tab


# alias for deprecation
read_url_table = read_http_table
def read_ftp_table(url, match=None):
    """
    Return a table of (Name, Size) for files under the FTP directory 'url'

    'match' is a substring that file names must contain. Sizes are in Megabytes.
    """
    from ftplib import FTP

    def urlparse(url):
        from urllib.parse import urlparse
        return urlparse(url)

    o = urlparse(url)
    ftp = FTP()
    ftp.connect(o.netloc)
    ftp.login()
    ftp.cwd(o.path)
    ftp.voidcmd('TYPE I')  # binary mode, so the SIZE command is accepted by more servers
    dir_list = ftp.nlst()
    files = []
    for entry in dir_list:
        if match and match not in entry:
            continue
        try:
            size = float(ftp.size(entry))
        except:
            # directories (or entries without SIZE support) are skipped
            continue
        files.append((entry, size))
    tab = pandas.DataFrame(data=files, columns=('Name', 'Size'))
    col = 'Size'
    # FTP file sizes are in Bytes, transform to Megabytes
    tab[col] = tab[col] * 10**-6
    return tab
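# Example (requires network access; URL taken from the 'mola-polar' entry in
# datasets_parse_config):
# >>> tab = read_ftp_table('ftp://pds-geosciences.wustl.edu/mgs/mgs-m-mola-5-megdr-l3-v1/mgsl_300x/polar/', match='img')
# >>> tab.columns.tolist()
# ['Name', 'Size']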
def read_urls(urls, match=None, parser=None):
    """
    Run 'read_ftp_table' or 'read_http_table' for each URL in 'urls';
    'match' and 'parser' are applied to all of them.
    """
    tabs = []
    for url in urls:
        if url[:3].lower() == 'ftp':
            tab = read_ftp_table(url, match=match)
        else:
            tab = read_http_table(url, match=match, parser=parser)
        if tab is not None:
            tab['Size'] = tab['Size'].astype(int)
        # append also None results so 'tabs' stays aligned with 'urls'
        tabs.append(tab)
    return tabs
OUTDIR = 'dset_tabs'

def write_tabs(url, tab, outdir=OUTDIR):
    """
    Write down 'tab' as CSV files in same path as 'url' under 'outdir'

    'tab' is expected to be a Pandas Dataframe
    """
    from urllib.parse import urlparse
    import pathlib
    o = urlparse(url)
    path_ = outdir + o.path
    print(path_)
    pathlib.Path(path_).mkdir(parents=True, exist_ok=True)
    filename = os.path.join(path_, 'indexdf.csv')
    tab.to_csv(filename)
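# Example: for url 'https://pds-imaging.jpl.nasa.gov/data/mro/ctx/mrox_0001/data/'
# the table is written to 'dset_tabs/data/mro/ctx/mrox_0001/data/indexdf.csv'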
def run(template, kwargs=None, match=None, parser=None):
    """
    - template: URL with placeholders filled from 'kwargs', if any
    Optionals:
    - kwargs: dict of values for the placeholders in 'template'
    - match: string pattern (regex) to match against lines/files from HTML
    - parser: 'pandas' is an option, or a regex for retrieving "Name" and "Size".
              Unlike FTP, HTTP-served pages are mostly non-standard; here a
              regex using the named groups "name"/"size" can be given:
              (`(?P<name>...)`, `(?P<size>...)`)
              If you want to access a `<table/>` element, use 'pandas' (or None)
    """
    if kwargs:
        urls = build_urls(template, **kwargs)
    else:
        urls = [template]
    tabs = read_urls(urls, match, parser)
    for url, tab in zip(urls, tabs):
        if tab is None:
            print("URL {} has no table".format(url))
            continue
        print("Writing tab from {}".format(url))
        write_tabs(url, tab)
    return tabs
if __name__ == '__main__':
    import datasets_parse_config as dc
    archives = dc.archives

    for dset in archives.keys():
        pars = archives[dset]
        print('Running {}'.format(dset))
        tabs = run(**pars)
        total = 0
        for tab in tabs:
            if tab is None:
                continue
            # plain int so the '{:d}' formatting below always works
            size = int(tab['Size'].sum())
            total += size
            print('Partial sizes in {!s}: {:d} MB'.format(dset, size))
        print('Total size of {!s}: {:.2f} GB'.format(dset, float(total/1000)))