NEANIAS Gitlab

Commit 9bf68beb authored by Carlos H. Brandt
Init script to compute datasets sizes

parent 6628bae9
"""
This script calculates the total size of PDS* datasets.

More generally, it is meant to read the sizes of files on a given file
server. It was initially designed for HTTP file servers; FTP support may
be added at some point if necessary.

PDS data archives are typically provided through HTTP file servers, structured
in a predictable way, mostly following the PDS-3 standard.

The script uses Requests and Pandas to read the index of files from a given
URL. From the table listing the files of interest (see the 'match' argument),
information such as file size is kept. Size values are given in Megabytes.

* PDS stands for Planetary Data System
"""
# Sizes are used in units of Megabytes
_factor = {
    'K': 0.001,  # If value in Kilobytes, divide by K
    'M': 1,      # If value in Megabytes, do nothing
    'G': 1000,   # If value in Gigabytes, multiply by K
}
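def url_table2df(url, match=None):
    """
    Return a DataFrame from the first table of 'url' with files matching 'match'
    """
    import pandas as pd

    # pandas expects a string or compiled regex for 'match'; fall back to
    # its own default ('.+', any non-empty text) when none is given
    if match is None:
        match = '.+'
    tabs = pd.read_html(url, match=match)
    tab = tabs[0]
    # Remove columns and rows with invalid values
    tab = tab.dropna(axis=1, how='all').dropna()
    # Transform file sizes to Megabytes: the last character is the unit,
    # the rest is the number (converted to float before scaling)
    tab['Size'] = tab['Size'].apply(
        lambda s: float(s[:-1]) * _factor[s[-1].upper()])
    return tab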
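
The size conversion used in url_table2df can be sketched as a standalone helper, which makes it easier to check without a live file server. This is only an illustration: the names `SIZE_FACTOR` and `size_to_megabytes` are hypothetical and not part of the commit, the sample values are invented, and treating a value without a unit suffix as a plain byte count is an assumption about how the server formats its listing.

```python
# Decimal factors relative to one Megabyte (same convention as _factor).
SIZE_FACTOR = {'K': 0.001, 'M': 1.0, 'G': 1000.0}


def size_to_megabytes(size):
    """Convert a size string such as '1.5G' or '200K' to Megabytes."""
    text = str(size).strip()
    unit = text[-1].upper()
    if unit in SIZE_FACTOR:
        # The last character is the unit, the rest is the number.
        return float(text[:-1]) * SIZE_FACTOR[unit]
    # Assumption: no unit suffix means the value is a byte count.
    return float(text) / 1e6


# Invented sample values, standing in for the 'Size' column of a listing.
sizes = ['1.5G', '200K', '34M', '512']
total_mb = sum(size_to_megabytes(s) for s in sizes)
print('Total size: {:.3f} MB'.format(total_mb))
```

A helper like this could be passed to `tab['Size'].apply(...)` in place of the lambda; entries that are not numeric at all (for example a `-` shown for directories) would still raise a `ValueError` and need to be filtered out first.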