These informal notes elaborate on an earlier coarse proposal, in an attempt to address some of the so-called perceived shortcomings of FITS.
Each HDU in a MEF can always be transformed into a SEF (or exceptionally into a vanilla PIF if it is the PHDU), or "extracted" with a reasonably simple procedure.
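As a rough illustration of how simple such an extraction can be, the following sketch locates the HDUs of a FITS byte buffer and returns the bytes of one of them. It relies only on the standard size rule (header and data each padded to 2880-byte blocks; data size |BITPIX|/8 × GCOUNT × (PCOUNT + NAXIS1 × ... × NAXISn)). The function names `hdu_spans` and `extract_hdu` are hypothetical, random groups are ignored, and the value parsing is deliberately naive (adequate only for the integer keywords used here).

```python
BLOCK = 2880   # FITS logical record size in bytes
CARD = 80      # length of one header card


def pad(nbytes):
    """Round nbytes up to an integral number of 2880-byte blocks."""
    return -(-nbytes // BLOCK) * BLOCK


def read_header(buf, start):
    """Return (keyword dict, padded header length) for the HDU at `start`."""
    keys = {}
    pos = start
    while True:
        raw = buf[pos:pos + CARD]
        if len(raw) < CARD:
            raise ValueError("truncated header")
        card = raw.decode("ascii")
        pos += CARD
        name = card[:8].strip()
        if name == "END":
            break
        if card[8:10] == "= ":
            # Naive: drops anything after '/', fine for integer keywords only.
            keys[name] = card[10:].split("/")[0].strip()
    return keys, pad(pos - start)


def data_size(keys):
    """Padded size in bytes of the data part described by the header."""
    naxis = int(keys["NAXIS"])
    if naxis == 0:
        return 0
    nelem = 1
    for i in range(1, naxis + 1):
        nelem *= int(keys[f"NAXIS{i}"])
    pcount = int(keys.get("PCOUNT", 0))
    gcount = int(keys.get("GCOUNT", 1))
    return pad(abs(int(keys["BITPIX"])) // 8 * gcount * (pcount + nelem))


def hdu_spans(buf):
    """List of (byte offset, total size) for every HDU in the buffer."""
    spans = []
    pos = 0
    while pos < len(buf):
        keys, header_len = read_header(buf, pos)
        size = header_len + data_size(keys)
        spans.append((pos, size))
        pos += size
    return spans


def extract_hdu(buf, n):
    """Raw bytes (header plus data) of the n-th HDU, counting the PHDU as 0."""
    start, size = hdu_spans(buf)[n]
    return buf[start:start + size]
```

To turn the extracted bytes into a stand-alone SEF one would still have to prepend a dataless primary HDU; that step is omitted here.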
What a particular project "stuffs" into extensions of a MEF, and what instead is
kept in separate FITS files (maybe distributed in a flat tar.gz, maybe organized
in subdirectories) depends on the choices and requirements of the particular project.
However, analysis software should be able to treat individual "data-full" HDUs as
"atomic units", irrespective of whether they are a PIF, a SEF or a component of
a MEF.
All the above file arrangements share the fact that the position of the header in
front of the relevant data does not make it easy to edit the header in place (in
particular to extend it, but also to curtail it) in all cases where the edit results
in a different number of 2880-byte FITS blocks.
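The block arithmetic behind this limitation can be sketched as follows: a header card is 80 bytes, so a 2880-byte block holds 36 cards (including the final END card), and an edit can be done in place only if the number of blocks stays the same. The helper names `header_blocks` and `edit_in_place_possible` are illustrative, not part of any existing library.

```python
CARD = 80                        # bytes per header card
BLOCK = 2880                     # bytes per FITS block
CARDS_PER_BLOCK = BLOCK // CARD  # 36 cards fit in one block


def header_blocks(n_keywords):
    """Blocks needed for n_keywords cards plus the mandatory END card."""
    return -(-(n_keywords + 1) // CARDS_PER_BLOCK)  # ceiling division


def edit_in_place_possible(n_before, n_after):
    """True if the edited header occupies the same number of blocks."""
    return header_blocks(n_before) == header_blocks(n_after)
```

For instance, growing a header from 20 to 30 keywords stays within one block, while adding a single keyword to a header that already has 35 forces every following data block to be shifted.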
The typical content of a header in the various cases can be displayed by clicking
on it (it appears on the right-hand side of the page).
I haven't considered above a particular type of extension which could be termed
"dataless HDU", which could be used to store just a collection of ancillary keywords.
The possible organization of the MHDU is described in the next section.
The present subsection describes only the arrangement and classification of FITS
file layouts as they result from the insertion of MHDUs.
The metadata in the MHDU could be relevant to the entire FITS file, or to each
individual "true" HDU (i.e. non-dataless). In the latter case each true HDU can
be followed by its own MHDU.
I argue that typical working files should be either PIFs (plain images) or
SEFs (single BINTABLEs or TABLEs), or in any case relatively simple
MEFs. As such, a working file should have just a global MHDU at the end. A working
file is one which is intended to be manipulated (in place) by its users.
On the other hand, for archival and transfer purposes (or just for data
organization), it might be appropriate to have complex MEFs, which could be made
by "packing together" several working files (inclusive of their MHDUs).
We could define such files as FITS Archives (FAR).
What follows is a tentative preliminary classification of possible "enhanced" FITS
files.
Although a dataless HDU appended at the end of a MEF could overcome the drawback of
headers not being easily editable (4D), I am not aware of it being in widespread usage.
HDUm: extension header, no data array
Possible enhancement
The proposed enhancement would allow easy in-place editing of headers
without having to re-write too many data records (ideally none).
Essentially the idea is to keep only mandatory keywords in the HDU header
and move all "ancillary" keywords and general purpose metadata to a separate
HDU located at the end, and termed "metadata HDU" (MHDU).
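A minimal sketch of this split, assuming the header is available as a list of (keyword, value) pairs. The set of keywords regarded as "mandatory" below is only an illustrative guess based on the structural keywords of the current standard; the function name `split_keywords` is hypothetical.

```python
import re

# Structural keywords that must stay in the HDU header (illustrative list).
MANDATORY = re.compile(
    r"^(SIMPLE|XTENSION|BITPIX|NAXIS\d*|EXTEND|PCOUNT|GCOUNT"
    r"|TFIELDS|TFORM\d+|TBCOL\d+)$"
)


def split_keywords(cards):
    """Split (keyword, value) pairs into (kept in HDU header, moved to MHDU)."""
    kept, moved = [], []
    for name, value in cards:
        if MANDATORY.match(name):
            kept.append((name, value))
        else:
            moved.append((name, value))
    return kept, moved
```

With such a split the HDU header has a fixed, small size, and all later edits touch only the MHDU at the end of the file.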
There is room to argue, mainly for backward compatibility reasons, whether
some more important keywords could be maintained in the HDU header. E.g.
presently defined WCS keywords ("WCS V1.0") remain in the HDU header, while
newly defined WCS keywords (e.g. exploiting the long keyword name and long keyword
value improvements), i.e. a sort of "WCS V2.0", belong to the MHDU.
However this would break the easy editability requirement, since the "private"
MHDUs will be followed by other entire HDUs and not be last in the FITS file.
FARs should not be edited by general users, but just by their creators.
We could also imagine a "far" and "unfar" utility to create a FAR from a collection
of individual FITS files, and to extract selected individual FITS files from the FAR.
A FAR should contain, ideally as its first useful HDU, an index HDU (IHDU) with
the list, size and location of the archived components.
The possible organization of the IHDU is described in a separate section.
The header differs only in the first 1-2 lines and includes from 13-14 to 15-16 keywords (i.e. no more than one 2880-byte block).
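| new METADATA extension | BINTABLE extension |
|---|---|
| XTENSION= 'METADATA' | XTENSION= 'BINTABLE' |
| | EXTNAME = 'METADATA' |
| BITPIX = 8 | BITPIX = 8 |
| NAXIS = 2 | NAXIS = 2 |
| NAXIS1 = 1240 | NAXIS1 = 1240 |
| NAXIS2 = 13 | NAXIS2 = 13 |
| PCOUNT = 0 | PCOUNT = 0 |
| GCOUNT = 1 | GCOUNT = 1 |
| TFIELDS = 2 or 3 | TFIELDS = 2 or 3 |
| TTYPE1 = 'KEYWORD' | TTYPE1 = 'KEYWORD' |
| TFORM1 = '40A' | TFORM1 = '40A' |
| TTYPE2 = 'VALUE' | TTYPE2 = 'VALUE' |
| TFORM2 = '310U' | TFORM2 = '310U' |
| TTYPE3 = 'KTYPE' | TTYPE3 = 'KTYPE' |
| TFORM3 = '3A' | TFORM3 = '3A' |
| END | END |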
In the following example dump I display the optional third column
KTYPE as the second one, for the sake of clarity.
Splitting of keyword value on several lines is a browser-dependent display artifact,
but long values actually are intended to occupy a single table row in the FITS file.
| KEYWORD | KTYPE | VALUE |
|---|---|---|
| DATE | A | 2013-12-05T17:14:42 |
| Observer | A | Ben Sugerman O'Hara |
| N_REPEAT_OBSERVATIONS | I | 2 |
| ESO.TEL.AIRM.START | E | 1.134 |
| PixelScale1 | D | -1.22334117768332 |
| BPIXFILE | A | /dsops/ap/sdp/opus/prs_run/tmp//ACIS_F_L1_08784n137/output/acisf00317_000N001_bpix1.fits |
| LICENSE | A | NRAO/VLA Archive Survey images (NVAS for short) come "as is" with no warranty and may only be used for scientific and personal purposes. Usage, publication and redistribution for scientific purposes is free of charge, provided that the proper image credit below is included. Any other usage of NVAS images MUST be approved by the NRAO director. |
| FITSINFO | A | FITS (Flexible Image Transport System) format is defined in 'Astronomy and Astrophysics', volume 376, page 359; bibcode: 2001A&A...376..359H |
| DISTORT_COEFFS | 7E | 2.9656E-06 3.7746E-09 2.1886E-05 -1.6847E-07 -2.3863E-05 -8.561E-09 -1.4172E-03 |
| LNG_CORR | 18D | 1. 4. 4. 2. 0.001557694712335736 0.3057102603054315 0.005633525451388214 0.1588647710905971 -2.724118002977662E-4 -5.016761279169355E-5 -6.340301962177909E-4 -1.079719294386445E-4 1.534942429540643E-5 -2.213747676057557E-4 2.327260049467782E-5 -4.002114665208721E-5 -4.983803710224544E-5 8.510395910450919E-6 |
| LAUNCHSITE | A | Байконур |
| ARandomUnicodeString | A | ႠႡႢႣ ԱԲԳԴ ΑΒΓΔ ᚠᚢᚦᚪ アイウ₨ ᄓᄔᄕᄖ 丳丵为主 कखगघ ሀሐሠሰ ཀཁགགྷ |
| Sissa.Chessboard.Rice.Grain.Count | undef | 18446744073709551615 |
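A reader could use the optional KTYPE column to convert the textual VALUE into a native type. The sketch below assumes, by analogy with the TFORM codes, that A means string, I integer, E and D floating point, and that a leading repeat count (as in 7E or 18D) announces a blank-separated vector; any other KTYPE (like undef) leaves the value as text. The function name `parse_value` is hypothetical.

```python
import re


def parse_value(ktype, text):
    """Convert a METADATA VALUE string according to its KTYPE (assumed codes)."""
    match = re.fullmatch(r"(\d*)([AIED])", ktype)
    if match is None:
        return text                      # e.g. 'undef': keep the raw text
    count, code = match.groups()
    if code == "A":
        return text                      # character string, any length
    convert = int if code == "I" else float
    if count and int(count) > 1:
        values = [convert(token) for token in text.split()]
        if len(values) != int(count):
            raise ValueError(f"expected {count} values, got {len(values)}")
        return values
    return convert(text)
```

Applied to the rows above, `parse_value("I", "2")` gives an integer and `parse_value("7E", ...)` on the DISTORT_COEFFS row gives a list of seven floats, while the undef row stays a string and so avoids any integer overflow question in the reader.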
On the other hand the need to support non-Latin scripts, or just a few odd characters
here and there, is not a driving scientific need. Therefore I see no reason to
add general Unicode support to FITS. Unicode support can remain confined essentially
to keyword values, and even there would probably have a limited scope.
That's why in the previous section I limited the introduction of
a Unicode data format to the scope of the METADATA extension
(suggesting to make it nominally different from the general BINTABLE).
My experience with Unicode is limited, therefore what follows is to be regarded as highly tentative. Opinions and suggestions by real experts will be greatly appreciated.
According to the Unicode FAQ each Unicode code point (i.e. character or glyph) is uniquely described by a 21-bit code. These codes are mapped to formats like UTF-8, UTF-16 or UTF-32. Since 21 bits are a rather odd size, the code charts give a hexadecimal description, which can be used in HTML escapes like `&#x10A0;` or Java escapes like `\u10A0` (mainly to handle a few odd characters in a name, much as one sometimes uses LaTeX escapes; this cannot be considered a general solution).
UTF-32 looks the simplest though most space-expensive solution to deal with Unicode. Each code point is encoded as 4 bytes, which is always wasteful (32 bits instead of 21) and very wasteful for ASCII text (32 bits instead of 8). Conversely it has the advantage that the length in bytes b is fixed and easily predictable from the length in code points p : b=4p. However UTF-32 (and UTF-16) have the endianness problem.
UTF-8 is probably the most widespread solution for dealing with Unicode (at least for plain text processing like e-mail etc.). It is a byte-oriented protocol, exempt from endianness problems. It is also the most compact representation, with one code point being represented by 1, 2, 3 or 4 bytes in decreasing order of usage frequency. In particular pure ASCII (7-bit ASCII including the restricted set commonly used by FITS) is preserved identically in UTF-8. The main "difficulty" is that one cannot a priori predict the length in bytes of the encoding of a string of p code points, unless one actually performs the conversion of a specific string.
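The difference between the two encodings is easy to check with the standard Python codecs. The sketch below returns, for a string, its length p in code points, its UTF-8 length in bytes (which depends on the content) and its UTF-32 length in bytes (always 4p). The big-endian codec is named explicitly so that no byte-order mark is added; `encoded_lengths` is just an illustrative helper name.

```python
def encoded_lengths(text):
    """Return (code points, UTF-8 bytes, UTF-32 bytes) for a string."""
    p = len(text)                          # Python strings count code points
    utf8 = len(text.encode("utf-8"))       # 1 to 4 bytes per code point
    utf32 = len(text.encode("utf-32-be"))  # fixed 4 bytes per code point, no BOM
    return p, utf8, utf32


print(encoded_lengths("FITS"))      # pure ASCII
print(encoded_lengths("Байконур"))  # Cyrillic, 2 bytes per character in UTF-8
```

A fixed-width table column sized in UTF-32 can therefore be dimensioned from the number of characters alone, whereas a UTF-8 column must be dimensioned for the worst case or after encoding the actual strings.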
I propose for the METADATA extension (and optionally for the entire BINTABLE extension) to support at least one, or both, of the following TFORM:
The default action of a FITS reader (displayer) on a Unicode sequence containing non-ASCII characters could be to output it as ASCII, inserting HTML or Java escape sequences.
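Such a default action might look like the following sketch, which leaves 7-bit ASCII untouched and replaces every other code point with a hexadecimal HTML character reference or with Java-style `\uXXXX` escapes (using a surrogate pair for code points above U+FFFF, since Java escapes are 16-bit). Both function names are hypothetical.

```python
def to_html_escapes(text):
    """Replace non-ASCII code points by hexadecimal HTML character references."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)


def to_java_escapes(text):
    """Replace non-ASCII code points by Java-style \\uXXXX escapes."""
    out = []
    for c in text:
        cp = ord(c)
        if cp < 128:
            out.append(c)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            # Code points beyond the BMP need a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append(f"\\u{high:04X}\\u{low:04X}")
    return "".join(out)
```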
Note that I am not considering any more a possibility I thought of in the past, i.e. to store the actual Unicode text (of variable byte length) in the "heap area" (e.g. as a PB(m) variable-length array), since some people complain about using variable-length arrays for real data, and probably won't like at all to use them for "prompt usage" information like metadata keywords.
The wish to have preview thumbnails attached to a given FITS file was raised
in the shortcoming list (item 3G). Of course it does not make sense to have a preview
for a TABLE or BINTABLE, so I assume the request referred to FITS
images only.
A quick preview, like e.g. a bitmapped image, is something supported e.g. in PostScript
files, although not compulsory, and not often used.
Potentially a preview could be managed by a FITS viewer (like ds9) reading the FITS
file and displaying it at low resolution (the entire file at a zoom setting such as to
fill the current window size). But possibly the requestor meant a preview in one of the
common image display formats like GIF, JPEG or PNG.
Therefore ultimately the request translates into a way to attach a non-FITS file to a FITS file, i.e. into a modern way of managing foreign files.
Such a (relatively) "modern" way need not be unlike the way generic files are attached to an e-mail message, i.e. MIME. Just as a Mail User Agent disposes of attachments of arbitrary MIME types by spawning an external viewer or handler, a FITS viewer or reader could act accordingly when encountering a "MIME extension" in a FITS file (I presume this should be prepended to the actual FITS HDU to which it refers).
MIME is defined by RFC 2045 (which I read) and RFC 2046, RFC 2047, RFC 4288, RFC 4289 and RFC 2049 (which I did not read).
Essentially the exchange of a MIME file implies the construction of an RFC-822-like
MIME header, followed by a blank line, followed by the file content encoded in
one of the possible encodings. There are Linux utilities like base64 to
perform BASE64 encoding, and like metamail to analyse a MIME-formatted
file and dispatch it to the appropriate viewer.
This means that MIME files could be handled mostly by existing utilities,
e.g.
The MIME header consists of "keyword-value" pairs, where the keyword name is separated from the value by a colon. Usually a MIME keyword fits in e.g. a 72-byte line, but it may be continued (wrapped) onto further lines if the first character of the next line is blank. For the purpose of FITS I do not propose to match MIME keywords to FITS keywords (the new long-name long-value keywords), but to map the entire MIME header to a sequence of standard (old style) keywords, if necessary cutting MIME header records longer than 68 characters to that length and wrapping them onto the next lines. Wrapping should occur rather seldom. Provision is made for up to 9999 MIME header lines (after wrapping).
A MIME extension could have a FITS header like this, where the red keywords are mandatory and take a single parameter, while the bluish MIME header is just an example.
| MIME extension header |
|---|
| XTENSION= 'MIME' |
| BITPIX = 8 |
| NAXIS = 1 |
| NAXIS1 = size of encoded file in bytes excluding headers |
| PCOUNT = 0 |
| GCOUNT = 1 |
| MIME0001= 'Content-Type: IMAGE/png; name=preview.png' |
| MIME0002= 'Content-Transfer-Encoding: BASE64' |
| MIME0003= 'Content-Description: an optional rather long description' |
| MIME0004= ' which wraps to the next line' |
| MIME0005= 'Content-Disposition: attachment; filename=preview.png ' |
| ... |
| END |
The data array of the HDU will be the encoded byte stream corresponding to the MIME file, padded with binary zeros to make up an integral number of 2880-byte blocks.
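The mapping just described can be sketched in a few lines: MIME header records are cut at 68 characters and continued on following lines that start with a blank, each resulting line becomes a MIMEnnnn string keyword, and the BASE64-encoded file is padded with binary zeros to a multiple of 2880 bytes. The function names are hypothetical, and the sketch assumes the MIME header contains no single quotes (which FITS would require to be doubled).

```python
import base64

BLOCK = 2880      # FITS block size in bytes
MAX_VALUE = 68    # longest string value that fits in an 80-byte card


def wrap_mime_header(lines, width=MAX_VALUE):
    """Cut records longer than `width`; continuation lines start with a blank."""
    wrapped = []
    for line in lines:
        wrapped.append(line[:width])
        rest = line[width:]
        while rest:
            wrapped.append(" " + rest[:width - 1])
            rest = rest[width - 1:]
    return wrapped


def mime_cards(lines):
    """Return 80-character MIMEnnnn cards for the given MIME header records."""
    wrapped = wrap_mime_header(lines)
    if len(wrapped) > 9999:
        raise ValueError("more than 9999 MIME header lines after wrapping")
    return [f"MIME{i:04d}= '{text}'".ljust(80)
            for i, text in enumerate(wrapped, start=1)]


def mime_data(payload):
    """Return (NAXIS1, data array): BASE64 length and the zero-padded bytes."""
    encoded = base64.b64encode(payload)
    padding = -len(encoded) % BLOCK
    return len(encoded), encoded + b"\0" * padding
```

A reader doing the inverse operation would concatenate the MIMEnnnn values, add the blank separator line and the first NAXIS1 bytes of the data array, and hand the result to an existing MIME handler.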
Alternatively one could imagine something like this instead (which in principle need not be a new standard extension, but could be just a BINTABLE with EXTNAME='MIME'):
| MIME extension header (alternative) |
|---|
| XTENSION= 'MIME' |
| BITPIX = 8 |
| NAXIS = 2 |
| NAXIS1 = m max length of a MIME header record (not wrapped) |
| NAXIS2 = number of MIME header records |
| PCOUNT = size of encoded file in bytes excluding headers |
| GCOUNT = 1 |
| TFIELDS = 1 |
| TFORM1 = 'mA' |
| ... |
| END |
The MIME header follows in the data array as a single-column BINTABLE (or ASCII table),
where each row contains a full MIME keyword (name and value).
The encoded file follows instead in the heap area, padded with binary zeros to make up
an integral number of 2880-byte blocks.
It is true that in the past observation results were distributed in flat form (originally as a sequence of files on magnetic tape, later as individual files on a CD without a subdirectory structure). Nowadays, whether they are distributed on a medium (CD or DVD) or via the network (ftp or http), files can be arranged in a subdirectory tree. However, file names are often exceedingly complex because they also encode items which could be (and often are duplicated as) names of branches in the tree; compare e.g. a name like 0079_0122901701_PNS00701IME.FIT.
For each column I provide (boldface) a short column name (in the style of "old" header keywords), a long column name, a tentative TFORM and an explanation, followed (in normal typeface) by example values.
| EXTNUM | XTTYPE | XTNAME | XTLOC | XSIZE | XTASSOC | DISPOSIT |
|---|---|---|---|---|---|---|
| ExtensionNumber | ExtensionType | ExtensionName | ExtensionLocation | ExtensionSize | ExtensionAssociation | Disposition |
| J | A | A | J (K) | J (K) | 2J | A |
| Sequence number of extension in file (0 for dataless PHDU) | XTENSION of the extension | EXTNAME of the extension | Byte offset in FAR where extension HDU starts | Size of HDU in bytes | start and end sequence number of associated HDUs | Disposition : suggested filename and path, or URL |
| 0 | 'primary' | 'PHDU' | 0 | psiz=2880 ? | -1 | '' |
| 1 | 'BINTABLE' | 'INDEX' | 2880 | isiz | -1 | '' |
| 2 | 'IMAGE' | myDataImage | psiz+isiz | a | 2:3 | ImageDir/myDataImage.fits |
| 3 | 'METADATA' | 'local metadata' | psiz+isiz+a | b | 2:3 | '' /not applicable, it is 2nd HDU in same file |
| 4 | 'BINTABLE' | myEventObs1 | psiz+isiz+a+b | c | 4:6 | EventDir/Obs/1/event.fits |
| 5 | 'BINTABLE' | GTI | psiz+isiz+a+b+c | d | 4:6 | '' /not applicable |
| 6 | 'METADATA' | 'group metadata' | psiz+isiz+a+b+c+d | e | 4:6 | '' /not applicable |
| -1 | 'external' ? | 'seeInside' | -1 | ksize | -1 | http://some.site/some/path/external.fits.file |
| -1 | 'external' ? | 'BINTABLE' | -1 | msize | -1 | http://another.site/some/other/path/another.fits#5 |
| 7 | 'BINTABLE' | myAuxData | psiz+isiz+a+b+c+d+e | f | 7:10 | auxData/auxiliary.file |
| 8 | 'METADATA' | 'local metadata' | psiz+isiz+a+b+c+d+e+f | g | 7:10 | '' /not applicable |
| 9 | 'TABLE' | Catalog | psiz+isiz+a+b+c+d+e+f+g | h | 7:10 | '' /not applicable |
| 10 | 'METADATA' | 'local metadata' | psiz+isiz+a+b+c+d+e+f+g+h | i | 7:10 | '' /not applicable |
| 11 | 'METADATA' | 'global metadata' | psiz+isiz+a+b+c+d+e+f+g+h+i | j | -1 | '' /not applicable |
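The XTLOC column in the example is simply the running sum of the sizes of the preceding HDUs (psiz, psiz+isiz, psiz+isiz+a, ...). A minimal sketch of that computation, with a hypothetical helper name and the assumption that every size is already a multiple of 2880 bytes:

```python
BLOCK = 2880  # FITS block size in bytes


def extension_offsets(sizes):
    """Byte offset of each HDU, given the HDU sizes in file order."""
    offsets = []
    position = 0
    for size in sizes:
        if size % BLOCK:
            raise ValueError(f"HDU size {size} is not a multiple of {BLOCK}")
        offsets.append(position)
        position += size
    return offsets


def to_blocks(byte_values):
    """Express byte offsets or sizes in 2880-byte FITS blocks."""
    return [value // BLOCK for value in byte_values]
```

For sizes of 2880, 5760 and 8640 bytes the offsets are 0, 2880 and 8640 bytes, i.e. blocks 0, 1 and 3.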
It is possible (as shown in the original sketch) to use a sequence of indexed keywords instead of BINTABLE columns to convey the same information as above; however, a binary table is probably handier to use.
Alternatively, the offset and size of each extension could be given in 2880-byte FITS blocks instead of bytes.
The underlying idea of a FAR is that it could be, if wished, unpacked (un-FARed) into individual SEFs or MEFs. All the HDUs with the same "extension association" shall be copied (block by block, without opening and interpreting their content) into a new FITS file, prepending a dataless HDU to them, and possibly merging the global metadata with the local or group metadata.
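The core of such an "unfar" step might look like the sketch below. It works on an in-memory copy of the FAR and on index rows given as tuples (EXTNUM, XTLOC, XSIZE, XTASSOC, DISPOSIT), with XTASSOC as a (start, end) pair or `None` where the table shows -1. The row layout, the function names and the content of the prepended dataless primary header are all assumptions made for illustration; external entries (EXTNUM = -1) are skipped and the merging of global metadata is not attempted.

```python
BLOCK = 2880  # FITS block size in bytes


def card(key, value):
    """Fixed-format 80-character card for a logical or integer value."""
    return f"{key:<8}= {value:>20}".ljust(80)


def dataless_phdu():
    """A minimal primary header with no data array, padded to one block."""
    cards = [card("SIMPLE", "T"), card("BITPIX", 8), card("NAXIS", 0),
             card("EXTEND", "T"), "END".ljust(80)]
    return "".join(cards).ljust(BLOCK).encode("ascii")


def unfar(far, index):
    """Return {suggested file name: file bytes} for each associated group."""
    groups = {}
    for extnum, xtloc, xsize, assoc, disposition in index:
        if extnum < 0 or assoc is None:
            continue  # external resource, or HDU not belonging to any group
        entry = groups.setdefault(assoc, {"name": "", "chunks": []})
        # Copy the HDU byte range as is, without interpreting its content.
        entry["chunks"].append(far[xtloc:xtloc + xsize])
        if disposition and not entry["name"]:
            entry["name"] = disposition
    return {entry["name"]: dataless_phdu() + b"".join(entry["chunks"])
            for entry in groups.values()}
```

The sketch assumes that each group carries a non-empty disposition on at least one of its rows; a real utility would also have to create the subdirectories implied by the suggested path.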
The DISPOSITION shall ordinarily suggest either a flat filename or
a relative filepath (in the Unix/Linux slash-separated notation), subsuming
some sort of subdirectory structure.
Extraordinarily, the disposition might point to an external resource, a file residing
elsewhere (in such case the byte offset shall be meaningless, while the file size
may be either irrelevant or correspond to the actual file). This could e.g. point to
updated calibration files kept in a central store.
Even if the FAR is un-FARed, the INDEX extension can be used as a sort of
directory by a data organizer program.
There are possibly some similarities in the usage with the hierarchical grouping convention which are worth a deeper analysis.
Concerning the "support to Terabyte datasets which span more than one filesystem", I do not consider that it is satisfied, nor that it should be satisfied. If a dataset is so large that it requires a special hardware arrangement, it is also likely that it needs to be replicated only in a few dedicated places inside a project collaboration, and not distributed world-wide to generic users, who might instead be interested in retrieving portions of it (image cutouts, database table subsets, etc.). Therefore this matter is outside the scope of FITS.
However, talking of "distributed datasets", one might imagine that "components"
of the dataset can be located in different sites, and do not need or are not wanted
to be replicated at each user's site (imagine for instance a calibration dataset
managed in a central site, where the latest update will always be available).
This case could be supported e.g. in a FAR if we allow that some "virtual" HDUs
in a FITS file are not physically present within the file, but are instead
pointed to by some form of URL or URI, as suggested above.
What is really intended is that a FITS file whose length is unknown cannot be transmitted until its size is known, or conversely that a FITS file being received cannot be processed or displayed until it has been received in its entirety.
The usual workaround for situations like this is the usage of a staging file (data to be transmitted are accumulated in a temporary file, and transmission starts only when the file is complete; ... potentially one can also think to store the file being received at the receiving end, and update the header when transmission is complete).
One can further specify the requirement noting that, if the dataset is an image, its sizes shall be known at the beginning, before the image is filled (even in case of a slow readout from some device, the size along the reading direction shall be known). So the streaming requirement is not applicable to the image case.
The requirement is instead applicable when size in one direction is in principle undefined and unknown until the end. This applies to tabular data where the number of rows is not known. Consider e.g. a photon list of undefined length.
Therefore the requirement becomes the possibility of transmitting a BINTABLE (or ASCII table) where NAXIS2 is not known.
In principle the user at the receiving end could immediately start working with the
table being received once each entire row has been received, keeping internally the
counts of the rows.
This is not unlike processing PostScript files sequentially, page by page, where the
Adobe DSC directives like e.g. %%Pages are declared (atend) in the
prolog.
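On the receiving side, this kind of row-by-row processing could be sketched as a generator that yields each table row as soon as all its bytes have arrived, while the caller keeps the row count that would normally be NAXIS2. The sketch assumes that the row length (NAXIS1) is known from the header and that the stream carries only table rows (no block padding); the name `iter_rows` is hypothetical.

```python
import io


def iter_rows(stream, row_len, chunk_size=4096):
    """Yield complete rows of row_len bytes as they arrive on the stream."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream; an incomplete trailing row is discarded
        buffer += chunk
        while len(buffer) >= row_len:
            yield buffer[:row_len]
            buffer = buffer[row_len:]


# Usage: count rows of an 8-byte "table" with 4-byte rows received in small chunks.
n_rows = 0
for row in iter_rows(io.BytesIO(b"abcdefgh"), row_len=4, chunk_size=3):
    n_rows += 1  # each row can be processed here, before the stream ends
```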
Of course this presupposes that the table is stored by consecutive rows, as is actually the case in FITS, and contrasts with the request in "shortcoming 4I" to store tables alternatively by columns!
So we could consider that a FITS HDU stream (which as such will not exist as a file or part of it on disk anywhere, except perhaps as an emulation for testing purposes, but only e.g. as return to an http request) could be implemented if: