File Formats and Data Types
What types of data are we talking about?
Data can mean many different things, but for management purposes it typically falls into one of four main categories. The category your data belongs to affects the choices you make throughout the rest of your data management plan.
Observational
- Captured in real-time
- Usually irreplaceable
- Examples: Sensor readings, telemetry, survey results, images
Experimental
- Data from lab equipment
- Often reproducible, but can be expensive
- Examples: gene sequences, chromatograms, magnetic field readings
Simulation
- Data generated from test models
- The model and its metadata (the inputs) are more important than the output data
- Examples: climate models, economic models
Derived or compiled
- Reproducible (but very expensive)
- Examples: text and data mining, compiled database, 3D models
These data can come in many forms: text, numerical, multimedia, models, software, discipline-specific (e.g., FITS in astronomy, CIF in chemistry), or instrument-specific.
What are the issues around file formats?
One favorite saying is that the best part about standards is that there are plenty to choose from. This holds true for file formats, so it is important to think carefully about which file format will best support long-term preservation and continued access to your data.
Consider whether the format is:
- Accessible in the future
- Non-proprietary
- An open, documented standard
- Common, used by the research community
- A standard representation (ASCII, Unicode)
- Not tied to specific software
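As a rough illustration of the "standard representation" criterion, the sketch below checks whether a file's raw bytes decode as ASCII or UTF-8 (a common Unicode encoding). The function name `detect_text_encoding` is hypothetical, and the check only confirms that the bytes are valid in one of those encodings, not that the content is meaningful or well documented.

```python
def detect_text_encoding(data: bytes):
    """Return 'ascii' or 'utf-8' if the bytes decode cleanly, else None.

    Hypothetical helper for illustration only: it tests the encoding,
    not whether the content is useful plain text.
    """
    for encoding in ("ascii", "utf-8"):
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
        return encoding
    return None
```

To check a file on disk, one option would be to pass in the result of `pathlib.Path(...).read_bytes()` for that file.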
Best Formats:
- Unencrypted
- Uncompressed
- PDF, not Word
- ASCII text (such as CSV), not Excel
- MPEG-4, not Quicktime
- TIFF or JPEG2000, not GIF or JPG
- XML or RDF, not RDBMS
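The preferences in the list above could be applied to a folder of files with a small lookup table. This is only a sketch: the function name `suggest_format` and the mapping of file extensions to formats are assumptions made for illustration, and the extension list is far from complete.

```python
from pathlib import Path

# Hypothetical mapping from common file extensions to the formats
# preferred in the list above. The extension choices are assumptions.
PREFERRED = {
    ".doc": "PDF",
    ".docx": "PDF",
    ".xls": "ASCII text",
    ".xlsx": "ASCII text",
    ".mov": "MPEG-4",
    ".gif": "TIFF or JPEG2000",
    ".jpg": "TIFF or JPEG2000",
    ".jpeg": "TIFF or JPEG2000",
}


def suggest_format(filename: str):
    """Return the suggested preservation format, or None if no suggestion applies."""
    suffix = Path(filename).suffix.lower()  # compare extensions case-insensitively
    return PREFERRED.get(suffix)
```

A return value of `None` means only that the extension is not in this small table, not that the format is suitable for preservation.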