How to define a file format

Specifying a file format isn't just about deciding the structure (CSV, JSON, XML, etc.) and the fields that should be present in the file. Having recently been through this in a collaborative exercise with the UK's out-of-home (OOH) advertising industry, I've collated a checklist of things to consider.

One of the key lessons that came out of my recent work is that there should be no room for ambiguity. Be explicit in your specification. If the filename to be used does not matter to your solution, then say that. Don't just fail to mention filenames.

Checklist

Here's my checklist. I'd welcome feedback on other factors which I've not included.

  • The file format (e.g. JSON, CSV, XML, TSV, etc.)
  • The filename to be used. Is there a specific format? Should it be unique (and if so, over what period of time)? What filename extension should be used? Is it case sensitive?
  • Character encoding (e.g. UTF-8)
  • Line ending format (CR, LF, CRLF). Must the final line of the file end with a line ending?
  • Support of empty lines, white space and comments.
  • Support for headers (e.g. in a CSV file, a line at the top which describes the fields).
  • Need for delimiters (for example, must text field values be enclosed in quote marks?)
  • Method of escaping (for example, how do you escape a quote mark which appears within a quote-delimited value?)
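
As an illustration of being explicit about these file-level choices, here is a minimal sketch using Python's standard csv module. The specific choices (UTF-8, CRLF line endings, a header row, quote-doubling as the escape method) and the field names are assumptions made for the example, not recommendations.

```python
import csv
import io

# Hypothetical file-level choices, each stated rather than left to defaults.
ENCODING = "utf-8"
LINE_ENDING = "\r\n"   # CRLF, written after every record including the last
HAS_HEADER = True


def write_records(rows, header):
    """Serialise rows to bytes following the choices above."""
    buffer = io.StringIO(newline="")  # no newline translation on write
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        quoting=csv.QUOTE_NONNUMERIC,  # quote every non-numeric value
        doublequote=True,              # escape a quote mark by doubling it
        lineterminator=LINE_ENDING,
    )
    if HAS_HEADER:
        writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode(ENCODING)


data = write_records(
    [["A1", 'The "Big" Screen', 3]],
    ["site_id", "site_name", "panel_count"],
)
```

Every parameter passed to the writer corresponds to a question in the checklist, so a reader of the specification never has to guess what a library default happens to be.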

Fields and field structure

We then come to the point of defining the fields and the field structure. Depending on your file format, structure may be limited (e.g. CSV). But a rich format like JSON allows for nesting: objects hold unordered key-value pairs, and arrays hold ordered lists of values.

Here's another checklist of things to consider when defining your fields:

  • The field name. This matters most when it constitutes the 'key' of a key-value pair or will be specified in a header row, but a field name is essential even if it will not appear in the file, simply as a communication aid. Use specific and descriptive field names. Don't just say "start" and "end"; explain what it is the start and end of. Don't use "ID", but rather be more specific, e.g. order_id.
  • The type of the field (numeric, timestamp, text, etc.). If it is an integer, how many bits? Signed or unsigned? If it is a decimal, how many decimal places? Are they mandatory? In a timestamp, is the timezone mandatory? Are you following a standard like ISO 8601? Is a text field case-sensitive or not?
  • Field length - especially pertinent to text fields
  • Value limitations. For example, is punctuation allowed? Is there a minimum/maximum value?
  • A description. This is essential to a clear file format specification because it removes misunderstanding. Don't rely on a field name to tell the reader everything they need to know. If there is a field called "buyer_name", the description should explain whether this is the individual, the company, or both.
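
One way to keep these field properties together is to record them in a small structure that can also check values. The sketch below is only an illustration: FieldSpec, its attributes, and the two example fields are hypothetical names invented here, and it handles just text and integer types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One field of a hypothetical file format specification."""
    name: str          # specific, e.g. "order_id" rather than "ID"
    type: str          # "text" or "integer" in this sketch
    description: str   # what the value means, in plain words
    max_length: Optional[int] = None  # applies to text fields
    min_value: Optional[int] = None   # applies to integer fields
    max_value: Optional[int] = None   # applies to integer fields

    def problems(self, raw: str) -> List[str]:
        """Return the rule violations found in one raw value."""
        found = []
        if self.type == "integer":
            try:
                number = int(raw)
            except ValueError:
                return [f"{self.name}: not an integer"]
            if self.min_value is not None and number < self.min_value:
                found.append(f"{self.name}: below minimum {self.min_value}")
            if self.max_value is not None and number > self.max_value:
                found.append(f"{self.name}: above maximum {self.max_value}")
        elif self.max_length is not None and len(raw) > self.max_length:
            found.append(f"{self.name}: longer than {self.max_length} characters")
        return found


buyer_name = FieldSpec(
    name="buyer_name",
    type="text",
    description="Legal name of the buying company, not the individual contact.",
    max_length=20,
)
panel_count = FieldSpec(
    name="panel_count",
    type="integer",
    description="Number of advertising panels booked in this order line.",
    min_value=1,
    max_value=500,
)
```

For example, panel_count.problems("0") reports that the value is below the minimum, while buyer_name.problems("Acme Ltd") returns an empty list.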

Mandatory or optional?

Consideration must be given to whether a field is mandatory or optional. For some specifications, this might be as simple as noting which fields are mandatory. But there is generally more to consider:

  • If a field is optional, can it be (should it be?) present but left blank? Is there a distinction between "I am deliberately omitting this optional field" and "The value of this field is blank"?
  • Does the value of one field influence whether another field is mandatory or not?
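
To make both questions concrete, here is a small sketch of how a validator might treat them. The field names (order_id, discount_code, delivery_method, delivery_address) and the rules themselves are hypothetical, chosen only to show the distinction between omitted and blank, and a conditionally mandatory field.

```python
def discount_state(record):
    """Distinguish an omitted optional field from one that is present but blank."""
    if "discount_code" not in record:
        return "omitted"   # the sender deliberately left the field out
    if record["discount_code"] == "":
        return "blank"     # the field is present and its value is empty
    return "supplied"


def check_record(record):
    """Return problems for one record, given as a dict of field name to value."""
    problems = []
    # Mandatory: the field must be present and must not be blank.
    if not record.get("order_id"):
        problems.append("order_id is mandatory")
    # Conditional: delivery_address is mandatory only when delivery_method is "post".
    if record.get("delivery_method") == "post" and not record.get("delivery_address"):
        problems.append("delivery_address is mandatory when delivery_method is 'post'")
    return problems
```

Whether "omitted" and "blank" should mean different things is exactly the kind of decision the specification needs to state; the code only shows that the two cases can be told apart.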

Finally, a picture paints a thousand words. For the reader, examples of files will often clear up any remaining uncertainty from the specification.

Do you put anything else in your file format specifications? Let me know in the comments.
