An Open Knowledge Foundation Labs Project
This project is a community-driven effort from OKFN Labs – sign up now to get involved

This page provides an overview CSV (Comma Separated Values) format for data.

CSV is a very old, very simple and very common "standard" for (tabular) data. We say "standard" in quotes because there was never a formal standard for CSV, though in 2005 someone did put together a RFC for it.

CSV is supported by a huge number of tools from spreadsheets like Excel, OpenOffice and Google Docs to complex databases to almost all programming languages. As such it is probably the most widely supported structured data format in the world.


The Format

Key points are:

  • CSV is probably the simplest possible structured format for data
  • CSV strikes a delicate balance, remaining readable by both machines & humans
  • CSV is a two dimensional structure consisting of rows of data, each row containing multiple cells. Rows are (usually) separated by line terminators so each row corresponds to one line. Cells within a row are separated by commans (hence the C(ommmas) part)
    • Note that strictly we're really talking about DSV files in that we can allow 'delimiters' between cells other than a comma. However, many people and many programs still call such data CSV (since comma is so common as the delimiter)
  • CSV is a "text-based" format, i.e. a CSV file is a text file. This makes it amenable for processing with all kinds of text-oriented tools (from text editors to unix tools like sed, grep etc)

What a CSV looks like

If you open up a CSV file in a text editor it would look something like:

A,B,C
1,2,3
4,"5,3",6

Here there are 3 rows each of 3 columns. Notice how the second column in the last line is "quoted" because the content of that value actually contains a "," character. Without the quotes this character would be interpreted as a column separator. To avoid this confusion we put quotes around the whole value. The result is that we have 3 rows each of 3 columns (Note a CSV file does not have to have the same number of columns in each row).

Dialects of CSVs

As mentioned above, CSV files can have quite a bit of variation in structure. Key options are:

  • Delimiter: rather than ',' can be ';', '\t' (tab), '|' etc
  • Record / line terminator: is '\n' (unix), '\n\r' (dos) or something else ...
  • How do you quote records that contain your delimiter

You can read more in the CSV Dialect Description Format which all defines a small JSON-oriented structure for specifying what options a CSV uses.

What is Bad about CSV?

  • CSV lacks any way to specify type information: that is, there is no way to distinguish "1" the string from 1 the number. (This shortcoming can be addressed by adding some form of simple schema. See e.g. Simple Data Format for an approach whereby CSV is combined with a schema or you can even inline schema information - see Linked CSV for an example of this approach).
  • No support for relationships between different "tables" (this is similar to the previous point - again see Simple Data Format for a way to address this with additional schema information) information)
  • Works best for tabular data - not good for data with nesting or where structure is not especially tabular (though remember most data can be put into tabular form if you try hard enough!)

Specifications and overviews:


Tools

The great thing about CSV is the huge level of tool support. The following is not intended to be comprehensive but is more at the electic end of the spectrum.

Desktop

All spreadsheet programmes including Excel, OpenOffice, Google Docs Spreadsheets supporting opening, editing and saving CSVs.

View a CSV file in your Browser

You can view a CSV file (saving you the hassle of downloading it and opening it). Options include:

Unix Command Line Manipulation

See

Power Tools

  • OpenRefine is a powerful tool for editing and manipulating data and works very well with CSV
  • Data Explorer supports importing CSVs and manipulating and changing them using javascript in the browser

Libraries

This is heavily biased towards python!

Python

  • Built in csv library is good
  • The wonderful csvkit (python)
  • messytables (python) - convert lots of badly structured data into CSV (or other formats)

Node

Nothing in standard lib yet and best option seems to be:


Tips and Tricks

CSVs and Git

Get git to handle CSV diffs in a sensible way (very useful if you are using git or another version control system to store data).

Make these changes to config files:

# ~/.config/git/attributes
*.csv diff=csv

# ~/.gitconfig
[diff "csv"]
  wordRegex = [^,\n]+[,\n]|[,]

Then do:

git diff --word-diff
# make it even nicer
git diff --word-diff --color-words

Credit for these fixups to contributors on this question on StackExchange and to James Smith.