Enter and edit collection data in tables

Tables are still a common tool for recording and editing data on collection objects. When no database is available, tables – most often created with Microsoft Excel – are the standard way of working with data.

Are tables still up to date? Are they merely a stopgap solution?

Quite apart from the fact that many collections simply lack alternatives – that is, “proper” databases – tables offer a number of advantages:

  • Even large amounts of object data remain easy to survey, and can be searched, filtered, and sorted, which makes cleaning and standardising the data easier.
  • Repeated values can easily be entered for many objects at once.
  • With basic office experience, tables are quick and easy to use.

Even if a dedicated database is available, it can still make sense to use tables for certain tasks, especially when preparing bulk imports.
In any case, it is better to work in tables than in text documents (e.g. MS Word), because tables capture data in discrete fields: the columns define what a given unit of information – the contents of a cell – refers to.

Thinking from the end result: what are the “actual” table data?

In this sense it is important to keep in mind that working with tables is, at its core, about the structured recording of information.

Different spreadsheet programs are in use (besides MS Excel, for example LibreOffice Calc or Google Sheets), but in the end only one thing matters: what can be used for further processing in digital systems or for software-independent archiving. This is exclusively the information that is preserved when exporting to .csv (or .tsv). A Comma-Separated Values (.csv) file is a plain text file in which the column titles are contained in the first row and the data in all subsequent rows, each value separated by commas (or tabs in the case of .tsv). All other information – formatting, data type definitions, and so on – is lost when converting to CSV. For this reason, from a data perspective it is irrelevant which spreadsheet program is used: in the end, only the CSV counts.

It is best to perform a CSV export early in the project, to become aware of what you are actually producing as a relevant result.
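
A quick look at the exported file shows exactly what has survived the export. A minimal sketch in Python, assuming the export has been saved as objects.csv (a placeholder name):

import csv
from itertools import islice

# Print the header and the first few records of an exported CSV file
# to see what the "actual" table data look like.
with open("objects.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    print("Columns:", next(reader))
    for row in islice(reader, 3):
        print(row)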

What to watch out for

A table is a table, not a text document

The basic structure of a CSV file should also shape the table in the spreadsheet program you are using: the header row defines the columns, and each row represents a single record, that is, a data object. Blank rows, inserted explanatory notes, or merged cells in individual rows obscure this structure and necessitate extensive, error-prone clean-up before converting to CSV. Formatting options – such as highlighting and freezing the header row or wrapping text within cells – can be very useful for data entry and improving readability, but they should never influence the structure of the data itself. A test export to CSV provides clarity here.
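
Such a check can also be automated to some extent. A minimal sketch in Python (objects.csv again being a placeholder file name): every data row should contain exactly as many cells as the header row, and deviations usually point to blank rows, inserted notes, or formerly merged cells.

import csv

# Report every row whose cell count differs from the header row.
with open("objects.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    expected = len(next(reader))
    for line_no, row in enumerate(reader, start=2):
        if len(row) != expected:
            print(f"Row {line_no}: {len(row)} cells instead of {expected}")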

Standards-compliant column headings

Ideally, column headings follow a consistent, data-safe system, for example: first_name, last_name, born_in, and so on. Commas and full stops should be avoided in column headings, as they can lead to misinterpretations when processing CSV data further.
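
Such a convention can also be checked mechanically. A minimal sketch in Python; the pattern used here (lowercase letters, digits, and underscores) is one possible convention, not a fixed rule:

import re

# One possible convention for data-safe column headings:
# lowercase letters, digits and underscores, starting with a letter.
SAFE_HEADING = re.compile(r"^[a-z][a-z0-9_]*$")

for heading in ["first_name", "last_name", "born_in", "Beteiligte Körperschaft"]:
    if not SAFE_HEADING.match(heading):
        print(f"Heading is not data-safe: {heading!r}")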

Drop-down values for efficiency and data quality

If identical values are to be inserted in a large number of cells, this can easily be done in tables using copy and paste or autofill (“dragging down” a cell). A much more robust solution in terms of data quality is to work with drop-down lists (controlled vocabularies). To do this, you create lists of valid values in a separate sheet; in the main data sheet, these values can then be selected from a drop-down menu. This selection can in turn be applied to many further cells via autofill. Instructions for implementing this functionality in your spreadsheet program of choice can usually be found under keywords such as “data validation”, “validation rules”, “drop-down lists” or “pick lists”.
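
Such validation rules can also be created programmatically. A minimal sketch using the Python library openpyxl as one possible tool; the file name, the column range, and the example values are placeholders:

from openpyxl import Workbook
from openpyxl.worksheet.datavalidation import DataValidation

wb = Workbook()
ws = wb.active
# Drop-down list with three allowed values; in practice the values
# would usually live in a separate sheet and be referenced by range.
dv = DataValidation(type="list",
                    formula1='"portrait,landscape,still life"',
                    allow_blank=True)
ws.add_data_validation(dv)
dv.add("C2:C1000")  # restrict column C to the listed values
wb.save("objects.xlsx")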

This approach has proven valuable in data preparation projects, especially when working with large and complex data structures. It is important to remember, however, that the link between the data field (the value in the cell) and the drop-down list exists only at the moment of data entry, when the allowed values are restricted by the list. If you later change an entry in the list (for example, replacing “portrait” with “bust”), the values already entered via the list remain unchanged; they are not automatically updated. You would therefore have to replace all instances of “portrait” in the relevant column with “bust” afterwards. Adding new entries to the list or changing the order of entries, on the other hand, is unproblematic.
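
Such a subsequent replacement is easy to script once the data have been exported. A minimal sketch in Python; the file name objects.csv and the column name genre are placeholders:

import csv

# Replace an outdated vocabulary entry in one column of the export.
with open("objects.csv", newline="", encoding="utf-8") as src, \
     open("objects_updated.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["genre"] == "portrait":
            row["genre"] = "bust"
        writer.writerow(row)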

Multiple values as delimited lists, not in multiple columns

Tables can easily represent “flat” data structures, that is, data structures in which each cell contains exactly one value:

Titel,Inventarnummer,Beteiligte Körperschaft,Bild,...
Wurtzit-Struktur,HU-KRI-S-009,Humboldt-Universität zu Berlin,image-258.jpg,...

That is:

{
  "Titel": "Wurtzit-Struktur",
  "Inventarnummer": "HU-KRI-S-009",
  "Beteiligte Körperschaft": "Humboldt-Universität zu Berlin",
  "Bild": "image-258.jpg",
  ...
}

Where several values belong to a single field, for example:

{
  "Titel": "Wurtzit-Struktur",
  "Inventarnummer": "HU-KRI-S-009",
  "Beteiligte Körperschaft": "Humboldt-Universität zu Berlin",
  "Bilder": 
    [
      "image-258-0.jpg",
      "image-258-1.jpg",
    ...
    ],
  ...
}

the flat structure no longer holds: a single cell now has to carry more than one value. In such cases, the values must be entered in the cell as a list, with each list entry separated by a consistently used separator, for example:

...,Inventarnummer,Beteiligte Körperschaft,Bilder,...
...,HU-KRI-S-009,Humboldt-Universität zu Berlin,"image-258-0.jpg, image-258-1.jpg",...

If commas are used as the separator, it must be clear that they separate list entries, not columns. A correctly configured CSV export takes care of this automatically by enclosing the entire list in quotation marks. However, if the individual list entries themselves contain commas (or any other separator that has been chosen), it is no longer possible to tell where one list item ends and the next begins. It is therefore essential to choose a separator that never appears within any list entry. When working with such lists of values, post-processing is required after CSV export in order to generate the desired data structure. As long as the separators are unambiguous, this can be handled programmatically with relatively little effort.
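
To illustrate this post-processing, a minimal sketch in Python that splits the multi-value column Bilder from the example above back into a list (objects.csv again being a placeholder file name):

import csv

# Split the multi-value "Bilder" cell on the chosen separator (here a
# comma) and strip surrounding whitespace from each list entry.
with open("objects.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        images = [v.strip() for v in row["Bilder"].split(",") if v.strip()]
        print(row["Inventarnummer"], images)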

By contrast, it is not advisable to create multiple columns “just in case” for the same type of value:

Titel,Inventarnummer,Bild1,Bild2,Bild3,...

Such tables are confusing and inflexible, and CSV data exported from them is difficult to process further.

What can go wrong?

Deceptive convenience – automatic incrementing

One major advantage when filling in tables is that values can be transferred to further rows using autofill (“dragging down” a cell). However, this comes with a pitfall: many spreadsheet programs try to be “helpful”. Starting from “Department 1”, autofill will suggest “Department 2”, “Department 3”, and so on; starting from 14.05.1986, it will suggest 15.05.1986, 16.05.1986, and so on.

If all three objects actually belong to Department 1 and were acquired on 14 May 1986, you have unintentionally created incorrect data. In all spreadsheet programs this behaviour can be avoided or corrected quite easily; the real risk is simply not noticing the error at all, so autofilled values should always be double-checked.

Unwanted data conversions

Spreadsheet programs offer convenient support for entering certain data types, which can, however, lead to unintended effects. For example, if you define a field in MS Excel as a date and set the date format to “14.03.2012”, you can enter “14/5/01” and automatically obtain the “correct” format “14.05.2001”. If you then reformat this column or the entire table as “text”, the value in that cell becomes “37025” – Excel’s internal day count for that date – and this is exactly what appears in the CSV export. The most robust solution to this problem is to work exclusively with text data and forgo such input conveniences. Otherwise, it is crucial to take great care not to change the data type of a column or cell unintentionally.
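
If such serial numbers have already ended up in an export, the original dates can usually be recovered, because Excel counts days from a fixed epoch (30 December 1899 in the default Windows date system). A minimal sketch in Python:

from datetime import date, timedelta

# Convert an Excel serial day number back to a calendar date.
EXCEL_EPOCH = date(1899, 12, 30)  # default Windows date system

print(EXCEL_EPOCH + timedelta(days=37025))  # -> 2001-05-14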

Wrong character encoding in CSV conversion

A classic nasty surprise after exporting to CSV is discovering cryptic characters in place of ä, ö, ß, and other special characters. In such cases, the wrong character encoding has been chosen. For European languages, UTF-8 is usually the best option. MS Excel in particular can be confusing here: when using “Save As”, you have to pick the right variant (“CSV UTF-8”) from several available CSV formats. A tried-and-tested approach is to open the .xlsx file in the free program LibreOffice Calc and export it from there; the configuration options are much more user-friendly. In any case, you should check immediately after export whether the correct character encoding has been used and all special characters (umlauts, etc.) are displayed correctly.
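
This check, too, can be scripted. A minimal sketch in Python that tries to read the export (placeholder name objects.csv) as UTF-8 and reports the position of the first undecodable byte:

# Verify that the exported file is valid UTF-8.
try:
    with open("objects.csv", encoding="utf-8") as f:
        f.read()
    print("File decodes cleanly as UTF-8.")
except UnicodeDecodeError as err:
    print(f"Not valid UTF-8 at byte {err.start}; "
          "the export may have used e.g. cp1252 or latin-1.")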

Confusion between separators within and between values

Another classic mishap when exporting to CSV occurs when values containing the separator character – here the comma – are not enclosed in quotation marks. In this case, the text “It is not clear, whether the exhibit ...” is interpreted as two separate values: “It is not clear” and “whether the exhibit ...”. At the same time, the number of columns in that row no longer matches the others – the CSV data are thus unusable. Most spreadsheet programs solve this problem automatically during export. Nevertheless, it is always advisable to check whether the export has handled such cases correctly.
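
For comparison, standard CSV libraries quote such values automatically. A minimal sketch in Python:

import csv, io

# The csv module encloses values that contain the separator in
# quotation marks, so the column structure stays intact.
buf = io.StringIO()
csv.writer(buf).writerow(
    ["HU-KRI-S-009", "It is not clear, whether the exhibit ..."])
print(buf.getvalue())
# -> HU-KRI-S-009,"It is not clear, whether the exhibit ..."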