Humanities Data Management

Introduction

All research involves working with data, whether as objects of study and analysis, or as products of the research process. Data are the evidence you collect or create that enable you to answer your research question.

In Digital Humanities (DH) research, data often also constitute valuable outputs in the form of digital resources, such as:

  • encoded texts
  • databases
  • images of artefacts
  • digital collections.

Research data management (RDM) encompasses the collection, management, preservation and sharing of research data. It includes planning for data management at the research development and grant application stage, managing data on a day-to-day basis during the project, and preserving and sharing data on completion of the research and publication of findings.

RDM is good research practice. It also helps you adhere to the University’s Research Data Management policy and any funder’s policies on data preservation and sharing.

This guide provides a brief introduction to the essentials of data management for DH research. It covers the key areas you will need to think about when you are planning a research project. We encourage you to use it in preparation for further discussion.

The Research Data Management website provides more detailed information about the topics in this guide. There you can find information about getting in touch with the Research Data Service for a data management consultation. It also contains links to publications and other resources relating to data management for DH research.

The Digital Humanities Hub can provide support for research planning, data modelling, choice of appropriate software, and specification of technical requirements. Contact the Hub at digitalhumanities@reading.ac.uk.

What are Digital Humanities data?

In the broadest definition, data are simply any information collected or created in order to answer a research question, and may include the primary objects of study, such as books, paintings, documentary sources and secondary literature.

DH data are not the objects in themselves, but a digital representation of those objects, or of structured information related to them, which is created and analysed by computational methods.

In DH research, there is significant overlap between the ‘raw data’ (e.g. texts), the metadata that enable them to be read and processed by computers, and the software or computer code by means of which they are generated, processed, rendered and analysed.

It is important to record and disclose your data collection process, both for yourself and for transparent and open scholarship. Digital objects often appear more objective than traditional outputs, when in fact your choice of which elements to record about an object strongly influences how the information is ‘read’ by users. Transparency is an important principle in the DH and Open Research communities.

But do I have data?

Yes! Any of the following count as data:

  • Excel spreadsheet
  • Text files, including transcriptions
  • List of quotations from texts relevant to your topic
  • List of archive/library catalogue numbers that you need to consult
  • Images of documents in the archives that you took on your phone
  • Survey responses
  • Sound files, e.g. music, recordings of interviews and oral history

In the digital age, creation of these forms of data does not in itself mean you are engaging in digital research. You must treat your data in a certain way in order to use digital technologies on them, and this is where data management comes in.

Digital approaches to your data

Any digital representation is necessarily partial, and the parts chosen to be represented depend on their function in answering the research question. Conversely, the elements you choose to record about an object will govern what questions a computer can ask about it.

For example: texts might be marked up in different ways to isolate different elements, concepts or themes; a database might define and bring into relation specific units of information that have been collected and organised for the purpose of answering a given question.

In DH research, data are often implemented in a digital resource that enables human and computer interaction. As well as your raw data, you will have to think about the tools for engaging with them: the metadata, which make them readable by the computer, and the interface, which presents your data in meaningful ways to human users.

For example, this could be:

  • an online query interface to a database,
  • a visualisation tool,
  • an Application Programming Interface (API) that allows users to retrieve information from a server: for example, the IIIF (International Image Interoperability Framework) APIs, which allow a compliant viewer to display and compare high-quality images from libraries around the world that have digitised their materials in the IIIF format.
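
As a rough sketch of what ‘consuming’ such an API can look like, the Python snippet below requests a IIIF Presentation manifest and reads some basic information from it. The manifest URL is a placeholder, and the requests library is assumed to be available; the field names follow the IIIF Presentation conventions but are not a substitute for the specification.

    # Sketch only: read basic information from a IIIF Presentation manifest (JSON).
    # The manifest URL is a placeholder, not a real resource.
    import requests

    manifest_url = "https://example.org/iiif/manuscript-42/manifest.json"
    manifest = requests.get(manifest_url, timeout=30).json()

    print(manifest.get("label"))                                     # title of the object, as published
    items = manifest.get("items") or manifest.get("sequences", [])   # v3 uses "items", v2 "sequences"
    print(len(items), "top-level items in the manifest")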

DH data management in action

Here are some examples of DH data creation – in the course of this guide, you will learn more about the data management tools that supported these DH projects, and ways to think about them in relation to your own project:

  • Transcribe Bentham: A project to digitise the unpublished manuscripts of Jeremy Bentham, which recruited members of the public to transcribe and encode them in Text Encoding Initiative (TEI)-compliant XML to create a searchable online collection.
  • Medieval and Tudor Ships: A team of historians uses customs accounts, navy payrolls and government ship surveys to create a georeferenced database of medieval merchant ships and the voyages they undertook, as well as a map of ports and their associated ships and voyages.
  • CLiC Dickens: Corpus linguists create a set of 19th-century fiction corpora which are analysed for semantic and stylistic features using an application programming interface (API) and the Python and R programming languages to generate cluster frequency tables. See Mahlberg, M. et al. (2019), ‘Speech-bundles in the 19th-century English novel’, Language and Literature 28(4): 326-353 (DOI).
  • Codex Zacynthius: A team of biblical scholars captured multi-spectral images of ancient palimpsests containing religious texts that cannot be fully read by the human eye. The images were used to create an online edition, comprising images of the undertext, tagged using the International Image Interoperability Framework (IIIF) standard, and a translation and transcription marked up in TEI XML.

Principles for working with data

Here are the key baselines for how you should be approaching data collection and management, as well as some keywords that will be used throughout this guide.

FAIRness

The FAIR Data Principles provide guidelines for the effective and sustainable management of research data.

FAIR stands for the characteristics of well-managed data:

  • Findable,
  • Accessible,
  • Interoperable and
  • Re-usable.

The Principles are mainly focused on data preservation, sharing and re-use, but the ultimate FAIRness of data can be determined by many factors, including decisions made before data are collected, the management and curation of data during a project, and the choice of solution for long-term maintenance of data after the project concludes.

Metadata

Metadata (‘data about data’) are the recorded information by which data are represented, discovered, accessed and interpreted. Metadata also enable data objects to remain viable in the long-term, for example by means of file checksums that can be validated at intervals as part of a preservation workflow. Metadata are integral to data FAIRness.
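
As a minimal illustration of the checksum idea (a sketch only, assuming Python; the file name is hypothetical), a checksum can be generated when a file is deposited and re-validated later as part of a preservation workflow:

    # Sketch: generate and later re-validate a SHA-256 checksum for a data file.
    import hashlib

    def sha256_checksum(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    recorded = sha256_checksum("transcript_001.xml")   # stored in the metadata record at deposit
    # ... at a later validation interval ...
    assert sha256_checksum("transcript_001.xml") == recorded, "file has changed or become corrupted"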

Machine-actionable

Both metadata and data need to be machine-actionable. This is the capacity of computational systems to find, access, and interoperate with data with no or minimal human intervention.

For example, an http:// URL links to a web page because it conforms to the hypertext transfer protocol for web addresses; an XML document encodes instructions that a computer program can parse to render, search and analyse marked-up content.

Data repositories

A data repository is a service that exists to preserve and provide access to data. Among other functions, a repository publishes machine-actionable metadata describing a data object so that it can be indexed by discovery services, and assessed for relevance and validity by a prospective user.

Digital Object Identifier (DOI)

A Digital Object Identifier (DOI) is usually included in the metadata stored in a data repository.

It is a persistent, unique identifier, by means of which the data object can be cited and retrieved, even if its location changes.

The metadata record may also include information about the terms under which the data object can be used.

Standards

Standards are ready-made schemas for the description of data. Data types and models, file formats and metadata all depend on the choice of which standards to use.

Standards exist for many different types of data in different disciplines, and they are integral to the interoperability and re-usability of data.

It is easier to discover data if the metadata conforms to existing standards and consistent rules. Standard, open file formats are easier for computers to process for analysis.

The use of standards also helps sustain the data as viable, re-usable objects in the long term. There is more about standards in the section below.

Data Management Planning

Data management is integral to the research process and needs to be planned for from the outset. How you acquire, store, organise, describe, preserve and provide access to the data you collect and the digital resources you create will determine their quality, accessibility, usability and sustainability.

A Data Management Plan (DMP) is a practical tool that will enable you and your team to be organised and work efficiently, to identify data management requirements and manage risks, and to apply appropriate solutions.

You may be familiar with DMPs from having to submit one as a requirement for grant applications, but it is worth doing one even when it is not required. It is especially useful for professional services colleagues who may be supporting your project.

You can use the DMP to factor in staff and other resource requirements for data management activities, and ensure that roles and responsibilities are clearly defined and allocated. A DMP can help you record relevant metadata, maintain the quality of your data, and preserve and share data according to the FAIR Principles.

Some aspects of your project covered by the DMP

  • Funding you might need for storage, computing resource, use of hardware or software, procurement of technical expertise, hosting and support of web resources, and long-term preservation
  • Making sure that your work is in line with the University’s and funder’s policies on preservation and sharing
  • Taking account of intellectual property rights in your source materials, and how these determine what you can do with your outputs
  • What you are going to do with any confidential data (e.g. if you are studying unpublished materials containing details of living individuals, which would fall under data protection laws)

Developing your DMP

A DMP is best created using a template or checklist, which will ensure you cover everything that needs to be considered. The RDM website provides guidance on writing a DMP, including templates you can use.

You should start developing the DMP when you are designing the project. Some funders, such as AHRC, the British Academy, and Horizon Europe, require different types of DMP in the application. You can find more information about specific funders’ requirements on the RDM website.

If you have to include a DMP as part of a grant application, your Research Development Manager will refer the application to the Research Data Manager, Robert Darby, who must review it prior to submission. Robert can help you develop your DMP and specify and cost any data management requirements, so we encourage you to get in touch with him independently, as early as possible in the writing process.

Once you have been awarded funding, you will need to create a more detailed practical plan for the project. Your DMP created as part of the funding application is a starting point, but your team will need to refine it in more detail. We recommend planning tools that can be used for this purpose. This DMP should be a living document, which is kept under review and updated as you progress through the project. You can make the DMP a standing item in project meetings, and schedule updates as part of your project’s annual progress reviews.

Key points

  • A Data Management Plan (DMP) enables you to keep track of your data over the course of the project, and keep yourself accountable to any institutional or funder requirements
  • Start developing the DMP as early as possible in the process of designing your project
  • Some aspects of the DMP may require funding and so these must be costed with assistance from your Research Development Manager and the Research Data Manager
  • Your project team and any colleagues supporting your project will need a more detailed DMP than any funder requires

Data types and data models

During research planning you will need to define your data types and data models.

Data types can be described in terms of the media you will use, such as text, images, sound and multimedia. You can also define your data based on their degree of structure (see below).

A ‘data model’ is simply your approach to organising the data. This needs to be recorded because it governs what questions you will be able to ask and what sorts of answers a computer will produce. This is essential for other scholars to be able to understand your method and see if they can arrive at the same conclusions.

Structured and unstructured data

Unstructured data do not explicitly define units of information or meaning, or the relationships between them. To make unstructured data readable and actionable by a computer, we have to represent them within a structured or semi-structured data model.

Unstructured data: source data or the raw data obtained from them, for example a transcription of a document as plain text.

Structured data: data tables, databases, or bibliographies. In these models, individual items have identifiers and their relationships to other items are defined. Quantitative and factual information is best suited to structured data models, because it is uniform (each record contains the same set of fields) and discrete (made up of distinct individual values, for example the ages and heights of several people, as opposed to continuous data, such as how one person’s height changes as they grow older).
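
As a simple illustration (a sketch with invented values), the same information can be held as an unstructured transcription, where the computer sees only a string of characters, or as structured records in which each unit of information is explicitly named and typed:

    # Unstructured: a plain-text transcription; the computer sees only a string.
    transcription = "John Smith, mariner of Bristol, aged 42, master of the Trinity."

    # Structured: the same information as named fields with defined value types.
    records = [
        {"name": "John Smith", "occupation": "mariner", "place": "Bristol",
         "age": 42, "ship": "Trinity"},
    ]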

Semi-structured data:

  • XML documents, which facilitate the presentation and analysis of textual materials by imposing a degree of structure on plain text via mark-up, while leaving its application more open to interpretation than in structured data. XML uses a hierarchical data model, with data expressed in a tree structure (book > chapter > paragraph). Standard mark-up practices with large user communities have developed: the Text Encoding Initiative (TEI) guidelines are a widely used XML standard for the mark-up of humanities texts (see the sketch after this list).
  • Multi-relational or graph-based, as in a network relationship model or the Resource Description Framework (RDF) standard for web-based data interchange. This type of model can describe both qualitative and quantitative relations between entities and can be used to build highly complex relational graphs. RDF has been used in the humanities to describe cultural heritage data and to create Linked Open Data (LOD).
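
To give a flavour of how mark-up makes text machine-actionable, the sketch below parses a tiny TEI-style fragment with Python’s built-in XML parser and extracts the tagged names. The fragment is invented and greatly simplified; it is not a complete or valid TEI document.

    # Sketch: extract tagged elements from a simplified, TEI-style XML fragment.
    import xml.etree.ElementTree as ET

    fragment = """
    <p>On 5 November 1605 <persName>Guy Fawkes</persName> was found beneath
    the <placeName>Palace of Westminster</placeName>.</p>
    """

    root = ET.fromstring(fragment)
    names = [el.text for el in root.iter("persName")]
    places = [el.text for el in root.iter("placeName")]
    print(names, places)   # ['Guy Fawkes'] ['Palace of Westminster']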

Processing your data

  • What hardware and/or software will you use to create and process the data? This might include specialist photographic equipment, software (e.g. text editors, network analysis applications), Virtual Machines
  • If there are different tools that can be used for similar purposes, which is most suited to your requirements?
  • Do you or your project team have the skills required? If not, can you acquire them by training or will you need to outsource or incorporate another project team member?
  • What is the scale or quantity of your data? For example, how many database records will you create, how many images or transcriptions will a digitisation project produce, and what will be the quality and size of the files? It is important to have an idea of this so that you can factor it into your schedule and budget.

The Digital Humanities Hub can help you answer these questions.

File Formats

A file format is an information encoding standard. A great variety of formats has evolved to encode text, numbers, sound, images, and multimedia. The encoding instructs software on how to read and action the file. Some formats you probably know well are .docx (Microsoft Word document) or .jpg (JPEG image).

Data may exist in different formats throughout the lifetime of a project, and you will need to consider optimal formats for processing, analysis and long-term preservation and use. The choice of format will depend on its purpose.

For example:

  • Bibliographies can be rendered as a text file (unstructured), an XML file (semi-structured), or a CSV table (structured).
  • Manuscript images might be captured as a high-resolution uncompressed TIFF file for analysis and long-term preservation, and a smaller low-resolution JPEG for web display.
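
As an illustration of the second case, a smaller access copy can be derived from a preservation master using the Pillow imaging library (an assumption: any image-processing tool would do, and the file names are hypothetical):

    # Sketch: create a smaller JPEG access copy from an uncompressed TIFF master.
    from PIL import Image

    with Image.open("ms_001_f001r_master.tif") as master:
        access_copy = master.copy()
        access_copy.thumbnail((1200, 1200))                 # shrink in place, keeping aspect ratio
        access_copy.convert("RGB").save("ms_001_f001r_web.jpg", quality=85)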

Proprietary vs open file formats

Consider also whether the format is proprietary or open. A proprietary format is owned and maintained by a software provider, such as Microsoft, and some aspects of the file may not display properly in other software. An open format has a publicly documented specification and can be read and rendered by a wide range of software.

Microsoft Excel spreadsheet format (.xls) vs CSV (comma-separated values, .csv)

CSV is an open format that stores values as plain text separated by commas, so that any software opening the file can render it as a basic table of rows and columns. CSV files can be opened and read in a text editor, such as Notepad, or in a spreadsheet application, such as Excel, Numbers, or OpenOffice.
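
A minimal sketch of how easily an open format can be processed, using only Python’s standard csv module (the file name and columns are invented):

    # Sketch: write and re-read a small table as CSV, an open, software-neutral format.
    import csv

    rows = [
        {"shelfmark": "MS 101", "date": "1475", "folios": 120},
        {"shelfmark": "MS 102", "date": "1502", "folios": 88},
    ]

    with open("manuscripts.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["shelfmark", "date", "folios"])
        writer.writeheader()
        writer.writerows(rows)

    with open("manuscripts.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            print(row["shelfmark"], row["folios"])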

Excel builds on the basic tabular format with complex encoding for formatting, display and manipulation. Actions in Excel, from changing the font or display colour to using macros and pivot tables, change the encoding of the file in ways that cannot always be read by other software.

For long-term accessibility and preservation, we encourage you to use open file formats, for a number of reasons:

  • Data created with proprietary software may not be readable/usable by users without that software (for example, collaborators who are independent scholars and don’t have an institutional licence to Excel)
  • Proprietary formats are liable to obsolescence because they are dependent on a commercial organisation, and not maintained by a community or standards organisation. This is especially dangerous when software companies update their products without backwards compatibility. You may need access to older machines in order to read older file formats.

Metadata

Metadata, which means simply ‘data about data’, are the information that enable data to be used.

If the data are the things that you, the researcher, are analysing (for example, a text), the metadata are what the computer is reading: they tell it that this is a text, and how to respond to it. Metadata are a more complex version of the drop-down menus in Excel that ask you to classify a cell as free text, a number, a date, and so on.

Metadata works best when terminology is consistent – for example, naming conventions and standardised spelling enable accurate search and retrieval. Without correct metadata, a computer would think ‘Guy Fawkes’ and ‘Guido Fawkes’ were different people. These sorts of things need to be taken into account during the planning and data modelling stage.
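
One simple approach (a sketch only, with invented entries) is to keep an authority list that maps variant spellings to a preferred form before data are indexed or searched:

    # Sketch: normalise variant name spellings against a small authority list.
    AUTHORITY = {
        "guido fawkes": "Guy Fawkes",
        "guy fawkes": "Guy Fawkes",
    }

    def normalise(name: str) -> str:
        return AUTHORITY.get(name.strip().lower(), name)

    print(normalise("Guido Fawkes"))   # -> "Guy Fawkes"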

Metadata might exist at different levels depending on the scale of the data they describe.

Levels of metadata

Individual data files need metadata to be interpreted, validated, and processed.

Database metadata might include a logical representation of the data model, and a description of the data tables providing information about the component fields and the attributes or values they can take.

XML documents and IIIF-compliant images both use metadata to enable machine-processing of the objects they describe.

Collections of objects are assigned metadata when they are accessioned into a preservation and discovery service, such as a data repository or an archive. The metadata record represents an object. It contains fields that give the computer enough information so that the record can be discovered, located, retrieved, interpreted and cited. Library catalogues or a bibliographic discovery service such as Google Scholar or Scopus function in this way.

Essential metadata properties

Datasets are usually placed in a repository and assigned a Digital Object Identifier (DOI). Many repositories are designed especially for data in a particular field, and come with their own schemas to make their holdings consistent.

One such schema is the DataCite Metadata Schema. DataCite is an international DOI registration agency; DOIs give resources a persistent, resolvable location on the web. If a dataset is assigned a DOI through DataCite, its metadata are registered in the DataCite Metadata Store.

There are six mandatory properties for dataset metadata records, which together enable an accurate citation to be generated (see the sketch after these lists):

  • Creator
  • Title
  • Publisher
  • Publication Year
  • Resource Type
  • Identifier

Other optional metadata that may be registered with DataCite include:

  • Description of the dataset
  • Temporal and geographical markers
  • Information about rights-holders in the dataset
  • The licence terms under which it is published
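
Expressed as a simple JSON-style record (an illustrative sketch only, not the official DataCite serialisation; all values are invented), a minimal dataset record combining the mandatory properties above with an optional licence field might look like this:

    # Sketch: a minimal, DataCite-style metadata record serialised as JSON.
    import json

    record = {
        "identifier": "10.1234/example-doi",                      # hypothetical DOI
        "creator": "Jane Researcher",
        "title": "Georeferenced database of medieval merchant ships",
        "publisher": "University of Reading",
        "publicationYear": 2024,
        "resourceType": "Dataset",
        "rights": "Creative Commons Attribution 4.0",             # optional licence information
    }

    print(json.dumps(record, indent=2))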

Metadata standards

In this sense, ‘standards’ are pre-existing sets of categories for metadata, developed by researchers and communities for specific subjects.

For most projects, you will not need to design your own metadata categories unless that is part of a research question. You can use these pre-existing sets. This makes your data interoperable with other scholars’, and future projects might be able to integrate metadata from multiple previous projects.

The Dublin Core Metadata Element Set is a widely adopted set of generic metadata elements for describing resources; the DataCite Metadata Schema builds on a similar core of elements. These metadata can be expressed in different formats, including XML and JSON, enabling other services to process them.

There are also disciplinary metadata standards for specific types of data. These are often used by data repositories and data centres in the relevant fields. For example, the Data Documentation Initiative (DDI) is an international standard for describing social science data and research objects (see the Standards section below).

For more information about metadata standards, visit the Metadata Standards Directory, a community-maintained directory hosted by the Research Data Alliance.

Key points

  • Metadata works best when you use consistent terminology, such as naming conventions and standardised spelling
  • Metadata for a collection or bibliography enables the computer to retrieve the items you need
  • The minimum metadata fields for data in a repository are Creator, Title, Publisher, Publication Year, Resource Type and Identifier
  • Many scholarly disciplines have their own metadata standards so you don’t have to develop a schema from scratch.

Standards

Standards provide the often invisible scaffolding on which the interoperability and re-usability of digital information are built.

Some examples

  • The Text Encoding Initiative (TEI) maintains an XML format for the semantic and presentational markup of humanities texts. This has enabled the creation of large corpora of texts which can be searched and analysed in multiple ways, such as Early English Books Online, and the Beckett Digital Manuscript Archive.
  • The International Image Interoperability Framework (IIIF) defines several application programming interfaces (APIs) that provide a standardised method of describing and delivering images and image collections over the internet. If images are exposed via IIIF-compliant APIs, any IIIF-compliant viewer or application can consume and display both the images and their structural and presentation metadata. This can facilitate a number of rich interactions, for example, the aggregation of images held in separate web locations, page navigation, deep zooming and image annotation.
  • The Resource Description Framework (RDF) is a standard data model for information interchange via the web. It is a very powerful tool for encoding ‘resources’ created in research – including documents, physical objects, people and places, and concepts – and expressing relationships between them. It is built on existing web standards (such as URIs and XML) and integrates data in separate locations so that they can be processed together. RDF data can be queried using the SPARQL query language, which enables users to conduct complex searches. An example use of RDF is Pleiades, a community-built gazetteer and graph of ancient places. It publishes authoritative information about ancient places, providing services for discovering, displaying, and re-using the information under open licence. Because the information is published in RDF, it is machine-readable and can be processed for analysis or to create visualisations (see the sketch after this list).
  • The Data Documentation Initiative (DDI) is an international metadata standard for describing the data produced by surveys and other observational methods in the social, behavioural, economic, and health sciences. DDI can be used to create catalogue records for data objects, but also more detailed subject-specific metadata, including codebooks, questionnaires and question banks. The UK Data Service uses DDI to structure records for its dataset holdings.
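
The sketch below uses the rdflib Python library (an assumption; any RDF toolkit would serve) to state two facts about a hypothetical place and query them with SPARQL. The namespace and property URIs are placeholders, not an established vocabulary.

    # Sketch: build a tiny RDF graph and query it with SPARQL using rdflib.
    from rdflib import Graph, Literal, Namespace, URIRef

    EX = Namespace("https://example.org/places/")                  # hypothetical namespace
    g = Graph()
    g.add((EX["antioch"], URIRef("https://example.org/vocab/name"), Literal("Antioch")))
    g.add((EX["antioch"], URIRef("https://example.org/vocab/region"), Literal("Syria")))

    results = g.query("""
        SELECT ?name WHERE {
            ?place <https://example.org/vocab/name> ?name ;
                   <https://example.org/vocab/region> "Syria" .
        }
    """)
    for row in results:
        print(row.name)    # -> Antioch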

Computer code

You or a collaborator might write code to generate, process, analyse and validate your research data. This counts as a research output, and falls under the University’s Research Data Management Policy. We recommend that you:

  • apply data management principles to the code
  • follow best practice in management, preservation and sharing of your code
  • manage it using a code repository platform with version control
  • preserve and share it wherever possible

Code repository platforms

Code repository platforms can provide version control, as well as code review, bug tracking, documentation, user support and other features.

Repositories can be either private or public, so that code can be maintained during a closed development phase and released for open use at an appropriate stage.

The University provides a GitLab code repository service.

Other popular platforms that can be used are GitHub and Bitbucket.

Further resources

If you need help with research programming, contact the Hub to discuss your requirements.

Data storage

Use of appropriate storage and backup services for data during a project will provide information security, protect data against corruption and loss, and ensure they can be accessed as required.

The University provides storage services that will meet the needs of almost any Digital Humanities project. Here are some advantages of using these services as your primary storage and sharing solution:

  • Warranted data security
  • Replication in separate data centres
  • Automated backup and file recovery
  • Maintenance and technical support

Avoid using the hard drives of personal computing devices, external storage media or personal cloud service accounts as your primary storage. These are not managed or backed up by the University, and using them leaves your data vulnerable to damage or loss.

In some cases, using third-party services such as personal cloud accounts may even breach the University’s Information Compliance policies and data protection law, for example if you are processing personal information. For the avoidance of doubt, stick to University services.

If you are hosting data or resources on a website, don’t rely on the webserver alone for storage and access to the data. You may lose access to the website: the web host may update the platform, locking you out of your site if you didn’t enable the new plugins in time, or you may simply forget your password.

Ensure that up-to-date master copies of your files are held in your primary storage location. If data are being updated via the website on an ongoing basis, schedule regular downloads to your primary storage, and identify these by date.

Key points:

  • Choose a primary storage solution, ideally provided by the University
  • Make sure all your data are regularly backed up to it, not held for interim periods on an external hard drive or webserver

How much storage?

University OneDrive or Teams account: This should be sufficient for storage and secure sharing of small amounts of data with colleagues inside and outside the University. It will cover most text-based research and projects collecting modest quantities of image, audio and video files.

Collaborative fileshare on the local network (up to 100 GB): UoR members can request this at no cost. This solution might be appropriate for projects involving large numbers of high-resolution image and multimedia files. Local network storage is more efficient than a cloud solution (i.e. OneDrive or Teams) for processing larger files and high volumes of files.

Bear in mind that this storage can only be accessed by University account-holders – although it is possible to apply for temporary external user accounts on behalf of team members from outside the University.

Research Data Storage service, minimum 0.5 TB (costed in project budget): Very rarely, a Digital Humanities project might need to request an allocation in the high-volume Research Data Storage service. The minimum request is 0.5 TB, and storage is charged for per TB per year, so costs would need to be included in a project budget.

For example, at UoR the Legacies of Stephen Dwoskin project required storage for the filmmaker’s digital archive, including material taken from several hard drives containing large amounts of raw film footage which amounted to about 12 TB of data. See Bartliff, Z. et al (2020), ‘Leveraging digital forensics and data exploration to understand the creative work of a filmmaker: a case study of Stephen Dwoskin’s digital archive’, Information Processing & Management, 57:6 (DOI).

Further resources

For the different storage options available at the University, and information about costs, please read the guidance on data storage.

If you have computing-intensive requirements, custom specifications of CPU, memory, storage and GPU can be purchased from the University on a pro rata basis. Information is available on the Academic and Research Computing Team website.

Organisation

Implementing a logical and consistent system to organise your data files allows you and others to locate and use them effectively, and helps to preserve the integrity of your data.

There are three key elements to data organisation:

  • A filing system: use a logical, hierarchical folder structure to store your files, grouping files in categories, and descending from broad high-level categories to more specific folders within these. Raw data files and confidential information should be stored in separate, designated folders;
  • File naming rules: file names enable files to be identified easily, and can be used to order files and to manage version control. Dates written following the ISO 8601 standard (YYYY-MM-DD) or version numbers included in file names can be used to order files sequentially (see the sketch after this list);
  • A version control policy: this is a system to record changes to files over time. It is essential where files are being modified and shared, as it preserves the provenance of data and allows changes to be rolled back if they introduce errors or files become corrupted. Only users who are authorised to make changes to files should be given ‘write’ permissions, and master and milestone versions should be clearly identified and read-only.
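
A small illustration of why ISO 8601 dates are useful in file names (the names are invented): a plain alphabetical sort also places the files in chronological order, which is not true of day-first or month-first date formats.

    # Sketch: ISO 8601-dated file names sort chronologically with an ordinary sort.
    files = [
        "2024-11-03_interview-transcript_v02.txt",
        "2023-07-19_interview-transcript_v01.txt",
        "2024-02-01_interview-transcript_v01.txt",
    ]
    for name in sorted(files):
        print(name)    # 2023-07-19... first, 2024-11-03... last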

The RDM website provides more information about organising data.

Quality control

Every action performed on data, from the point of collection onwards, presents an opportunity for errors to be introduced. Implementing quality control can help you maintain the integrity and viability of your data, by reducing the risk of introducing errors, and mitigating their impact when they occur. Quality control is both prospective and retrospective.

A useful starting point is to map your data workflows, breaking them down into the individual operations performed to capture, clean, process and transform the data. For each activity, an appropriate quality control procedure can be applied.

Visit our guidance on quality control for more information.

Prospective quality controls

  • Well-prepared data collection limits the scope for the introduction of error.
  • Data collection forms and templates can be set up in advance.
  • Spreadsheet and database software have data validation functions, which can be used, for example, to specify a controlled list of values that can be entered in a field or to ensure dates are recorded in a consistent format (a scripted equivalent is sketched after this list).
  • If you will be entering data into a database or marking up text you can establish rules covering matters such as spelling normalisation, conventions for personal and place names, and so on.
  • You can test protocols for data collection and specify them in clear documentation.
  • Instruments should be calibrated to specifications before data collection begins.
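
For projects that keep tabular data in open formats, the same kind of validation can be scripted. The sketch below (field names and rules are invented) checks a controlled vocabulary and an ISO 8601 date format:

    # Sketch: validate a controlled vocabulary and a date format in tabular records.
    import re

    ALLOWED_MATERIALS = {"parchment", "paper", "papyrus"}
    ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

    def validate(record: dict) -> list:
        errors = []
        if record["material"] not in ALLOWED_MATERIALS:
            errors.append("unexpected material: " + record["material"])
        if not ISO_DATE.match(record["date_consulted"]):
            errors.append("date not in YYYY-MM-DD format: " + record["date_consulted"])
        return errors

    print(validate({"material": "vellum", "date_consulted": "03/11/2024"}))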

Retrospective quality controls

  • Transformed data can be compared to the source, for example by checking marked-up text against the original document.
  • A data collection or transformation process could be repeated and the results compared.
  • Software applications such as databases may enable you to define and execute data quality rules, or to generate reports.
  • Visualising quantitative data in a graph or chart might help to identify suspicious anomalies.
  • Digital resources can be user-tested to identify bugs and refine the design.

Websites

You may include a project website as an output of your project. A website is an attractive and accessible gateway to resources. It may allow your audience to view and browse digitised images, or provide a customised search/query interface to your database.

Do I need a project website?

Project websites are a useful tool for pointing people to your research. However, digital resources need managing and updating, so you should think about whether a website is the best output for you, or whether you are creating one because you feel it is expected. You can often achieve the same result with a blog post or a page within the existing University webpages; then you won’t have to worry about the site going down or about providing enough content.

Here are some questions to think about. You should use these as a basis for discussion with your Research Development Manager and the Research Data Manager.

What do you want a website for? Do you have a clear rationale and plan for management?

  • Is the website to increase the project’s visibility? Remember that once you have created your website, you will need a communications plan, and you should make sure you have activated the settings that make your website visible to search engines. The Research Communications team can help with this.
  • Is the website to provide online access to a resource created by, or used by, the project? How long must this be available?

Is it a complex resource (e.g. searchable digital collection)? The Digital Humanities Hub can advise on how to design this – please talk to your Research Development Manager.

If it is not complex, consider using one of the existing public engagement platforms. You could create:

  • a blog post for the main Research website at the end of the project;
  • an ‘online exhibition’ on the Museums & Collections pages.

This will ensure that the website will be managed in the future and its software updated even if the PI leaves UoR.

All websites must be archived at the end of a project (i.e. moved from the active storage server to archive storage), so make sure you allow time for this.

Consider carefully whether a public-facing digital resource is the right kind of output for your funding call. Large-scale digital resources take substantial planning time, and a permanent contract is usually required in order for an institution to underwrite hosting costs after the end of a funding period. If you are applying for a 3- or 5-year postdoctoral or career development project, your time might be better allocated to refining your data collection and preservation so that a sustainable resource can be built in future.

Preservation of websites

The University and most funders suggest that websites remain available for a minimum of 3 years beyond the end of the project.

In many cases the ambition may be to sustain a resource for longer; but this requires commitment, and may incur hosting, maintenance and support costs.

The institution(s) participating in the project may cover these costs, or you might obtain external grant funding (in which case any costs will need to be paid before the end of the grant).

Even if your project website is the main visible gateway to a resource, a website is not a preservation solution by FAIR standards. Websites usually need continual support to remain viable (especially custom resources with bespoke features). Websites do not usually generate structured metadata records or persistent unique identifiers.

You should not try to get round this by committing to pay web hosting costs personally. This is not sustainable, and you won’t have access to technical and maintenance support from the university. As the funding was awarded to the University, there may be Intellectual Property implications, or problems if you leave and the project data is attached to a personal web account. Your circumstances might also change and you might not be able to sustain this. Instead, speak to the Digital Humanities Hub team who can help with preservation solutions.

To ensure you don’t lose your data when your funding for web hosting runs out, make sure you plan ahead for preservation of your data in a repository or archive service. This may not preserve the accessibility or user features that the website presented, and may require breaking the overall resource down into component data files, but it enables data to be retrieved and the resource recreated or further developed in the future.

Preservation and sharing

When completing a project, you will need to secure the long-term viability of your digital outputs. This protects the legacy of your project, and ensures you have met the requirements of the University’s and most public and charitable funders’ policies.

The only preservation solution for digital data generated by a project is deposit of data files into a data repository or accession into an archive service (which may be an option for some digital collections).

Neither the storage solution you used during your project nor a website hosting your resource is a digital preservation solution, for the following reasons:

  • Digital data, and the media used to store them, are especially fragile. All data are subject to degradation over time, so storage and backup alone are insufficient.
  • Files may become corrupted, and the software and hardware required to access them may become obsolete.
  • Websites are unstable and ephemeral: links break, pages are moved, and websites become unusable and eventually disappear if not actively maintained.

What does preservation involve?

Digital preservation is not a single action, but a set of processes undertaken by a service that has a mission to preserve digital objects, and to maintain their accessibility and usability over time.

Data repositories and archives are preservation services. They have staff who are skilled in digital preservation. This involves a number of processes:

  • Acceptance of the digital object in the service following its handover or deposit by the owner or custodian;
  • Analysis and validation of the digital object, to verify that the object meets the service collection policy, appraise its contents, identify any access restrictions, determine its file formats, check files are not corrupted, and generate checksums (digital signatures of file integrity);
  • Organisation and description of the object following cataloguing rules;
  • Transfer of the object into archive-standard storage, and ongoing preservation activities, such as periodic checks for corrupted files by validation of file checksums, and migration of contents when file formats become obsolete;
  • Publication of a discoverable, citable record for the digital object via an online catalogue, including an item identifier such as a catalogue number or DOI, information about its creators, rights-holders, licence terms and access restrictions, and ongoing management of access to the object.

To make sure you preserve your project’s legacy, plan for data preservation at the outset, and factor in time at the end of the project for data archiving. This should be included in the work plan section of your DMP.

Your DMP should also identify the repository or archive service in which relevant data will be deposited.

Data repositories and archive services

Research Data Archive

University members have the option of using the University of Reading Research Data Archive, which will preserve and enable access to data in the long term. Up to 20 GB of data per project can be deposited at no charge. Deposits greater than 20 GB may be subject to a charge and must be agreed in advance. If you intend to deposit more than 20 GB of data in the Archive, contact the Research Data Manager to discuss your requirements.

Domain-specific services

Some disciplines have domain-specific services that are more suitable for the preservation of data in their fields, such as the Archaeology Data Service (ADS) for archaeological data, and the UK Data Service ReShare repository for social science data. You are encouraged to use these where they are appropriate.

You should first check whether there is a cost to use the service. The ADS charges for deposit of datasets, but in most cases data repository services are free to use.

If the service does charge, you will need to obtain an estimate of the likely cost for inclusion in the project budget. If costs are being met out of an external grant, it must be possible for archiving charges to be met within the grant period.

Collections-based research

In some collections-based research, for example where a project involves digitisation of an archive’s physical holdings, preservation of data may be undertaken by an archive service. These services will catalogue the resources and may have dedicated platforms for online search and viewing.

The University’s Museums, Archives & Special Collections services can help you think about the future of a collection used in your research, even if it is not based at Reading. You can get in touch with them on merl@reading.ac.uk.

If you are creating digital resources using the University’s own collections, the Special Collections department may oversee the long-term preservation and access.

For more information, you can refer to the Digital Humanities Hub guide to working with digital collections.

Code

Any code created during your project should be archived to a public data repository, with integrated commentary and any relevant documentation. This will preserve the code and enable a specific version used to generate or analyse data to be cited by DOI.

Expectations and exemptions

Public access

Data should be made publicly accessible wherever possible, no later than publication of any findings that rely on them.

As a rule, data should be made freely accessible; most research data repositories do not charge for access to data.

If there is any intention to charge for access to data, this must be justified, for example by the commercial aims of the project or the high cost of providing access.

Referencing your data

It is good practice, and in some cases a policy requirement, to reference supporting data held in repositories from related publications by means of a data access statement, sometimes also called a data availability statement.

For example, the UKRI Open Access Policy ‘requires in-scope research articles to include a Data Access Statement, even where there are no data associated with the article or the data are inaccessible’.

Where possible, you should use a DOI (or other unique identifier) to reference your data in the access statement.

The statement should specify any access restrictions that apply to the data. We provide guidance on data access statements.

Further guidance on citing humanities research data is provided by Routledge Open Research.

When you can restrict access to data

Exemptions from the data sharing requirement may apply where there are valid legal, ethical and commercial reasons.

In most cases, data collected from research participants can be shared, so long as an ethical basis for doing so has been established (see below). Personally identifying and confidential information can be removed to anonymise or de-identify data and so make them safe for sharing. See our guidance on anonymisation.

Anonymised data collected from participants that may be at higher risk of re-identification and data containing personal or confidential information may be shared on a restricted basis using some repository services, including the UK Data Service ReShare repository and the University’s Research Data Archive. See University guidance on archiving restricted data.

It is acceptable to restrict access to data if they are commercially confidential or there is a commercial pathway for the research. If IP protection is sought, it should be possible to release data once protection has been confirmed.

Intellectual Property and licensing

You should establish the facts of intellectual property ownership before you start your research, as these will have a bearing on what you can and cannot do with any data you collect.

This is important in DH research, where existing intellectual property may be used as inputs into the research and reproduced in digital outputs. For example, if your research involved digitisation of in-copyright manuscripts or artworks, publication of the digital images and reproduction of text in a transcript without permission would be a breach of copyright.

Even if older material is no longer in copyright, there may be copyright in photographic images of the material: the test (in case law) is whether the photograph is a copy or includes a creative addition. It is safer to assume that there is copyright in the photograph.

Databases (in the legal sense of any selection and arrangement of data) also attract copyright. If you intend to reproduce substantial parts of an existing database or collection in your own digital output, consult the University’s Copyright Adviser.

Permission to use copyright material may be granted under the terms of a licence associated with the material. This is likely to be the case for material held in data repositories and archive services.

There are a number of standard open licences in common use that grant a set of defined permissions to make use of copyright material, such as the Creative Commons licences (see Licensing for re-use below).

Copyright for interviews and other participatory research

Research participants providing qualitative data, such as spoken words in an interview, photographs or other works created by themselves, will hold copyright in these materials.

You can ask participants to transfer copyright in the materials to the University, or to grant the University a licence to use and publish the materials.

  • Transfer – A transfer of copyright to the University confers freedom to publish the material on any terms, but this must be balanced against the rights of participants.
  • Licence – If the research participants do not wish to transfer copyright in their materials, or if requesting transfer of copyright is not considered appropriate, a licence grant to the University is acceptable.

In all cases you should follow the consent process and make sure all participants understand what they are agreeing to.

The UK Data Service model consent form provides an example of seeking transfer of copyright in the consent process. See also the section on Consent below.

Your authority over the licensing process

You can only choose the licence to be applied to a work where you are the copyright holder, or where you have delegated authority to act on the copyright holder’s behalf. If your outputs include or incorporate existing copyright material, then your licensing options may be constrained by existing licences.

If you create new intellectual property in the course of employment at the University, the copyright belongs to the University.

In conformity with the University’s Research Data Management Policy and Open Research Statement, members of the University have a delegated authority to publish data and other outputs under an open licence, unless there is a legal, ethical or commercial reason why this is not possible or appropriate.

You should also take into account the rights of co-creators of intellectual property (or the organisations by which they are employed), including students, employees of other institutions, and members of the public involved in participatory research.

Licensing for re-use

You should license data and digital resources under open, standard licences where possible, on the principle ‘as open as possible, as closed as necessary’. Creative Commons Attribution is a good standard default.

Re-use of data is a topic that requires considerable thought. You should weigh up the possibility of misuse against the effects on the value and impact of your resource.

Sometimes people are uncomfortable with the idea of permitting others to use their work for commercial purposes, and want to apply non-commercial or no-derivatives terms.

Legitimate re-use of materials occurs in text and data mining activities. The risk of misuse through exploitation of derivatives has traditionally been quite low.

The advent of AI technologies has changed this – if you are concerned about the use of your data for machine learning or AI models, the University’s AI Community of Practice is a good starting point for discussion.

Most data repositories include licence fields in their metadata, so that terms of use are clear and machine-readable. If you are exposing content via a website, you should make sure licence statements are clearly visible and machine-readable.

The Creative Commons License Chooser tool provides HTML code that can be embedded in a web page to mark the work.

Research ethics and data protection

If you are collecting data from research participants or relating to living individuals, you will need to consider how personal data will be managed in accordance with the requirements of the Data Protection Act, and how confidentiality will be guaranteed.

You should only collect the minimum personal and confidential data necessary to answer your research questions, and ensure these are stored safely and shared securely only with authorised persons for the stated research purpose.

You should also consider how you will maximise opportunities for sharing data derived from research participants, e.g. by anonymisation.

Sharing of data collected from research participants can only take place if a valid ethical basis is established. This usually happens when ethical approval to conduct the research is sought and during participant recruitment.

It is often possible to make anonymised data available to others. In this case, you should make your intentions clear to participants during the recruitment process, and explain that the anonymised data will not be traceable back to them.

Qualitative data and personal data about living persons

Qualitative data includes things such as video or interviews, where the value of the material is bound up with identifiable information.

With this type of data, anonymisation is not always possible or practical. Seeking consent for public sharing is one option, but this is not a strong basis for long-term preservation and sharing, as consent may be withdrawn at any time.

Safeguards are another option for preservation and sharing of personal data. Data protection law permits archive services to process personal data on a continued basis where the data are held for public interest archiving, scientific or historical research. This means that a university or other public interest organisation may archive personal data on a basis that does not depend on consent and is therefore not subject to its withdrawal.

Access to personal data held in archives for scientific, historical or statistical research purposes may be possible within safeguards (for example, under a data use agreement) and taking into account an assessment of the data subject’s right of privacy.

The University’s Research Data Archive offers a restricted access option for archiving of higher-risk anonymised and identifiable data; and the UK Data Service ReShare repository has a ‘safeguarded data’ option for archiving of higher-risk anonymised data.

If you are working with archival or pre-archival materials that contain personal data relating to living persons (for example, unpublished manuscripts or photographic works) you will also need to consider the data protection implications before you begin the work. You may not be able to make outputs including this information publicly available, and even if they can be archived, access may be very restricted.

Any data repository or archive service which takes on the material will need to be equipped to do so. For example, not all data repositories will archive datasets containing personal data, although, as mentioned above, the University’s Research Data Archive is one repository that can accept such datasets.

We provide guidance on research ethics and data protection. We also suggest you revisit the University’s mandatory UoRLearn course on Data Protection and Information Security.

Consent

You should use your information sheet and consent form to obtain informed consent.

In seeking consent from participants, if you intend to make the data accessible in anonymised form, make sure you avoid making assurances, such as a time limit on retention of the data, or a statement that data will be destroyed at the end of the project, or not shared outside of the project. These undertakings are not necessary for it to be lawful to collect the data, but you will be held to them and they will prevent you from making the data accessible. This may cause problems if you have obtained funding on the basis of publishing your data.

Summary

  • Data management begins with the conceptualisation and planning of research and should be integrated into the research methodology. Creation of a data management plan should go hand in hand with development of a research proposal.
  • Take a whole-lifetime approach: consider not just management of data during the project, but long-term sustainability of resources and underlying data. It is vital to plan for long-term preservation and access, and to use appropriate preservation infrastructure, e.g. data repositories and archive services. Project websites provide enhanced access to resources but are not preservation solutions and should not be treated as such.
  • The FAIR Principles should be integrated from the start: open standards, standardised technologies and procedures should be used where possible. Data modelling should be well-documented and use formal, accessible language for knowledge representation, persistent identifiers, open standards, generic user interfaces and rich metadata.

Next steps

We encourage you to use this guide as a starting point for discussion, rather than a comprehensive solution.

If you are undertaking a Digital Humanities project, your Research Development Manager will ask you many of the questions discussed here.

However, the best Data Management Plans are designed as early in the project as possible.

For advice more closely tailored to your project, contact the Research Data Service. You are also welcome to book a one-on-one data management consultation.

Further reading

ALLEA (2020), ‘Sustainable and FAIR data sharing in the humanities: recommendations of the ALLEA Working Group E-Humanities’. Berlin. https://doi.org/10.7486/DRI.tq582c863

Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. (2016), ‘The FAIR Guiding Principles for scientific data management and stewardship’, Sci Data 3, 160018. https://doi.org/10.1038/sdata.2016.18. For more guidance on the Principles, visit the GoFair website: https://www.go-fair.org/fair-principles/.

Munoz, T. and Flanders, J. (eds), ‘DH Curation Guide: a community resource guide to data curation in the digital humanities’. https://archive.mith.umd.edu/dhcuration-guide/guide.dhcuration.org

Posner, M. (2015), ‘Humanities data: a necessary contradiction’. Blog post. http://miriamposner.com/blog/humanities-data-a-necessary-contradiction/

Routledge Open Research, ‘Accessible data and source materials guidelines’. https://routledgeopenresearch.org/for-authors/data-guidelines

Schöch, C. (2013). ‘Big? smart? clean? messy? data in the humanities’. Journal of Digital Humanities 2, no. 3. http://journalofdigitalhumanities.org/2-3/big-smart-clean-messy-data-in-the-humanities/

The National Archives (2018), ‘Guide to archiving personal data’. https://www.nationalarchives.gov.uk/information-management/legislation/data-protection/

The National Archives, ‘Digital preservation workflows’. https://www.nationalarchives.gov.uk/archives-sector/projects-and-programmes/plugged-in-powered-up/digital-preservation-workflows/.