Working with Collections

Starting Points

If you are a researcher working with physical collections, there are ways in which you can engage digitally in your research. Collections and Digital Humanities skills can overlap, enabling you to view your sources in exciting ways that cannot be achieved through traditional methods of analysis. This resource acts as a starting point for you to gain an understanding of how you can engage with Digital Humanities in your research. It contains areas which will help you to develop the questions that you might have when approaching the Digital Humanities Hub team and/or Special Collections.

You may have experience with collections-based research but wish to engage with Digital Humanities more. You may be proficient with Digital Humanities technologies, depending on your discipline. Or, you may be new to working with collections and Digital Humanities. Whatever your level of knowledge, this resource will help you to:

If you are interested in using a collection from a particular library, museum, or archive, it is important that you contact relevant personnel at the earliest point in your project. This is so you can receive the support and advice you need to understand how you can undertake your research.

What is the difference between a collection and research data?

A collection refers to a group of cultural artefacts related in some way (e.g. through ownership, subject or theme), which is preserved and made accessible. Collections are usually housed by institutions such as a museum, library or archive, for others to use and enjoy.

Research data are information collected or created by a researcher or research team for the purpose of answering a research question.

Understanding the difference between these terms will help you to work out what provisions you need for the information that you will gather.

In any collections-based digital humanities research, a collection will be used to generate research data either by transformation of source material (digitisation, transcription and markup) or its analysis (e.g. creating a database using information extracted from archival records). Consider the type of output that will be created as a result of your project and how it will be preserved long-term. If the output is a collection in its own right or an extension of an original collection (e.g. digitised images, searchable transcript of a source archive), it may be suitable for accession into an archive or another service as part of its online collections. Research data may be more suitable for deposit in a data repository. If you feel that the information that you will gather aligns with research data, you should consider the guidance provided for research data.

What are the ways in which digital research and collections can overlap?

Collections can contribute to any research project with extant physical versions of the primary sources, but most collections-based research projects will involve one or more of these areas, as well as digital methods of analysis and visualisation. This list is by no means exhaustive and focuses on those where the physical collection features in the end resource.


Textual Analysis

The digitisation of collections enables large-scale analysis of their contents via digital analysis techniques such as corpus linguistics and natural language processing. Collections which have rich databases help researchers to select these corpora. Textual analysis can show patterns which contribute to the judgement of style or authorship.
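As a minimal sketch of the kind of pattern such analysis looks for, the Python below compares how often a few common function words appear in two short texts. The sample sentences, the word list and the function name `function_word_profile` are illustrative assumptions and are not drawn from any project named here; real stylometric work relies on far larger corpora and more careful statistics.

```python
import re
from collections import Counter

# Hypothetical list of function words; real studies use much longer lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "in"]

def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in FUNCTION_WORDS}

# Invented sample texts standing in for two digitised sources.
text_a = "The history of the press and the letters of the authors."
text_b = "To write in haste and to print in haste is to err."

profile_a = function_word_profile(text_a)
profile_b = function_word_profile(text_b)

# Print the two profiles side by side for comparison.
for word in FUNCTION_WORDS:
    print(f"{word:>4}  A={profile_a[word]:.2f}  B={profile_b[word]:.2f}")
```

Differences between the two profiles are only suggestive at this scale; they indicate where closer reading or a fuller statistical comparison might be worthwhile.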

Example: The Quill Project is a research platform which works to research the history and enhance understanding of some of the world’s foundational legal texts with a focus on the US constitution.

Digital editing

This includes critical editions, and also “genetic” editions. These are particularly useful for texts where a single print edition would obscure huge variations. Digital editing methods and interfaces can highlight the role of the editor in history, from an individual editor’s choices to the moral standards in the editing industry.

Example: The Darwin Correspondence Project locates and researches letters written by and to the evolutionary scientist Charles Darwin. It publishes complete transcripts together with contextual notes and articles.

Book history

Digital approaches have been applied to illustrating all elements of the book trade. This can include circulation, ownership history, sourcing materials required in a book, and the network of individuals involved in its publication. These networks can then be visualised in interactive ways.

Example: 15cBOOKTRADE visualises the distribution, sale, and reception of books in the Renaissance.

Machine learning for object and character recognition

Machine learning approaches explore the potential to train computers to read handwritten text, or to identify similar objects across multiple collections. “Optical Character Recognition” (OCR) is not yet advanced enough to deal with most older handwriting. However, this approach can be very useful for comparing a computer’s results with patterns identified by earlier scholars, on which most subsequent research was based. Handwritten Text Recognition has made greater advances, as an AI can be trained to read certain handwriting. Despite this, handwritten text recognition technologies still require human verification.

Example: Transkribus is a comprehensive platform for the digitisation, AI-powered text recognition, transcription and searching of historical documents.

Performance and film studies

Some scholars of performance use digital methods to (re)create performance spaces, e.g. analysing a film scene by recreating its space to show that it was a composite of shots/locations. They also use digital technologies to analyse the object of performance, e.g. by performing from a digital score put together from several different drafts.

Example: CineMetrics is a software program developed to produce data on the average shot lengths of films.

Hyperspectral imagery

Hyperspectral imagery can supplement existing knowledge about collection items, as it can increase the readability of damaged documents and those with faded inks. The technique can help with understanding object structure, facilitating identification of pigment, varnish or dye; underdrawings; past conservation treatments and deterioration of materials.

Example: The Codex Zacynthius project uses multi-spectral imaging and high-resolution visible light photography to study the palimpsest and transcribe the important commentary and biblical text of Luke.

3D reconstruction, modelling, printing

Objects can be 3D printed to enable handling and understanding of their structure. Data from collections are often brought together to contribute to 3D reconstructions of ancient sites. This is an area where the relationship between scholarly interpretation and the output is particularly obscured, and the addition of narrative explication is important.

Example: Digital Giza uses archaeological records to build 3D reconstructions of the Giza Pyramids and surrounding cemeteries and settlements.

Current Projects at Reading

The following projects are examples of digital projects that made use of UoR collections.


Modernist Archives Publishing Project (MAPP)

A critical digital archive of early 20th century publishers, which began with Leonard and Virginia Woolf’s Hogarth Press. There are currently over 2000 artefacts on the site, including unique dust jackets, author and publisher correspondences, reader’s reports, printing and production papers, illustrations, and born digital biographies of people and presses.

The Legacies of Stephen Dwoskin’s Personal Cinema

A 3-year interdisciplinary AHRC project to preserve and promote the material in the archive of the acclaimed experimental filmmaker, director and choreographer. The archives include paper documents of all kinds, computer hard drives, audio tapes, video cassettes, and thousands of photographs, slides, and negatives, plus many posters, paintings, and designs. One of the tasks of the project is to assist in the cataloguing of this material for the benefit of researchers.

The Field Project

This project funded by The Wellcome Trust involved the digitisation of 3,525 records including contact sheets and negatives. 293 films from the Ministry of Agriculture’s film library were digitised and made available on the University’s Virtual Reading Room. As a result of copyright research, the MERL is now able to submit the Ministry of Agriculture films into the European Rural History Film Database. The MERL also ran a crowdfunding project to rehouse some of the negatives/contact sheets that were catalogued.

Define Your Research Audience and Outputs

It may be helpful to define the resource you will be creating as a result of your project. Will the resource extend or enhance the collection? Consider who exactly will be the audience of your research output. Will they be fellow researchers, professionals or the wider public? How will you make your research accessible to your audience? Your audience’s ability to access your research will affect where it will be hosted. Will it be hosted by the collection host or be a standalone resource on a dedicated website? Will datasets/underlying data files be deposited in a data repository?

If you are producing a web resource which is intended to be accessed by a wider public than fellow researchers, it needs to be able to function on a range of devices (e.g. phone, tablet). However, especially in the case of research that has an impact audience or involves work with vulnerable groups, you must further bear in mind that not everyone will have access to an Internet connection or an appropriate device at all. To address this, it may be necessary to make alternative provisions. For example, the host archive may provide onsite terminals for access.

If you are creating a resource primarily for fellow researchers, you may need to balance your available funding with the principles you want your resource to embody. We advise you to make your outputs as open as possible and closed as necessary. If an important part of your project is to make collection materials accessible to scholars who cannot travel to the collection, your resource should not be behind a paywall; this would render it inaccessible to precarious and independent scholars who do not have access via an institutional subscription. Charging for access should very much be the exception and only where justified (e.g. conditions of use of third party copyright material).

Many collections were created with biases on the part of both collectors and archivists. These biases need to be identified and considered for the digital versions, for example in the terminology used for metadata categories or in the selection of material for digitisation. You may find that the digitised resources you are using are reproductions of card catalogues. You have an editorial responsibility to the source, with biases forming an important part of the historical record and an object of study for your project. Whilst authenticity is important, copying exactly how the resource was designed may not enable greater access. For example, the information in card catalogues was probably recorded for purposes that may no longer be applicable to your current audience, and relevant information may have been left out.

Further information on how to engage with the public in your research can be found on the website for the National Co-ordinating Centre for Public Engagement.

Using University Collections

The advice in this section is particularly relevant if you are interested in using the University’s Museum and Collections. The Special Collections team can support you and you are encouraged to contact them in the preliminary stages of a project. Before you approach Special Collections, it may benefit you to consider how you will engage digitally with the collection. This may involve considering areas such as:

  • Digitisation (transforming analogue material into a digital format)
  • Metadata (information attached to a resource)
  • Permissions and Licensing (many collections will have permissions for use)


Digitisation

Digitisation is any form of transformation of analogue material into a digital format. The most common types of digitised material include:

  • Manuscripts or printed textual sources
  • Visual materials (paintings or photographs)
  • Temporal media (audio recordings or films)
  • Spatial media (archaeological artefacts, museum objects or sculptures).

Digitisation can enable access to primary sources without the need to refer to the original artefacts. It is a long-standing component of archival preservation and research. Digitising material can also facilitate analyses which cannot be achieved with the physical resource, as techniques like multi-spectral imaging can reveal the composition of pigments and inks. Understanding the nature of your materials is critical to minimising damage, as digitisation may not always be appropriate for a collection.

Text transcription is the most common means of digitising manuscripts for display on an HTML website. This is often enhanced through textual mark-up such as the Text Encoding Initiative (TEI) markup, which can enable you to capture extra information about the text, such as semantic values. The markup is highly interpretative and depends on the types of research questions that you have. TEI can also be used to accomplish search and retrieval tasks, allowing you to search multiple texts at once. The outputs of TEI projects can include:

Transcriptions are often produced manually. OCR and handwriting recognition software can be used, but will require human intervention due to their limitations.
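To illustrate how TEI markup can capture semantic information and support retrieval, here is a minimal Python sketch that parses a small TEI-style fragment and lists the people and places it tags. The fragment is invented for illustration and is not a complete, valid TEI document (which would also need a header); `persName`, `placeName` and the namespace are standard TEI, while the helper name `tagged_names` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, mapped to a short prefix for use in searches.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment: a transcribed sentence with a person and a place tagged.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p>Letter from <persName>Virginia Woolf</persName> sent from
       <placeName>Richmond</placeName>.</p>
  </body></text>
</TEI>"""

def tagged_names(tei_xml, element):
    """Return the text of every `element` (e.g. 'persName') in the document."""
    root = ET.fromstring(tei_xml)
    return ["".join(node.itertext()).strip()
            for node in root.iterfind(f".//tei:{element}", TEI_NS)]

print(tagged_names(SAMPLE, "persName"))
print(tagged_names(SAMPLE, "placeName"))
```

Running the same function over many transcriptions is one simple way to search multiple texts at once, provided they have been encoded consistently.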

With text transcription, it is useful to consider what should be transcribed and who will be doing this. Will you be paying a research assistant or similar? This is something that will need to be factored into your project plan. Transcription can be an intensive process, so crowdsourcing could be a practical option. Crowdsourcing can engage with diverse communities, enabling you to incorporate different perspectives into transcriptions. However, the success of a crowdsourcing project is dependent on the level of engagement your volunteers have and how much time they commit. If crowdsourcing is a serious possibility for your project, the Digital Humanities Hub can help you to implement this. The Transcribe Bentham project is an excellent example of crowdsourcing. The project engages the public in the online transcription of original and unstudied manuscripts written by Jeremy Bentham, his correspondents and his scribes.

It is helpful to consider not only what will be gained through digitisation but also what will be lost. Digitisation creates a digital surrogate which can only capture aspects of an original. The manner in which you digitise will involve decisions about what is captured and what is discarded. Strong Digital Humanities projects include documentation of, and critical reflection on, these decisions.

If you are strongly considering digitisation as part of your project, you should contact University Museums and Special Collections Services (UMASCS) in the preliminary stages. UMASCS has expertise and infrastructure to enable a wide variety of artefacts to be digitised in a safe, secure and cost-effective way. If UMASCS cannot help directly with digitisation, they will be able to advise on standards and processes.

Starting Points:

  • What benefits will digitisation have for your project?
  • Is it appropriate for the type of materials that you are using?
  • Who will be carrying out text transcription?
  • What will be gained and lost from digitisation?
  • Have you contacted staff at the institution which holds your collection?
  • Where will the digitised resource/collection be maintained long-term?
  • Will the holder of the physical collection host this?


Metadata

Metadata is the information that you will attach to a resource. It consists of information that you would like to share about it, e.g. who made it, when it was made, the format it was captured in, etc. Metadata can allow you to make sense of huge volumes of complex data. It is important for the machine-actionability of resources (e.g. XML, IIIF for images). Whilst it can appear to be objective, metadata is actually subjective as it will depend on the choices made by the creator. Metadata can vary depending on a person’s motives, beliefs, opinions, and scholarly discipline. Therefore, standards can vary as original texts can be described in different ways. Standards can also change over time, which affects the retrieval of items.
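A minimal sketch of what a machine-actionable metadata record might look like, written in Python and printed as JSON. The field names loosely follow Dublin Core, which is an assumption made for illustration and not a standard prescribed by this resource; the records, the required-field list and the helper `missing_fields` are likewise hypothetical.

```python
import json

# Field names loosely follow Dublin Core; the required set is an assumption.
REQUIRED_FIELDS = ("title", "creator", "date", "format")

def missing_fields(record):
    """Return the required fields that are absent or empty in `record`."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

# Invented record describing a digitised item.
record = {
    "title": "Dust jacket, first edition",
    "creator": "Unknown designer",
    "date": "1927",
    "format": "image/tiff",
}

# Invented record with gaps, to show how incompleteness can be flagged.
incomplete = {"title": "Untitled letter", "date": ""}

print(json.dumps(record, indent=2))
print(missing_fields(record))
print(missing_fields(incomplete))
```

Even in this small example the choice of which fields count as required is an editorial decision, which is the subjectivity the paragraph above describes.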

At the start of your project, it is important to have an understanding of the nature of the metadata and the quantity needed. Metadata requirements will vary significantly depending on the nature of the output. If the output is a collection, such as an XML text corpus or collection of images, then the production of metadata could be more labour-intensive than for a straightforward database. Having a clear understanding of your requirements will further help in costing your project and seeking funding. You should consider who will be producing the metadata, and bear in mind that limited long-term cataloguing funds make it difficult for most institutions to add metadata at a later date, as they will be prioritising collections which are not digitally available at all. If possible, you should incorporate it into your project plan, and consider including funding for a dedicated project archivist or to buy out the time of current staff.

If your collection has been catalogued already in a physical format, think carefully about whether you want the metadata to reproduce this catalogue. It may have been created with different purposes in mind from yours, and consequently different categories of data will have been recorded.

These are critical aspects that must be evaluated before beginning the project. The DH Hub can help you assess the staffing and resource costs for the creation of your resource. The important thing to remember is that you should make contact with staff at the institution that holds your collection as soon as possible in the planning of your project. UMASCS has expertise in managing all kinds of collections-related metadata. In some cases, they may be able to provide a solution using their existing infrastructure or some adaptation of it. In most cases, they will be able to advise on standards and processes.

Starting Points:

  • What metadata are necessary/relevant for the purpose of the project?
  • Who will be producing the metadata?
  • How will you add metadata at a later date? Will you employ someone to do this?
  • Have you contacted staff at the institution which holds your collection?

Permissions and Licensing

Before you begin a project on an existing collection, you need to ensure that you obtain all relevant permissions to use it for your intended purpose. You should consider two levels of use separately: what you want to do with the material, and what you want others to be able to do.

Many collections are still under copyright, or are available under a specific licence, which could affect the extent to which your research is publicly available and shareable. If you want to work with material under copyright, or to do more than the existing licence allows, you will need to apply for permission. Remember that services which hold collections may not necessarily hold copyright for everything. You will need to first establish that permission for the required purpose is obtainable and, if relevant, at what cost. This will help you to determine whether the scope of your project is viable; it may need revising as a result of this investigation. When writing your proposal, you will need to allow sufficient time to obtain permissions from relevant third parties.

Even if material is out of copyright, it is still important to consider the conditions of access that you will be offering to future users. These exist on a spectrum. For example, a donor could be willing to fund digitisation for an access or licence fee whereas a heritage or education institution will have principles of openness, transparency, and accessibility.

In summary:

  • Make sure that you have fully investigated the copyright status of the collection you would like to work with, so that your project plan can adapt.
  • Consider in advance what you want potential users to be able to do with your data and make sure the permissions that you obtain, and/or the licensing that you apply to your digital materials, reflect this.

Starting Points:

  • Have you fully investigated the copyright status of the collection you would like to work with?
  • Have you obtained all relevant permissions to use it for your intended purpose?
  • Have you allowed enough time to obtain permissions in your project plan?
  • What are the conditions of access that you will be offering to future users?

Using Collections Elsewhere

As with the University’s Special Collections, for an external collection you need to consider how you will engage with it. Once again, it is useful to consider areas such as digitisation, metadata, and permissions and licensing.

There are strong ethical frameworks whereby heritage professionals seek to work collaboratively to ensure the best outcome for heritage items, but this can be a complex area. UMASCS can provide advice on engaging with external collections of all types, in strict confidence and in a spirit of collaboration.

Many Digital Humanities projects are inherently interdisciplinary and therefore require collaboration. This can bring many strengths to your project, as you may have greater access to expertise and a range of disciplinary perspectives. However, collaboration can pose several challenges that you need to anticipate:

  • Institutional priorities. Many collaborators will have their own institutional priorities in terms of engaging with your project. These may or may not align with your own and your institution’s motivations.
  • Permissions regarding the use of material. Your project may involve collaborating with an institution because you share collections. You need to ensure that you have the permission to use the other institution’s collections and understand the extent of this use.

New Collections

This section will consider how you should approach creating a potential new collection. First, you need to determine whether the information you will gather for a project will constitute a collection or research data. As defined at the start of this resource:

A collection refers to a group of cultural artefacts related in some way (e.g. through ownership, subject or theme), which is preserved and made accessible. Collections are usually housed by institutions such as a museum, library or archive, for others to use and enjoy.

Research data are information collected or created by a researcher or research team for the purpose of answering a research question.

If it is a collection, it can be helpful to consider aspects such as:

  • Born-digital material (material that has never existed in any other form than digital)
  • Digital preservation (how you will ensure the sustainability and storage of your project over time)
  • Ethical considerations (collecting material related to living persons).

The University has formal processes for adding new material to its museums and collections and this will sometimes involve making a case to the Collections Governance Committee. If you are strongly considering a new collection, please discuss this with UMASCS in the first instance.

Born-digital material

Born-digital material is material that has never existed in any other form than digital. This can include:

  • Emails and texts
  • Digital documents e.g. Word documents, PDFs
  • Social media outputs (e.g. Tweets) as well as data about those (e.g. analytics)
  • Digitally formatted media including photos, videos, audio
  • Files from computer software

Archives are now increasingly accessioning born-digital collections. This ties into the emergence of humanities “big data”, as born-digital collections can create issues surrounding volume. Examples include the archive of a Prime Minister’s email inbox or a collection of a US President’s tweets. Such collections can contain thousands of documents, too many to be read and navigated by humans unaided. It is helpful to consider how you will store the information and what search and retrieval tools you will use. This ensures that your collection is searchable and accessible so that meaning can be inferred from it. You may want to consider creating or using an online repository for your born-digital collection.

Starting Points:

  • How will you store born-digital information?
  • What search and retrieval tools will you use so that your collection is accessible?
  • Is creating or using an online repository appropriate for your project?

Digital Preservation

Digital heritage is equally at risk of loss or damage as physical heritage, if not more so. The University of Reading has a Policy for Museums and Collections that recognises the wide variety of formats in the collections that the University holds and commits the University to long-term planned stewardship and the adoption of standards that will secure their long-term preservation and accessibility. Online preservation of data may be undertaken by an archive service which will catalogue resources and have dedicated platforms for online search and viewing. If you are creating digital resources using the University’s collections, long-term preservation and access to the digital resources may be undertaken by UMASCS.

It is important that you factor plans for preservation into your projects. Consider how long funding will last for your project. Many institutions will have limited budgets for long-term digitisation, affecting the longevity of the project. Consequently, they will rely on short-term project funding which can lead to available material not being representative of the collection, lots of individual siloed project websites, and images which do not have associated metadata.

It is important that your preservation solution is discussed with the service that will host the collection. The Digital Humanities Hub, in conjunction with UMASCS, can help you consider the following aspects regarding the preservation of your digital projects:

  • How the project can maintain material over time. If the output is a collection, you should aim for it to be accessioned into an appropriate archive service. For long-term sustainability, it is necessary to deposit data in a data repository. Think about creating several copies, in different physical locations and using different storage technologies.
  • How the project will adapt to changing technologies. Obsolescence occurs when new technologies develop but older technologies are left behind and support for them ends. This can happen for both hardware and software, as operating systems change. A widely-cited example is the 1986 BBC Domesday Project, a multimedia survey of life in the UK created to mark the 900th anniversary of the Domesday Book, an 11th-century survey of England. This project was stored on adapted LaserDiscs, which over time created preservation issues as drives capable of accessing the discs were scarce nearly 20 years later. To avoid obsolescence, consider how your collection can migrate to be supported by current software and what equipment is needed to access it. This is vital to prevent your collection from becoming unusable.
  • How fragile your material is. It can be surprising to learn that digital collections are often more fragile in nature than analogue ones. Digital files can be easily lost through technological failure, corruption or deliberate alteration. Storage media can decay over time, with the consequences only obvious once the damage has been done. You will need to review your storage every 3-5 years and move content onto new storage. For further information, you can refer to the University’s Research Data Management guidance.
  • What file formats you will use. Determining the most appropriate file formats for your project is critical in terms of storage and access. Adopting widely used formats like CSV will mean that materials collected are likely to endure and continue to be supported. Other formats, which were developed for specific software or hardware, may encounter issues regarding longevity; an example is the project format of Scrivener, software which was originally developed for Mac only. You may need to weigh up the advantages and disadvantages of different file formats to ensure that your project is sustainable. JPEGs are small files, allowing for vast storage and quick transfer. However, because JPEG uses lossy compression, this can result in a loss of quality. TIFF files can be more appropriate for preserving images over a long period of time, as they are typically stored uncompressed or with lossless compression. Nevertheless, this results in bigger files, which require more storage space and take longer to load and transfer.
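One common way to detect the silent corruption or alteration described in the list above is to record a checksum for each file and compare it across copies. The Python below is a minimal sketch of that idea using the standard `hashlib` module; the function names `sha256_of` and `copies_match` and the temporary files are hypothetical, and it does not describe any procedure used by UMASCS or the University.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large image or audio files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copies_match(paths):
    """True if every file in `paths` has the same checksum."""
    return len({sha256_of(p) for p in paths}) == 1

# Demonstration with temporary files standing in for stored copies.
with tempfile.TemporaryDirectory() as tmp:
    master = Path(tmp) / "master.csv"
    backup = Path(tmp) / "backup.csv"
    master.write_text("id,title\n1,Letter\n")
    backup.write_text("id,title\n1,Letter\n")
    intact = copies_match([master, backup])

    backup.write_text("id,title\n1,Lettor\n")  # simulate corruption
    damaged = copies_match([master, backup])

print(intact, damaged)
```

Storing the recorded checksums alongside each copy means a periodic storage review can confirm that files are unchanged before they are moved onto new media.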

For further guidance, the National Archives has information on digital preservation workflows.


If it is a non-collection resource or research data, then hosting it on a project website would be appropriate. A minimum of 3 years for preservation post-project would be desirable. You must keep in mind that a website is not a preservation solution, as it is ephemeral in nature and requires continual support to remain viable. The webserver cannot be relied upon for primary storage and access to the data. Ensure that up-to-date master copies of your files are held in your primary storage location.

Starting Points:

  • How long do you have funding and what can you feasibly achieve?
  • Who will perform updates to maintain the project in the long-term?
  • How will your collection migrate so that it can be supported by current software and hardware?
  • How many copies will you keep of your documents?
  • What storage media will you use?
  • What file formats will you use?

Ethical Considerations

You need to be aware of ethical responsibilities when you are creating and using collections. Consider who owns the materials within the collection and whether you have sought permission to use these. This relates to the earlier Permissions and Licensing section, which you must consider when engaging with an existing collection. The permissions associated with materials can affect the extent to which you can use and share them. Ethical approval is also needed for research involving human subjects, samples or personal data. For further information on how to obtain approval from the University’s Research Ethics Committee, you can consult this website.

There may be ethical challenges regarding terminologies, assumptions and unconscious bias within the material that you will be gathering. For example, if a collection is based on a particular community, it is important that you engage with them. Not only will this help to increase the quality of your research project, it will also help you to approach material in a culturally sensitive way. Be aware that there are different terminologies for communities and make sure that the terms you use are socially conscious and inclusive. For further information, consult the University of Reading’s Diversity website.

You may be collecting material which is associated with living or deceased persons. If this is the case, you need to ensure that the material has been screened for sensitivity issues. You should consider who will be screening the data, as you may need to allocate funding for this. Participants must consent to participate and can reserve the right to withdraw from your project. If data is collected from them, participants can only request removal of their data if they have withdrawn from participation and their data has not been anonymised. Personal data in research is collected on the public task lawful basis, which does not give the data subject the right to withdraw their personal data.

Participants should also understand their rights over their personal outputs, such as interviews and work produced. An individual can reserve the right to remove these outputs. For deceased persons, you may need to seek permission from living relatives. If participants produce copyright material and grant a licence to use it or transfer copyright to you, then they do not have the right to withdraw that material. For further information, you can consult the Information Commissioner’s Office.

It is important that the consent form enables participants to understand what they will be contributing to a project and how their personal information will be used, shared and stored, not only during the project but in its aftermath. This is a requirement of the UK General Data Protection Regulation (UK GDPR) under the Data Protection Act 2018. If you are asking participants to contribute material, you may need to describe what will happen to this and who will access it. For an example of how to produce a consent form, consider the Children’s Landscapes project. You can also see guidance on participant information sheets and consent forms from the University’s Information Management and Policy Services (IMPS). For further data protection issues, you should consult resources from IMPS.

If you are planning to conduct surveys for a project, the University has guidance for a range of online survey tools, including Online Surveys, Qualtrics and UoR REDCap.

Starting Points:

  • Are you collecting material which is associated with living or deceased persons?
  • Have you communicated how you will be using, storing and sharing personal information with your participants or their relatives?
  • Who will be screening material for sensitivity issues?

Note: This page was written by Chloe Bolsover, former Senior Library Assistant (Research Engagement). Updates by Guy Baxter (Associate Director of Archive Services) and Olivia Thompson (Digital Humanities Officer).