{"id":2110,"date":"2024-09-06T10:00:58","date_gmt":"2024-09-06T09:00:58","guid":{"rendered":"https:\/\/research.reading.ac.uk\/digitalhumanities\/?p=2110"},"modified":"2024-09-03T10:17:15","modified_gmt":"2024-09-03T09:17:15","slug":"developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example","status":"publish","type":"post","link":"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/","title":{"rendered":"Developing AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example"},"content":{"rendered":"<p style=\"text-align: center\">by Abdulrahman A. A. Alsayed<br \/>\nUniversity of St Andrews, United Kingdom; King Faisal University, Saudi Arabia<\/p>\n<p>Recent strides in generative AI development and the emerging abilities of pre-trained models (Wei et al., 2022) hold significant potential to automate tasks and alleviate hurdles in research workflows. Research teams can explore utilizing AI-based agents that they train to help delegate various components of the research pipeline. Developing these agents involves a process of fine-tuning a general-purpose foundation model to power them with the capability of handling specialized research tasks and to work autonomously. 
Such agents have proven effective at reducing the demands of labor-intensive tasks, both within research contexts and beyond (Turow, 2024; see Cheng et al., 2024 for an overview).<\/p>\n<figure id=\"attachment_2113\" aria-describedby=\"caption-attachment-2113\" style=\"width: 384px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-2113 size-full\" src=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/concept-art.png\" alt=\"\" width=\"384\" height=\"384\" srcset=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/concept-art.png 384w, https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/concept-art-300x300.png 300w, https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/concept-art-150x150.png 150w\" sizes=\"auto, (max-width: 384px) 100vw, 384px\" \/><figcaption id=\"caption-attachment-2113\" class=\"wp-caption-text\">Concept art generated with Stable Diffusion. Prompt: &#8216;a central AI bot depicted with a marker in hand busily working with a volume of spreadsheets&#8217;<\/figcaption><\/figure>\n<p>Linguistic work based on empirical studies of corpora often requires highly annotated datasets tagged with specialized linguistic information, which can be labor-intensive to produce manually, depending on a project&#8217;s scale. In my current project, code-named CEERR (the Communicative Events Extended Reality Repository), I seek to establish a platform for documenting and analyzing the natural syntax of spoken language in modern varieties produced within immersive spatial environments. One of the aims of developing this platform is to seamlessly incorporate linguistic corpus tools, AI APIs, and virtual reality (VR) immersive experiences, used as visual stimuli for speakers, into one ecosystem. 
The project&#8217;s motivation is to enhance language documentation methods to fully capture communication environments and facilitate collaboration between speech communities and language experts in simulated virtual fieldwork settings. One of the project&#8217;s important outputs is the production of high-quality annotated datasets with audiovisual samples for low-resource language varieties that have seen little documentation work. Alsayed (2023) provides an overview of the project&#8217;s proof-of-concept and its data collection methodology.<\/p>\n<figure id=\"attachment_2116\" aria-describedby=\"caption-attachment-2116\" style=\"width: 384px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-2116\" style=\"letter-spacing: 0.08px\" src=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/participant-responding.jpg\" alt=\"\" width=\"384\" height=\"449\" srcset=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/participant-responding.jpg 396w, https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/participant-responding-257x300.jpg 257w\" sizes=\"auto, (max-width: 384px) 100vw, 384px\" \/><figcaption id=\"caption-attachment-2116\" class=\"wp-caption-text\">A participant responding to visual stimuli during virtual reality-mediated linguistic fieldwork<\/figcaption><\/figure>\n<p>Audiovisual corpora of spoken language and corpora of low-resource languages are especially challenging to compile into abstractable datasets. Audiovisual corpora are inherently data-rich and encapsulate textual, auditory, and visual modalities. Low-resource languages lack reference corpora and dictionaries that could facilitate their annotation. Taken together, these challenges make creating and annotating such datasets a significant bottleneck, especially when large-scale annotation work is required. 
One technique I explored to alleviate the hurdle in textual annotation was fine-tuning large language models (LLMs) to tag text with linguistic annotations (see Naveed et al., 2023 for an overview of fine-tuning techniques). The fine-tuning process involves engineering system-instruction prompts, utilizing in-context learning, and preparing demonstration datasets tailored to the desired tasks throughout the project&#8217;s stages. The emergent classification abilities of LLMs make them excellent candidates for training as linguistic annotators of corpora through these fine-tuning techniques.<\/p>\n<p>In August 2023, manual annotation of a sample documented in the CEERR project led to the creation of a prototype system that assists with the automated linguistic tagging of further samples. The prototype features LLM functionality powered by the LangChain framework and a graphical user interface built with the PyQt5 library. The system is built to be compatible with various LLM APIs and incorporates fine-tuning instructions and demonstration datasets customizable for different annotation needs. It can be further configured to allow for model selection, hyperparameter adjustment, and tag-set additions as required. Users can feed corpora into the system manually or through file upload, and the generated completion output is displayed in a text editor with options to export the data in multiple formats compatible with a variety of spreadsheet and corpus editing software.<\/p>\n<figure id=\"attachment_2114\" aria-describedby=\"caption-attachment-2114\" style=\"width: 752px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-2114 size-full\" src=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/dataset-annotations.png\" alt=\"Rows include: Token (e.g. \u0627\u0644\u0648\u0627\u0646), Transliteration (e.g. alwan), Translation (e.g. colors), Form (e.g. 
N), Visual (e.g. SC3SQ05), Referent (e.g. R3: Notepad and Crayon (7))\" width=\"752\" height=\"218\" srcset=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/dataset-annotations.png 752w, https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/dataset-annotations-300x87.png 300w\" sizes=\"auto, (max-width: 752px) 100vw, 752px\" \/><figcaption id=\"caption-attachment-2114\" class=\"wp-caption-text\">Dataset annotations displayed in the EXMARaLDA Partitur Editor<\/figcaption><\/figure>\n<p>The prompts for the system instructions were engineered iteratively based on the project&#8217;s prototype corpus and initial tests of the annotation instructions. The instructions assign the agent its task, provide it with the desired annotation tags, and specify the output format. They underwent testing with multiple LLMs and hyperparameter configurations to engineer prompts that generated satisfactory annotations. Following prompt engineering, a demonstration dataset of the desired completions was manually created to further fine-tune the models through few-shot learning. Few-shot learning helps mitigate hallucination-related errors in the tagged completions. Each model&#8217;s performance after fine-tuning was evaluated, and the best-performing model was selected as the system&#8217;s default. A human-in-the-loop approach was also employed to manually correct the generated annotations and ensure accuracy.<\/p>\n<p>On the visual side, an ongoing component of the project is exploring the use of computer vision (CV) for detecting the objects that speakers see in virtual environments. This technique is useful for determining which visual elements were part of the speaker&#8217;s immersive experience at any given time and for time-aligning visual stimuli with speech production. 
This mapping of visual input and linguistic output offers a record of the spatiotemporal environment the speaker was experiencing at each moment of speech, helping to mitigate the ephemerality of the communicative setting.<\/p>\n<p>Adding annotations to the dataset informed by visual input starts with identifying all the objects contained in a stimulus virtual environment (a scene) and the actions they were involved in (a sequence). Since the environments are replicated across multiple speakers, utilizing CV to acquire information about the visual elements in a scene helps resolve the bottleneck of manually detecting these objects in the visual captures of each individual participant&#8217;s VR feed. Currently, the identified objects serve as terms in a custom LabelBox ontology and are labeled in a sample feed through a bounding-box interface, for use with CV models that support custom ontologies. A future step in the project is to explore how multi-agent systems can enable linguistic and visual annotator agents to work in tandem, providing time-aligned visual\/linguistic annotations while exchanging information that aids each agent&#8217;s tasks.<\/p>\n<figure id=\"attachment_2115\" aria-describedby=\"caption-attachment-2115\" style=\"width: 1240px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-2115 size-full\" src=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/labelbox.png\" alt=\"A website interface with multiple panels including a selectable list of annotation terms and their codenames (SC1(3) Accordion, for example), a video player with video control buttons, and a video timeline with a movable frame marker.\" width=\"1240\" height=\"530\" srcset=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/labelbox.png 1240w, 
https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/labelbox-300x128.png 300w, https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/labelbox-1024x438.png 1024w, https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/labelbox-768x328.png 768w\" sizes=\"auto, (max-width: 1240px) 100vw, 1240px\" \/><figcaption id=\"caption-attachment-2115\" class=\"wp-caption-text\">LabelBox computer vision custom ontology and annotation interface<\/figcaption><\/figure>\n<p>We have only started to explore the potential of integrating generative AI-based agents into linguistic research. As highlighted in this post, this integration shows promise for documenting and processing audiovisual corpora and low-resource spoken languages. This line of enquiry still holds significant potential for improving the efficiency and quality of research outputs in the field and remains largely unexplored.<\/p>\n<p>&nbsp;<\/p>\n<p>Alsayed, A. A. A. (2023). Extended reality language research: Data sources, taxonomy and the documentation of embodied corpora. Modern Languages Open, Digital Modern Languages Special Collection. Liverpool University Press. <a href=\"https:\/\/modernlanguagesopen.org\/articles\/10.3828\/mlo.v0i0.441\" target=\"_blank\" rel=\"noopener\">https:\/\/modernlanguagesopen.org\/articles\/10.3828\/mlo.v0i0.441<\/a> (Retrieved December 20th, 2023).<\/p>\n<p>Cheng, Y., Zhang, C., Zhang, Z., Meng, X., Hong, S., Li, W., et al. (2024). Exploring large language model-based intelligent agents: Definitions, methods, and prospects. ArXiv Preprint. <a href=\"https:\/\/arxiv.org\/abs\/2401.03428\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/abs\/2401.03428<\/a> (Retrieved May 29th, 2024).<\/p>\n<p>Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., et al. (2023). A comprehensive overview of large language models. ArXiv Preprint. 
<a href=\"https:\/\/doi.org\/10.48550\/arXiv.2307.06435\" target=\"_blank\" rel=\"noopener\">https:\/\/doi.org\/10.48550\/arXiv.2307.06435<\/a> (Retrieved January 15th, 2024).<\/p>\n<p>Turow, J. (2024, June 5). The rise of AI agent infrastructure. Madrona. <a href=\"https:\/\/www.madrona.com\/the-rise-of-ai-agent-infrastructure\/\" target=\"_blank\" rel=\"noopener\">https:\/\/www.madrona.com\/the-rise-of-ai-agent-infrastructure\/<\/a> (Retrieved June 9th, 2024).<\/p>\n<p>Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., et al. (2022). Emergent abilities of large language models. ArXiv Preprint. <a href=\"https:\/\/doi.org\/10.48550\/arXiv.2206.07682\" target=\"_blank\" rel=\"noopener\">https:\/\/doi.org\/10.48550\/arXiv.2206.07682<\/a> (Retrieved May 29th, 2024).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>by Abdulrahman A. A. Alsayed University of St Andrews, United Kingdom; King Faisal University, Saudi Arabia Recent strides in generative AI development and the emerging abilities of pre-trained models (Wei&#8230;<a class=\"read-more\" 
href=\"&#104;&#116;&#116;&#112;&#115;&#58;&#47;&#47;&#114;&#101;&#115;&#101;&#97;&#114;&#99;&#104;&#46;&#114;&#101;&#97;&#100;&#105;&#110;&#103;&#46;&#97;&#99;&#46;&#117;&#107;&#47;&#100;&#105;&#103;&#105;&#116;&#97;&#108;&#104;&#117;&#109;&#97;&#110;&#105;&#116;&#105;&#101;&#115;&#47;&#100;&#101;&#118;&#101;&#108;&#111;&#112;&#105;&#110;&#103;&#45;&#97;&#105;&#45;&#98;&#97;&#115;&#101;&#100;&#45;&#97;&#103;&#101;&#110;&#116;&#115;&#45;&#102;&#111;&#114;&#45;&#97;&#117;&#116;&#111;&#109;&#97;&#116;&#105;&#110;&#103;&#45;&#108;&#105;&#110;&#103;&#117;&#105;&#115;&#116;&#105;&#99;&#45;&#114;&#101;&#115;&#101;&#97;&#114;&#99;&#104;&#45;&#119;&#111;&#114;&#107;&#102;&#108;&#111;&#119;&#115;&#45;&#102;&#105;&#110;&#101;&#45;&#116;&#117;&#110;&#105;&#110;&#103;&#45;&#99;&#111;&#114;&#112;&#117;&#115;&#45;&#97;&#110;&#110;&#111;&#116;&#97;&#116;&#111;&#114;&#115;&#45;&#97;&#115;&#45;&#97;&#110;&#45;&#101;&#120;&#97;&#109;&#112;&#108;&#101;&#47;\">Read More ><\/a><\/p>\n","protected":false},"author":902,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"__cvm_playback_settings":[],"__cvm_video_id":"","footnotes":""},"categories":[9,98],"tags":[29,54,128,101,70,67,126,66,125,37,127],"coauthors":[46],"class_list":["post-2110","post","type-post","status-publish","format-standard","hentry","category-blog","category-dh-ai-series","tag-ai","tag-artificial-intelligence","tag-audiovisual-corpora","tag-computer-vision","tag-generative-ai","tag-large-language-models","tag-linguistic-annotation","tag-llms","tag-low-resource-languages","tag-virtual-reality","tag-visual-annotation"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v21.8.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Developing 
AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example - University of Reading Digital Humanities Hub<\/title>\n<meta name=\"description\" content=\"by Abdulrahman A. A. Alsayed, University of St Andrews, United Kingdom; King Faisal University, Saudi Arabia. Recent strides in generative AI development and the emerging abilities of pre-trained models (Wei et al., 2022) hold significant potential to automate tasks and alleviate hurdles in research workflows.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/\" \/>\n<meta property=\"og:locale\" content=\"en_GB\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Developing AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example - University of Reading Digital Humanities Hub\" \/>\n<meta property=\"og:description\" content=\"by Abdulrahman A. A. Alsayed, University of St Andrews, United Kingdom; King Faisal University, Saudi Arabia. 
Recent strides in generative AI development and the emerging abilities of pre-trained models (Wei et al., 2022) hold significant potential to automate tasks and alleviate hurdles in research workflows.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/\" \/>\n<meta property=\"og:site_name\" content=\"University of Reading Digital Humanities Hub\" \/>\n<meta property=\"article:published_time\" content=\"2024-09-06T09:00:58+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-09-03T09:17:15+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/concept-art.png\" \/>\n<meta name=\"author\" content=\"Dawn Kanter\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@UniofReading\" \/>\n<meta name=\"twitter:site\" content=\"@UniofReading\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Dawn Kanter\" \/>\n\t<meta name=\"twitter:label2\" content=\"Estimated reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/\"},\"author\":{\"name\":\"Dawn 
Kanter\",\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/#\/schema\/person\/a276676d1cb481204c2c6a58b78d644d\"},\"headline\":\"Developing AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example\",\"datePublished\":\"2024-09-06T09:00:58+00:00\",\"dateModified\":\"2024-09-03T09:17:15+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/\"},\"wordCount\":1234,\"publisher\":{\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/#organization\"},\"keywords\":[\"AI\",\"artificial intelligence\",\"audiovisual corpora\",\"computer vision\",\"generative AI\",\"large language models\",\"linguistic annotation\",\"LLMs\",\"low-resource languages\",\"virtual reality\",\"visual annotation\"],\"articleSection\":[\"Blog\",\"DH &amp; AI Series\"],\"inLanguage\":\"en-GB\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/\",\"url\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/\",\"name\":\"Developing AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example - University of Reading Digital Humanities Hub\",\"isPartOf\":{\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/#website\"},\"datePublished\":\"2024-09-06T09:00:58+00:00\",\"dateModified\":\"2024-09-03T09:17:15+00:00\",\"description\":\"by Abdulrahman A. A. Alsayed, University of St Andrews, United Kingdom; King Faisal University, Saudi Arabia. 
Recent strides in generative AI development and the emerging abilities of pre-trained models (Wei et al., 2022) hold significant potential to automate tasks and alleviate hurdles in research workflows.\",\"breadcrumb\":{\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/#breadcrumb\"},\"inLanguage\":\"en-GB\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Developing AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/#website\",\"url\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/\",\"name\":\"University of Reading Digital Humanities Hub\",\"description\":\"University of Reading Library\",\"publisher\":{\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/#organization\"},\"alternateName\":\"Digital Humanities Hub\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/?s={search_term_string}\"},\"query-input\":\"required 
name=search_term_string\"}],\"inLanguage\":\"en-GB\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/#organization\",\"name\":\"University of Reading Digital Humanities Hub\",\"url\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/logos\/readinglogo.gif\",\"contentUrl\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/logos\/readinglogo.gif\",\"width\":2386,\"height\":782,\"caption\":\"University of Reading Digital Humanities Hub\"},\"image\":{\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/twitter.com\/UniofReading\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/#\/schema\/person\/a276676d1cb481204c2c6a58b78d644d\",\"name\":\"Dawn Kanter\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-GB\",\"@id\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/#\/schema\/person\/image\/dc9710477bd3553c193712408a7d281d\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/bbf76d4ff06a988562e8d0ade616af6a9c5e13cab9e515e926d4fc19162e549e?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/bbf76d4ff06a988562e8d0ade616af6a9c5e13cab9e515e926d4fc19162e549e?s=96&d=mm&r=g\",\"caption\":\"Dawn Kanter\"},\"url\":\"https:\/\/research.reading.ac.uk\/digitalhumanities\/author\/lk930777reading-ac-uk\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Developing AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example - University of Reading Digital Humanities Hub","description":"by Abdulrahman A. A. 
Alsayed, University of St Andrews, United Kingdom; King Faisal University, Saudi Arabia. Recent strides in generative AI development and the emerging abilities of pre-trained models (Wei et al., 2022) hold significant potential to automate tasks and alleviate hurdles in research workflows.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/","og_locale":"en_GB","og_type":"article","og_title":"Developing AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example - University of Reading Digital Humanities Hub","og_description":"by Abdulrahman A. A. Alsayed, University of St Andrews, United Kingdom; King Faisal University, Saudi Arabia. Recent strides in generative AI development and the emerging abilities of pre-trained models (Wei et al., 2022) hold significant potential to automate tasks and alleviate hurdles in research workflows.","og_url":"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/","og_site_name":"University of Reading Digital Humanities Hub","article_published_time":"2024-09-06T09:00:58+00:00","article_modified_time":"2024-09-03T09:17:15+00:00","og_image":[{"url":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/2024\/09\/concept-art.png"}],"author":"Dawn Kanter","twitter_card":"summary_large_image","twitter_creator":"@UniofReading","twitter_site":"@UniofReading","twitter_misc":{"Written by":"Dawn Kanter","Estimated reading time":"7 
minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/#article","isPartOf":{"@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/"},"author":{"name":"Dawn Kanter","@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/#\/schema\/person\/a276676d1cb481204c2c6a58b78d644d"},"headline":"Developing AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example","datePublished":"2024-09-06T09:00:58+00:00","dateModified":"2024-09-03T09:17:15+00:00","mainEntityOfPage":{"@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/"},"wordCount":1234,"publisher":{"@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/#organization"},"keywords":["AI","artificial intelligence","audiovisual corpora","computer vision","generative AI","large language models","linguistic annotation","LLMs","low-resource languages","virtual reality","visual annotation"],"articleSection":["Blog","DH &amp; AI Series"],"inLanguage":"en-GB"},{"@type":"WebPage","@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/","url":"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/","name":"Developing AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example - University of Reading Digital Humanities 
Hub","isPartOf":{"@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/#website"},"datePublished":"2024-09-06T09:00:58+00:00","dateModified":"2024-09-03T09:17:15+00:00","description":"by Abdulrahman A. A. Alsayed, University of St Andrews, United Kingdom; King Faisal University, Saudi Arabia. Recent strides in generative AI development and the emerging abilities of pre-trained models (Wei et al., 2022) hold significant potential to automate tasks and alleviate hurdles in research workflows.","breadcrumb":{"@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/#breadcrumb"},"inLanguage":"en-GB","potentialAction":[{"@type":"ReadAction","target":["https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/developing-ai-based-agents-for-automating-linguistic-research-workflows-fine-tuning-corpus-annotators-as-an-example\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/research.reading.ac.uk\/digitalhumanities\/"},{"@type":"ListItem","position":2,"name":"Developing AI-based Agents for Automating Linguistic Research Workflows: Fine-tuning Corpus Annotators as an Example"}]},{"@type":"WebSite","@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/#website","url":"https:\/\/research.reading.ac.uk\/digitalhumanities\/","name":"University of Reading Digital Humanities Hub","description":"University of Reading Library","publisher":{"@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/#organization"},"alternateName":"Digital Humanities 
Hub","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/research.reading.ac.uk\/digitalhumanities\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-GB"},{"@type":"Organization","@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/#organization","name":"University of Reading Digital Humanities Hub","url":"https:\/\/research.reading.ac.uk\/digitalhumanities\/","logo":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/#\/schema\/logo\/image\/","url":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/logos\/readinglogo.gif","contentUrl":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-content\/uploads\/sites\/233\/logos\/readinglogo.gif","width":2386,"height":782,"caption":"University of Reading Digital Humanities Hub"},"image":{"@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/twitter.com\/UniofReading"]},{"@type":"Person","@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/#\/schema\/person\/a276676d1cb481204c2c6a58b78d644d","name":"Dawn Kanter","image":{"@type":"ImageObject","inLanguage":"en-GB","@id":"https:\/\/research.reading.ac.uk\/digitalhumanities\/#\/schema\/person\/image\/dc9710477bd3553c193712408a7d281d","url":"https:\/\/secure.gravatar.com\/avatar\/bbf76d4ff06a988562e8d0ade616af6a9c5e13cab9e515e926d4fc19162e549e?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/bbf76d4ff06a988562e8d0ade616af6a9c5e13cab9e515e926d4fc19162e549e?s=96&d=mm&r=g","caption":"Dawn 
Kanter"},"url":"https:\/\/research.reading.ac.uk\/digitalhumanities\/author\/lk930777reading-ac-uk\/"}]}},"cc_featured_image_caption":{"caption_text":"","source_text":"","source_url":""},"_links":{"self":[{"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/posts\/2110","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/users\/902"}],"replies":[{"embeddable":true,"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/comments?post=2110"}],"version-history":[{"count":10,"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/posts\/2110\/revisions"}],"predecessor-version":[{"id":2141,"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/posts\/2110\/revisions\/2141"}],"wp:attachment":[{"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/media?parent=2110"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/categories?post=2110"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/tags?post=2110"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/research.reading.ac.uk\/digitalhumanities\/wp-json\/wp\/v2\/coauthors?post=2110"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}