Application of generative language models and retrieval-augmented generation in building a cattle disease question-answering chatbot

Date: 2025-06-20
Location: Room 407
Advisor: Hsin-I Chiang
Student: Yen Linhuang

Abstract

In recent years, large language models (LLMs) have shown strong abilities in semantic understanding and content generation in knowledge-intensive fields. They are increasingly seen as valuable tools for digital management in animal husbandry, supporting information integration and decision-making. However, LLMs trained on general data often produce errors and hallucinations in specialized domains. To address this limitation, this study applies a Retrieval-Augmented Generation (RAG) approach, grounding responses in a curated livestock disease database to improve accuracy and reliability. Using 1,178 bovine disease records from the Taiwan Ministry of Agriculture Animal Disease Database, a semantic retrieval database was built and integrated with RAG, producing a user-friendly bovine disease question-answering (QA) system that supports natural-language symptom queries and real-time advice. The research design included two main experiments. Experiment 1 benchmarked five state-of-the-art sentence embedding models (multilingual-e5-small, multilingual-e5-large, stella-base-zh, stella_base_zh_v3_1792d, and bge-m3), selected from the global and Chinese leaderboards of the Massive Text Embedding Benchmark (MTEB), on the 1,178 QA pairs and evaluated their Top-1 and Top-5 retrieval accuracy. Experiment 2 evaluated a commercial generative language model (GPT-4o-mini) on 300 bovine disease QA tasks, both with and without RAG integration. Performance was assessed with BERTScore (Precision, Recall, and F1 Score), which computes cosine similarities between the contextual token embeddings of generated and reference answers to measure semantic quality. In the retrieval experiment, bge-m3 achieved the best results (Top-1 = 96.69%, Top-5 = 99.75%), followed by stella_base_zh_v3_1792d (Top-1 = 94.65%, Top-5 = 99.41%) and multilingual-e5-large (Top-1 = 92.11%, Top-5 = 98.39%), indicating that bge-m3 is particularly well suited to semantic retrieval in the livestock domain. In the generation experiment on the 300 test questions, GPT-4o-mini without RAG achieved a BERTScore Precision of 0.5535, Recall of 0.6344, and F1 Score of 0.5902; with RAG, these improved to 0.6297, 0.7435, and 0.6795, respectively, confirming that retrieval augmentation helps LLMs capture domain-specific knowledge and improves the correctness of responses. The approach can be readily adapted to other livestock and poultry species, providing an empirical foundation for developing intelligent question-answering systems in animal agriculture.
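
The embedding-based retrieval step of Experiment 1 can be illustrated with a minimal sketch. This is not the study's actual code: it assumes the records are already loaded as aligned question/reference lists, uses the public "BAAI/bge-m3" checkpoint via the sentence-transformers library, and all variable names are illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# bge-m3 was the best-performing embedder in Experiment 1.
model = SentenceTransformer("BAAI/bge-m3")

questions = ["Example symptom query 1", "Example symptom query 2"]   # stand-ins for the 1,178 queries
references = ["Matching disease record 1", "Matching disease record 2"]  # aligned gold records

# Encode and L2-normalize so dot products equal cosine similarities.
q_emb = model.encode(questions, normalize_embeddings=True)
r_emb = model.encode(references, normalize_embeddings=True)
sims = q_emb @ r_emb.T                      # (n_queries, n_records) cosine matrix

# Rank records per query; questions[i] is assumed to pair with references[i].
ranked = np.argsort(-sims, axis=1)
gold = np.arange(len(questions))
top1 = float(np.mean(ranked[:, 0] == gold))
top5 = float(np.mean([gold[i] in ranked[i, :5] for i in range(len(questions))]))
print(f"Top-1 = {top1:.2%}, Top-5 = {top5:.2%}")
```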
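The RAG generation step of Experiment 2 could look like the following sketch. Only the model name gpt-4o-mini comes from the abstract; the prompt wording, the helper name answer_with_rag, the top-k context choice, and the use of the official OpenAI Python client are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_with_rag(query: str, retrieved_records: list[str]) -> str:
    # Concatenate the top-ranked disease records as grounding context.
    context = "\n\n".join(retrieved_records)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Answer cattle-disease questions using ONLY the provided "
                        "records. If the records are insufficient, say so."},
            {"role": "user", "content": f"Records:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```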
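Finally, the BERTScore comparison between the with-RAG and without-RAG runs can be computed with the reference bert-score package. In this sketch, lang="zh" is an assumption based on the Chinese-language source database, and the two-item lists stand in for the 300 test questions.

```python
from bert_score import score

candidates = ["generated answer 1", "generated answer 2"]  # model outputs
references = ["reference answer 1", "reference answer 2"]  # curated gold answers

# Returns per-example Precision, Recall, and F1 tensors; report the means.
P, R, F1 = score(candidates, references, lang="zh")
print(f"Precision={P.mean().item():.4f}  "
      f"Recall={R.mean().item():.4f}  F1={F1.mean().item():.4f}")
```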

References
  • Achiam, J., S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, and S. Anadkat. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  • Chen, J., H. Lin, X. Han, and L. Sun. 2024. Benchmarking large language models in retrieval-augmented generation. Proc. AAAI Conf. Artif. Intell. 38:17754-17762.
  • Es, S., J. James, L. Espinosa-Anke, and S. Schockaert. 2023. RAGAS: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217.
  • Han, K., A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang. 2021. Transformer in transformer. Adv. Neural Inf. Process. Syst. 34:15908-15919.
  • Hendrycks, D., C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
  • Jeong, M., J. Sohn, M. Sung, and J. Kang. 2024. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics 40(Suppl 1):i119-i129.
  • Lewis, P., E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. T. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv. Neural Inf. Process. Syst. 33:9459-9474.
  • Liu, S., A. B. McCoy, and A. Wright. 2025. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J. Am. Med. Inform. Assoc. ocaf008.
  • Unanue, I. J., J. Parnell, and M. Piccardi. 2021. BERTTune: Fine-tuning neural machine translation with BERTScore. arXiv preprint arXiv:2106.02208.
  • Wang, C., Q. Long, M. Xiao, X. Cai, C. Wu, Z. Meng, and Y. Zhou. 2024. BioRAG: A RAG-LLM framework for biological question reasoning. arXiv preprint arXiv:2408.01107.
  • Zhang, T., V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.