HASLER STIFTUNG project 23073
Using Large Language Models for Text-As-Data Studies in the Social Sciences
This project explores the use of large language models (LLMs), particularly GPT-4, to enhance Text-as-Data (TaD) methods in social science research. TaD approaches, which use machine learning to extract information from digital text for quantitative analysis, often require specialized skills and significant preprocessing effort. Our goal is to simplify this process by drawing on the capabilities of foundation models such as LLMs. We aim to determine whether LLMs can efficiently perform standard natural language processing (NLP) tasks, such as sentiment analysis, and whether they can outperform traditional methods and human coders. The investigation focuses on five prevalent TaD procedures in the social sciences. Our methodology involves collecting text samples from published TaD articles, implementing standard NLP pipelines, and designing scalable prompts for GPT-4. We then compare the performance of traditional TaD procedures, our LLM approach, and human coders. To further evaluate our results, we will conduct online surveys that assess output quality and respondents' perceptions of who performed the task (human or AI). While acknowledging potential risks, we anticipate that LLMs can significantly advance TaD research by making it more accessible, efficient, and accurate, thus opening new avenues for empirical research and broadening the applicability of TaD across social science disciplines.