Enhancing BERT-Based Sentiment Analysis on Tweets about ChatGPT through Data Augmentation [B.Sc.]
Context:
Sentiment analysis on social media, particularly Twitter, faces challenges in accurately capturing sentiments expressed in short-form content. Large language models (LLMs), like BERT, have demonstrated proficiency in understanding contextual information, but the scarcity of labeled data for specific entities, such as ChatGPT, remains a hurdle. This research aims to explore the impact of data augmentation techniques on improving the performance of BERT-based sentiment analysis on Twitter, with a focus on tweets discussing ChatGPT.
Related works:
LLMs like BERT have shown effectiveness in understanding context and classifying sentiment [1][2]. However, current sentiment analysis methods struggle with limited labeled data, especially for web data and data from specific domains. Data augmentation has been shown to be effective for boosting general sentiment analysis performance [3], but how it can enhance LLMs on sparse, short-form, user-generated tweet data remains underexplored. This work focuses on sentiment analysis, emphasizing the integration of data augmentation techniques to mitigate the challenges of data scarcity when working with BERT-based models.
Problem:
The task involves implementing data augmentation techniques to enhance the sentiment analysis performance of BERT on Twitter discussions about ChatGPT. The input is a dataset of labeled tweets related to ChatGPT, and the output is an augmented dataset that serves as training data for BERT. We will then evaluate the performance of BERT fine-tuned on this augmented training data.
Dataset: https://www.kaggle.com/datasets/charunisa/chatgpt-sentiment-analysis/data
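As a first step, a minimal loading-and-inspection sketch is given below; the local file name and the column names 'tweets' and 'labels' are assumptions that should be verified against the downloaded Kaggle CSV.

import pandas as pd

# Load the downloaded Kaggle CSV (file name assumed) and inspect its structure.
df = pd.read_csv("chatgpt_tweets.csv")
print(df.columns.tolist())                 # confirm the actual column names
print(df["labels"].value_counts())         # class balance across positive/negative/neutral
print(df["tweets"].str.len().describe())   # tweet length statistics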
Data augmentation (DA) techniques [3] (a simplified sketch follows the list):
- Synonym Replacement
- Random Insertion
- Random Swap
- Random Deletion
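A simplified re-implementation of these four operations, following the EDA recipe of [3], is sketched below. It assumes NLTK with the WordNet corpus is available (nltk.download('wordnet')); the parameters alpha (fraction of words modified) and p (deletion probability) are illustrative defaults, not values prescribed here.

import random
from nltk.corpus import wordnet

def get_synonyms(word):
    # Collect WordNet synonyms of a word, excluding the word itself.
    syns = {l.name().replace('_', ' ') for s in wordnet.synsets(word) for l in s.lemmas()}
    syns.discard(word)
    return list(syns)

def synonym_replacement(words, n):
    # Replace up to n words that have WordNet synonyms with a random synonym.
    new_words = words.copy()
    candidates = [w for w in set(words) if get_synonyms(w)]
    random.shuffle(candidates)
    for word in candidates[:n]:
        synonym = random.choice(get_synonyms(word))
        new_words = [synonym if w == word else w for w in new_words]
    return new_words

def random_insertion(words, n):
    # Insert synonyms of randomly chosen words at random positions, n times.
    new_words = words.copy()
    for _ in range(n):
        synonyms = get_synonyms(random.choice(words))
        if synonyms:
            new_words.insert(random.randint(0, len(new_words)), random.choice(synonyms))
    return new_words

def random_swap(words, n):
    # Swap the positions of two randomly chosen words, n times.
    new_words = words.copy()
    for _ in range(n):
        if len(new_words) < 2:
            break
        i, j = random.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

def random_deletion(words, p):
    # Delete each word with probability p, keeping at least one word.
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def eda(sentence, alpha=0.1, p=0.1):
    # Apply each operation once; n scales with the sentence length via alpha.
    words = sentence.split()
    if not words:
        return [sentence]
    n = max(1, int(alpha * len(words)))
    return [' '.join(synonym_replacement(words, n)),
            ' '.join(random_insertion(words, n)),
            ' '.join(random_swap(words, n)),
            ' '.join(random_deletion(words, p))]

Calling eda(tweet) on a training tweet yields four augmented variants that can be appended to the training set.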
Tasks:
Task 1: Deploy BERT on the provided dataset to conduct sentiment analysis, categorizing sentiments as positive, negative, or neutral (see the sketch after this task list).
Task 2: Employ various DA techniques individually on the provided dataset.
Task 3: Conduct experiments by combining different DA techniques to evaluate and compare their performance.
Task 4: Test and fine-tune different parameters of the DA techniques, such as the augmentation ratio.
Task 5: Explore potential adaptations of DA techniques specifically tailored for Twitter data.
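As a rough illustration of Tasks 1, 2, and 4, the sketch below augments only the training split using the eda() function from the previous sketch at a configurable augmentation ratio, then fine-tunes BERT with Hugging Face Transformers. The checkpoint 'bert-base-uncased', the label mapping, the column names, and all hyperparameters are assumptions to be tuned in the thesis, not fixed choices.

import pandas as pd
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

df = pd.read_csv("chatgpt_tweets.csv")            # downloaded Kaggle CSV (file name assumed)
label2id = {"bad": 0, "neutral": 1, "good": 2}    # label strings assumed; verify in the data
df["label"] = df["labels"].map(label2id)
df = df.dropna(subset=["tweets", "label"])        # drop rows with missing text or unmapped labels
df["label"] = df["label"].astype(int)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df["label"])

# Tasks 2 and 4: augment only the training split; aug_ratio controls how many
# augmented variants per original tweet are added (up to 4 from eda()).
aug_ratio = 1
aug_rows = []
for _, row in train_df.iterrows():
    for aug_text in eda(row["tweets"])[:aug_ratio]:
        aug_rows.append({"tweets": aug_text, "label": row["label"]})
train_df = pd.concat([train_df[["tweets", "label"]], pd.DataFrame(aug_rows)], ignore_index=True)

# Task 1: fine-tune BERT for three-class sentiment classification.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize(batch):
    return tokenizer(batch["tweets"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_pandas(train_df[["tweets", "label"]]).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df[["tweets", "label"]]).map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
args = TrainingArguments(output_dir="bert-chatgpt-sentiment",
                         per_device_train_batch_size=16,
                         num_train_epochs=3,
                         learning_rate=2e-5)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=test_ds)
trainer.train()
print(trainer.evaluate())   # reports test-set loss; add compute_metrics (accuracy, F1) for comparisons

For Task 3, the same pipeline can be rerun with different subsets and combinations of the DA operations and different aug_ratio values while keeping the test split fixed, so results stay comparable. For Task 5, Twitter-specific adaptations such as protecting @mentions, hashtags, URLs, and emojis from replacement or deletion are natural candidates to explore.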
Prerequisites:
Proficiency in at least one programming language for data preprocessing
Basic understanding of large language model concepts
Referenced:
[1] Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, Jianfeng Gao: Deep Learning-based Text Classification: A Comprehensive Review. ACM Comput. Surv. 54(3): 62:1-62:40 (2022)
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT (1) 2019: 4171-4186
[3] Jason W. Wei, Kai Zou: EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. EMNLP/IJCNLP (1) 2019: 6381-6387
Advisor and Contact:
Binger Chen <chen@tu-berlin.de> (TU-Berlin)
Prof. Dr. Ziawasch Abedjan <abedjan@dbs.uni-hannover.de> (LUH)