Predict research trends in data science based on the arXiv repository [MSc]

Context:

Research topics change constantly as researchers innovate and real-world demands grow, especially in the booming field of data science, which is one of the largest communities in computer science. Because data transmission has become more convenient and data storage cheaper, the amount of data on the network has surged in recent years, which opens up great prospects for research on big data analysis. A large number of data scientists around the world publish thousands of high-quality papers in data science each year. The goal of this work is to predict research trends and discover potential research topics in data science, which would assist researchers and promote scientific development. This is a significant and challenging data analysis task. arXiv is an open-access repository of electronic preprints and postprints in multiple fields and contains over 2 million scientific papers. Mining knowledge from the arXiv repository is a promising way to accomplish this task.

Related work:

There is no existing work focusing on research trend prediction in the data science field, and existing work does not address the possible impact of differences between fields. In addition, most existing work uses only small parts of the paper text from arXiv, such as the abstract [1][2] or the title [3], and therefore overlooks large amounts of valuable information in the data. Another limitation is that existing approaches often rely on manual labeling and use naive metrics such as citation counts to define the value of a topic.

Problem / Task:

The task is to mine a large open-access repository of scientific work (e.g., arXiv) to predict research trends and suggest potential topics for researchers, especially in the data science field. First, we want to collect the related papers and build a dataset. Then, we want to use existing natural language processing methods to analyze the content of the papers in the dataset and discover their topics. Based on these topics, the papers will be further classified with machine learning or deep learning techniques. Finally, by combining the topics with other features extracted from the papers and their metadata, we can analyze how the topics evolve and predict research trends. During this process, we aim to answer all or some of the following questions:

-Which research topics have become more popular and which have become less popular? Which topics attract attention consistently over time?

-How rapidly do research trends change in data science? What is the best time scale for analyzing research trends?

-Do research trends in data science differ from those in other fields? If so, in what ways? Can a general solution be applied directly to the data science papers in the repository and still work well? Is there room for domain adaptation?

-Is it possible to suggest new potential topics for researchers by mining the existing research papers?

-Are there more effective metrics to define the value of a topic?
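
As a rough illustration of the first question above, the following minimal Python sketch labels topics as rising, declining, or stable. It assumes that every paper has already been assigned a publication year and a single topic label (in the project, these would come from the topic discovery and classification steps). The function names, the toy data, and the threshold of 0.01 are hypothetical choices made for illustration only. The least-squares slope of a topic's yearly share is one simple baseline, not the method the thesis is expected to adopt.

```python
from collections import Counter, defaultdict


def topic_shares(papers):
    """Return {topic: {year: fraction of that year's papers on the topic}}.

    `papers` is a list of (year, topic) tuples (an assumed input format).
    """
    per_year = Counter(year for year, _ in papers)
    per_year_topic = Counter(papers)  # keys are (year, topic) tuples
    shares = defaultdict(dict)
    for (year, topic), count in per_year_topic.items():
        shares[topic][year] = count / per_year[year]
    # A topic with no papers in some year has share 0 for that year.
    for topic in shares:
        for year in per_year:
            shares[topic].setdefault(year, 0.0)
    return dict(shares)


def slope(series):
    """Ordinary least-squares slope of share against year."""
    years = sorted(series)
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(series[y] for y in years) / n
    var_x = sum((y - mean_x) ** 2 for y in years)
    if var_x == 0:  # only one year of data: no trend can be estimated
        return 0.0
    cov = sum((y - mean_x) * (series[y] - mean_y) for y in years)
    return cov / var_x


def classify_trends(papers, threshold=0.01):
    """Label each topic by the sign and size of its share slope.

    `threshold` (change in share per year) is an arbitrary illustrative value.
    """
    labels = {}
    for topic, series in topic_shares(papers).items():
        s = slope(series)
        if s > threshold:
            labels[topic] = "rising"
        elif s < -threshold:
            labels[topic] = "declining"
        else:
            labels[topic] = "stable"
    return labels


# Toy data (invented for illustration): ten papers per year, three topics.
toy_papers = (
    [(2019, "topic_a")] * 6 + [(2019, "topic_b")] * 2 + [(2019, "topic_c")] * 2
    + [(2020, "topic_a")] * 4 + [(2020, "topic_b")] * 4 + [(2020, "topic_c")] * 2
    + [(2021, "topic_a")] * 2 + [(2021, "topic_b")] * 6 + [(2021, "topic_c")] * 2
)
labels = classify_trends(toy_papers)
print(labels)
```

The sketch uses shares rather than raw counts so that the overall growth of the repository does not make every topic look as if it were rising. Grouping by year is itself an assumption; the time-scale question above could be explored by grouping by month or quarter instead.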

Prerequisites:

Experience and interest in programming
Experience in machine learning
Interest in big data mining
Interest in natural language processing

References:

[1] Vinodkumar Prabhakaran, William L. Hamilton, Daniel A. McFarland, Dan Jurafsky: Predicting the Rise and Fall of Scientific Topics from Trends in their Rhetorical Framing. ACL (1) 2016

[2] Steffen Eger, Chao Li, Florian Netzer, Iryna Gurevych: Predicting Research Trends From Arxiv. CoRR abs/1903.02831 (2019)

[3] Chengyao Chen, Zhitao Wang, Wenjie Li, Xu Sun: Modeling Scientific Influence for Research Trending Topic Prediction. AAAI 2018: 2111-2118

Advisor and Contact:

Binger Chen <chen@dbs.uni-hannover.de> (TU Berlin)

Prof. Dr. Ziawasch Abedjan <abedjan@dbs.uni-hannover.de> (Leibniz Universität Hannover)