Maintenance of Large Inverted Indexes [B.Sc.]

Context and Problem: 

During the last decade, enterprises, governments, academia, and open data organizations have organized large corpora of tables and made them publicly available. These pools of tables are created from different sources, ranging from crawls of the entire web [4] to collections of nationwide government data [5, 6]. These rich sources of data enable data scientists to discover tables that are related to the datasets at hand and integrate them for downstream machine learning tasks.

To enable an efficient data discovery pipeline, it is vital to store and maintain these large corpora appropriately. Therefore, researchers have introduced customized indexes that support the tasks at hand.

An inverted index [1, 2] is a data structure that maps data elements to the containers in which they appear, such as documents, tables, columns, or rows. It serves to efficiently retrieve the tables and data fragments that contain a specific piece of information.
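To make this concrete, the following is a minimal Python sketch of such an index; the table contents, names, and layout are purely illustrative and not taken from the systems in [1, 2]. It maps each cell value to the set of (table, column) containers in which the value occurs:

    from collections import defaultdict

    # Toy corpus: each table is a dict mapping column names to cell values.
    # Table IDs, column names, and values here are hypothetical examples.
    tables = {
        "t1": {"city": ["Berlin", "Hannover"], "country": ["Germany", "Germany"]},
        "t2": {"capital": ["Berlin", "Paris"]},
    }

    # Inverted index: cell value -> set of (table_id, column_name) containers.
    inverted_index = defaultdict(set)
    for table_id, columns in tables.items():
        for column_name, cells in columns.items():
            for value in cells:
                inverted_index[value].add((table_id, column_name))

    # Retrieval: all columns that contain the value "Berlin".
    print(inverted_index["Berlin"])  # {('t1', 'city'), ('t2', 'capital')}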

However, the inverted index can grow very large and thus might not fit in memory. Therefore, the inverted index for a large corpus must reside on disk. As a result, data discovery operations constantly need to read large chunks of data from disk, which creates considerable I/O overhead and affects the overall response time of data discovery.

The task of this thesis is to investigate storage-efficient replacements for the inverted index. The student will implement and compare several lightweight compression techniques with the aim of finding a scalable representation of inverted indexes, so that even the index for large corpora, such as Webtable [4], can reside in main memory to achieve low response times.

The student is expected to read the papers on state-of-the-art inverted indexes [1, 2] and implement one of them (or adapt and run the available source code). In addition, the student is expected to implement lightweight compression methods [3] on the inverted index for a large pool of tables and compare them with respect to space consumption, query execution time, and compression ratio.
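As a first point of reference, the sketch below shows one classic lightweight combination, delta coding followed by variable-byte (null-suppression) encoding of a sorted posting list. It is an assumed starting point for the thesis, not the exact set of schemes benchmarked in [3]:

    def varbyte_encode(numbers):
        """Variable-byte (null-suppression) encoding of non-negative ints:
        7 data bits per byte, high bit set on the final byte of each number."""
        out = bytearray()
        for n in numbers:
            chunk = []
            while True:
                chunk.append(n & 0x7F)
                n >>= 7
                if n == 0:
                    break
            chunk.reverse()
            chunk[-1] |= 0x80  # mark the last byte of this number
            out.extend(chunk)
        return bytes(out)

    def varbyte_decode(data):
        numbers, n = [], 0
        for byte in data:
            if byte & 0x80:  # final byte of a number
                numbers.append((n << 7) | (byte & 0x7F))
                n = 0
            else:
                n = (n << 7) | byte
        return numbers

    # Posting lists are sorted, so storing the gaps (deltas) between
    # consecutive IDs yields small integers that variable-byte encoding
    # compresses well; decoding reverses both steps via a prefix sum.
    def compress_postings(postings):
        deltas = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
        return varbyte_encode(deltas)

    def decompress_postings(data):
        postings, total = [], 0
        for delta in varbyte_decode(data):
            total += delta
            postings.append(total)
        return postings

    postings = [3, 7, 21, 150, 152]  # hypothetical sorted table IDs
    blob = compress_postings(postings)
    assert decompress_postings(blob) == postings
    print(f"{len(postings) * 4} bytes raw vs. {len(blob)} bytes compressed")

Because posting lists are sorted, most gaps between consecutive IDs fit into a single byte, which is the effect the planned compression-ratio measurements should quantify on real corpora.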

This thesis allows the student to work with real-world corpora containing hundreds of millions of tables. It will prepare the student to approach data science problems systematically and address them accordingly.

Prerequisites:

  • Strong fundamentals in database concepts.
  • Strong algorithmic background and familiarity with complexity theory.
  • Programming experience in Python, Scala, or Java.
  • Interest in data integration and the courage to work with large-scale data.

Related Work:

[1] Abedjan, Ziawasch, et al. "Dataxformer: A robust transformation discovery system." 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 2016.

[2] Zhu, Erkang, et al. "Josie: Overlap set similarity search for finding joinable tables in data lakes." Proceedings of the 2019 International Conference on Management of Data. 2019.

[3] Damme, Patrick, et al. "Lightweight Data Compression Algorithms: An Experimental Survey (Experiments and Analyses)." EDBT. 2017.

[4] Dresden Web Table Corpus: wwwdb.inf.tu-dresden.de/misc/dwtc/

[5] data.gov

[6] www.govdata.de

Advisor and Contact:

Mahdi Esmailoghli <esmailoghli@dbs.uni-hannover.de> (LUH)

Prof. Dr. Ziawasch Abedjan <abedjan@dbs.uni-hannover.de> (LUH)