Three Contributions of the LUH DBS Group at BTW 2023

The LUH DBS Group presented three contributions at BTW 2023:

1- A team from our DBS group, in collaboration with Prof. Dr.-Ing. David Bermbach, won first prize in the BTW Data Science Challenge 2023.

This team consists of Dakai Men, Jannis Becktepe, Mahdi Esmailoghli, David Bermbach, and Ziawasch Abedjan.

Abstract: Cycling is a healthy and eco-friendly means of transportation. Many cities provide dedicated bicycle lanes and have set up bicycle-friendly traffic rules. However, a cyclist participating in urban traffic must consider various factors when choosing the most convenient route. We propose a transfer learning-based route recommendation system that selects routes similar to those chosen by local cyclists by predicting the potential usage of each candidate route.

 

2- Maximilian Koch presented the research paper "Duplicate Table Detection with Xash" by Maximilian Koch, Mahdi Esmailoghli, and Ziawasch Abedjan at BTW 2023.

Abstract: Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. 
Comparing tables is generally quite expensive, as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, to the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which acts like a Bloom filter and quickly indicates whether particular cell values are present. We show that using Xash, it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.

Paper: Link
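
To make the super-key idea from the abstract above more concrete, here is a minimal Python sketch. It does not implement Xash itself: the hash (SHA-256 reduced to one bit position), the 64-bit key width, and all function names are assumptions chosen for illustration. It only shows how a Bloom-filter-like key per row can make a table comparison independent of row and column order.

```python
import hashlib
from collections import Counter

BITS = 64  # width of the illustrative super key (an assumption, not Xash's layout)


def cell_bit(value) -> int:
    """Map a cell value to a single bit position via a stable hash."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest[:8], "big") % BITS)


def row_super_key(row) -> int:
    """OR the bits of all cells in a row; the result ignores column order."""
    key = 0
    for value in row:
        key |= cell_bit(value)
    return key


def may_contain(super_key: int, value) -> bool:
    """Bloom-filter-style test: False is definite, True may be a false positive."""
    return (super_key & cell_bit(value)) != 0


def table_fingerprint(rows) -> Counter:
    """Multiset of row super keys; the result ignores row order as well."""
    return Counter(row_super_key(row) for row in rows)


def duplicate_candidates(table_a, table_b) -> bool:
    """Unequal fingerprints rule out duplicates; equal ones still need an exact check."""
    return table_fingerprint(table_a) == table_fingerprint(table_b)


t1 = [("alice", 30, "Hannover"), ("bob", 25, "Berlin")]
t2 = [("Berlin", "bob", 25), ("Hannover", "alice", 30)]  # columns and rows permuted
print(duplicate_candidates(t1, t2))  # True
```

Because equal fingerprints can also arise from hash collisions, a pair flagged this way is only a candidate and would still be verified by an exact comparison. The abstract's claim is that Xash keeps the number of such false positive candidates lower than competing hash functions.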

 

3- Ziawasch Abedjan presented "Enforcing Constraints for Machine Learning Pipelines" at the ML for Systems/Systems for ML Workshop at BTW 2023.

Abstract: Responsible usage of Machine Learning (ML) systems in practice requires not only enforcing high prediction quality, but also accounting for other constraints, such as fairness, privacy, or execution time. Typically, these types of constraints are tackled through multi-objective functions and dedicated models.
In this talk, I present our ideas on how to leverage the step of feature selection to support constraints. We propose Declarative Feature Selection (DFS) to simplify the design and validation of ML systems satisfying diverse user-specified constraints. We benchmark and evaluate a representative series of feature selection algorithms. From our extensive experimental results, we derive concrete suggestions on when to use which strategy and show that a meta-learning-driven optimizer can accurately predict the right strategy for an ML task at hand.
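
As a rough illustration of declaring constraints and letting feature selection satisfy them, consider the following toy Python sketch. It is not the DFS system from the talk: the feature statistics are made-up numbers, `evaluate` is a hypothetical stand-in for training and measuring a real model, and exhaustive search replaces the feature selection algorithms that the talk benchmarks.

```python
from itertools import combinations

# Hypothetical per-feature statistics for a toy task (made-up numbers).
FEATURES = {
    "age":       {"gain": 0.10, "cost_ms": 1.0, "sensitive": True},
    "income":    {"gain": 0.20, "cost_ms": 2.0, "sensitive": False},
    "zip_code":  {"gain": 0.05, "cost_ms": 1.0, "sensitive": True},
    "purchases": {"gain": 0.15, "cost_ms": 4.0, "sensitive": False},
}


def evaluate(subset):
    """Toy stand-in for training a model on `subset` and measuring it."""
    return {
        "accuracy": 0.5 + sum(FEATURES[f]["gain"] for f in subset),
        "latency_ms": sum(FEATURES[f]["cost_ms"] for f in subset),
        "sensitive_features": sum(FEATURES[f]["sensitive"] for f in subset),
    }


def select_features(constraints):
    """Search all subsets; return the most accurate one meeting every constraint."""
    best, best_metrics = None, None
    names = sorted(FEATURES)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            metrics = evaluate(subset)
            if all(check(metrics) for check in constraints.values()):
                if best is None or metrics["accuracy"] > best_metrics["accuracy"]:
                    best, best_metrics = subset, metrics
    return best, best_metrics


# Constraints are stated declaratively as named predicates over the metrics.
constraints = {
    "fast enough": lambda m: m["latency_ms"] <= 5.0,
    "no sensitive inputs": lambda m: m["sensitive_features"] == 0,
}

best, metrics = select_features(constraints)
print(best, metrics["latency_ms"])  # ('income',) 2.0
```

Exhaustive search is only feasible for a handful of features. Choosing a suitable selection strategy for realistic tasks is the problem that the meta-learning-driven optimizer mentioned in the abstract addresses.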

Written by Ziawasch Abedjan