Algorithms, especially those used in machine learning, give a chance to make better decisions than those made by humans. In many cases, even conceptually simple algorithms can provide better quality answers than humans, e.g. by making fewer errors.
However, this promise is far from being fulfilled to a satisfactory extent at the current level of algorithm development. The main problem is that the algorithms do not explain their decisions, and it is not clear to the user why the answer proposed by the algorithm is correct. Even if we try to recreate the reasons why a given decision was made by an algorithm, these reasons may be incomprehensible to humans. This problem can be best illustrated with the example of Deep Blue, a chess-playing computer created by IBM. A person has no chance to understand his “reasoning” for a specific move.
We can argue endlessly about whether Deep Blue actually understands the game of chess or not. However, this problem becomes a key issue when implementing algorithmic solutions in the world of real applications. We need to ensure that algorithms are understandable to the people who use them. Otherwise, users will refuse to follow the decisions made by the algorithms. Surprisingly, there are currently no good tools to provide explanations for even the most basic algorithms. For example, some people often doubt whether GPS has found the fastest route and prefer to make alternative choices themselves.
This is despite the fact that finding the shortest path to arrival is a well-studied problem and we can prove that the algorithm works correctly. The problem is to explain to the user why the solution proposed by the algorithm is the best, rather than how the algorithm itself works. Moreover, the algorithm can be used by someone who does not have the appropriate algorithmic knowledge, so explaining the reasoning becomes even more important.
The good news is: you don’t have to study algorithms for years to see that they work the right way. For example, one of the main reasons Google Maps shows alternative routes is to explain to the user why the proposed solution is the best. Algorithms are becoming more common in our lives due to increasing digitalization, the pace of which has been further accelerated by the COVID-19 pandemic. The need to explain their results therefore becomes more pressing. The work of our research group is aimed at providing the concept of tools that would enable the explanation of the most basic optimization problems, e.g. the association problem. They are a key challenge in operations research and are used to describe the alignment of jobs with employees, jobs with servers, page views with ads, residents with hospitals, students with universities, or homes with tenants.
Artificial intelligence and algorithms are two areas of computer science that have developed separately for many years. Only recently have we observed their increasingly frequent interpenetration, which favors the development of new tools in both fields. On the one hand, we are witnessing the emergence of ML algorithm tools that have been fully analyzed from a theoretical point of view, e.g. approximate nearest neighbor search (NNS). On the other hand, ML tools enter basic data structures and state-of-the-art approximation algorithms as internal components, resulting in solutions with better practical properties, e.g. indexes. These new hybrid structures are called Learned Data Structures (LDS).
Since work on these ideas has only just begun, we lack the appropriate tools to implement cutting-edge solutions. Thus, research on new tools and models is hampered. This problem could be solved by creating new algorithms and data structures along with a software library that would allow different blocks to be easily composed. In this way, it can be proven that tools exist to bridge the gap between theory and practice in algorithms and show that new theoretical developments can have real applications.
Algorithmic for data science and ML
Data analysis is becoming the driving force behind technological development. However, in order to draw meaningful conclusions from these ever-growing sets of information, we need to design effective algorithms. Our project aims to solve fundamental problems arising in the context of processing huge data sets. The long-term goal is to develop techniques for designing very fast and possibly scalable algorithms.
Graphs are the most common way of capturing structures and relationships in data, which is why they are often used in data science. The most famous examples are social network analysis, e.g. community detection or cascading processes. Graphs are also often used to represent knowledge, e.g. using knowledge graphs. This approach was visible in the transition from a relational to a graph perspective in databases. These and other examples show that graphs and graph algorithms are essential tools for representing and analyzing data. However, the size of current data sets requires parallel computing because even storing them on a single machine is often impossible. For example, the World Wide Web consists of over 5 billion websites, Google’s knowledge graph contains 800 billion facts about 8 billion entities, while Facebook has over 2 billion daily active users.
As you can see, data is usually scattered across multiple computers and calculations must be performed in parallel. Real-world systems are modeled within the Massively Parallel Computation (MPC) model, in libraries such as MapReduce, Hadoop, Spark or Flume. Although various models of parallel computing have been studied for years, MPC offers completely new possibilities and requirements. MPC computations are performed in synchronous rounds, but implementing these rounds in real systems takes a long time. One round lasts orders of magnitude longer than in the classic Cray system. Therefore, we would like to solve problems, especially those related to graphs, in as few rounds as possible. We strive to design methods to overcome complexities that could not be overcome using classical techniques and models. We will focus on new algorithmic tools that will improve the performance of both parallel and non-parallel algorithms used in data science.