# Data Science

Faculty of Mathematics and Computer Science

# Courses offered in 2023/2024, semester I

Last updated: 02.10.2023

INTRODUCTORY MEETING: 2023, October 2 (Monday), at 13:30 in room 119 (Institute of Computer Science)

INFORMATION FOR DATA SCIENCE STUDENTS (Version: 28.09.2023): 2023_information_for_students.pdf

• Courses with titles in English are offered in English. Courses with Polish titles _may_ be offered in English upon request.

Tutoring: Paweł Rychlikowski


# Mandatory Core Courses

These three core courses are mandatory for all students. Their role is to give a basic toolbox for future data scientists and provide solid mathematical foundations that enable you to take more advanced and applied courses. We expect students to take them in the first two semesters.

# Numerical Optimization

This course is a detailed survey of optimization from both a computational and a theoretical perspective. Theoretical topics include convex sets, convex functions, optimization problems, least squares, linear and quadratic programs, optimality conditions, and duality theory. Special emphasis is put on scalable numerical methods for analyzing and solving linear programs (e.g. simplex), general smooth unconstrained problems (e.g. first-order and second-order methods), quadratic programs (e.g. linear least squares), general smooth constrained problems (e.g. interior-point methods), as well as a family of non-smooth problems (e.g. ADMM). Applications in data science, such as machine learning, model fitting, and image processing, will be discussed. The computational part covers the following algorithms: the gradient method, quasi-Newton methods, the proximal gradient method, Nesterov’s accelerated gradient method, the augmented Lagrangian method, the alternating direction method of multipliers, block coordinate descent, and stochastic gradient descent. Students complete hands-on exercises using high-level numerical software.

# Machine Learning

This course provides the fundamentals of designing programs that implement data-driven, rather than hand-implemented, behavior. The course offers a gentle introduction to the topic, but strives to provide enough details and intuitions to explain state-of-the-art ML approaches: ensembles of Decision Trees (Boosted Trees, Random Forests) and Neural Networks. Starting with simple linear and Bayesian models, we proceed to learn the concepts of trainable models, selecting the best model based on data, practical and theoretical ways of estimating model performance on new data, and the difference between discriminative and generative training. The course introduces mainstream algorithms for classification and regression, including linear models, Naive Bayes, trees, ensembles, and matrix factorizations for recommendation systems. Practical sessions provide hands-on experience with the methods.

# Statistical Learning

This course is mainly devoted to the analysis of "fat" data sets with a large number of variables. In this situation, effective analysis requires techniques of dimensionality reduction. We discuss classical and modern methods of dimensionality reduction in the context of supervised and unsupervised learning. Specifically, we consider principal component analysis, subspace clustering and Gaussian graphical models (unsupervised learning) and different penalized methods for building predictive models (supervised learning), including ridge regression, LASSO and SLOPE. The emphasis is placed on understanding the statistical properties of the discussed methodology through theoretical results, simulation studies and analysis of real data.

# Elective Core Courses

To gain more specialized knowledge in different areas of data science, students have to take at least four of the following fundamental elective courses. Each course is taught at least once every two years (the semester of the next edition is given in the course description). Topics are subject to slight changes and updates, which reflect the evolution of the data science field and varying requirements from the job market.

# Methods of classification and dimensionality reduction

The course provides a survey of dimensionality reduction (feature extraction) and classification methods. Dimensionality reduction enhances the performance of computer vision and machine learning-based approaches, represents the data more efficiently, and makes it possible to visualise high-dimensional data. Among others, we study principal component analysis (PCA), non-negative matrix factorization (NMF), independent component analysis (ICA), and t-distributed Stochastic Neighbour Embedding (t-SNE). Concerning classification methods, we study many classical "shallow-learning" classifiers, e.g., nearest neighbours, naive Bayes, support vector machines (SVM), linear and quadratic discriminant analysis (LDA and QDA), and decision trees. Though all details are provided for most methods, we put a strong emphasis on intuition and practical applications. We discuss various practical problems and apply the acquired knowledge to them in lab classes, e.g., classification of multidimensional data (including time series, images and texts), image compression, topic recovery, and recommendation systems.

# Simulations and algorithmic applications of Markov chains

The course is devoted to discrete-time Markov chains with finite state space. We start gently with the fundamentals (stationary distributions, transition-matrix-based simulations, reversibility) and go through Markov chain Monte Carlo methods (MCMC, a class of algorithms providing one of the currently most popular methods for simulating complicated stochastic systems); rate of convergence methods ("how many times should we shuffle a deck of cards?": we study coupling methods, strong stationary times, strong stationary duality, and the Cheeger and Poincaré inequalities for bounding the second-largest eigenvalue of a transition matrix); the coupling from the past (CFTP) algorithm (an improvement on standard MCMC that yields an unbiased sample from a given distribution on a huge state space, e.g., the Ising model); estimating winning probabilities in gambler's ruin-like problems (first step analysis and Siegmund duality); simulated annealing (a widely used randomized algorithm for various optimization problems); basics of hidden Markov models (HMM, a popular machine learning model, e.g., for speech recognition); and randomized polynomial-time approximation schemes (MCMC-based algorithms for approximating "the answer" to NP-hard-related problems, e.g., graph coloring).

# Natural language processing

The aim of the course is to discuss the methods used in the analysis and processing of texts in natural languages, with particular emphasis on results that can be translated into effective implementations. We consider both classical methods of language modelling (Hidden Markov Models, (Probabilistic) Context Free Grammars, Finite State Transducers) and modern, neural-network-based approaches: RNN, LSTM, Convolutional Neural Networks and the Transformer. We show several applications of these methods, including POS-tagging, dependency parsing, Named Entity Recognition, Machine Translation, and Natural Language Generation.

# Text mining

In this course, we cover basic and more advanced information retrieval techniques. We also discuss data mining methods applied to texts. We discuss how to implement, from scratch, efficient systems that gather information from large text corpora. We analyze in detail several variants of word embeddings and their use in computing text similarity. We discuss text classification methods, flat and hierarchical clustering, automatic summarisation, text comprehension methods and question answering.

This course focuses on advanced data mining algorithms for processing big, complex and unstructured data. It mainly concerns recommendation systems, dimensionality reduction with neighborhood embedding, temporal data mining and decision support systems. In recommendation systems, various approaches, from simple collaborative filtering to advanced matrix factorization, are presented and discussed in the context of their practical relevance, concerning not only the popular MSE or MAE measures, but also the coverage, diversity, and novelty of recommendations. In temporal data mining, besides the analysis of regular time series with machine learning methods such as Support Vector Regression and Neural Networks, unstructured temporal data are studied. Student projects concern unstructured datasets, such as irregular multidimensional time series, GPS tracks or medical images.

# Tools and methods in big data processing

This course covers the technical background useful in processing large amounts of data in a distributed environment. We discuss cloud computing basics and then introduce the Hadoop Distributed File System (HDFS) architecture and the MapReduce programming paradigm. The course studies Apache Spark and its ecosystem in depth, including Spark programming with Scala, Spark SQL, the Spark graph processing framework GraphX, as well as TinkerPop and Gremlin traversals. Finally, some time is spent on stream processing technologies such as Apache Kafka and Spark Streaming.

# Theory of Analysis of Large Data Sets

In this course we will present the mathematical theory that explains statistical problems in the analysis of high-dimensional data. In the first part of the lecture we will concentrate on the generic problem of estimating the vector of means of a multivariate normal distribution. We will discuss testing the global hypothesis that this vector is equal to zero and show a variety of methods which are optimal under different scenarios concerning the sparsity of this vector. We will also discuss the detectability curve, which provides the relationship between the sparsity and the magnitude of the elements of the vector of means, so that the signal can be identified with testing procedures. The course will also cover different procedures for multiple testing, Stein's identity and the James-Stein estimator of the vector of means. Finally, we will discuss the topic of selective inference and the application of the above ideas in the context of high-dimensional multiple regression.

# Neural Networks

The Neural Networks course provides an in-depth understanding of Artificial Neural Networks and Deep Learning methods, from both a theoretical and a practical standpoint. The course will cover neural network fundamentals and provide details on the most commonly used models: convolutional architectures used for image processing, recurrent networks for sequential data, and attention-based models used in language processing. We will also study modern data generation methods, such as variational auto-encoders and generative adversarial networks, concentrating on learning features that are useful in downstream data processing tasks. Accompanying lab sessions will provide hands-on experience with the material.

# Numerical programming tools and methods

Numerical programming tools and methods is an introduction to programming languages and frameworks used in data science and deep learning. Concentrating on the Python ecosystem, the course teaches NumPy internals, data wrangling with pandas, and efficient usage of hardware accelerators through deep learning frameworks. Techniques are illustrated using practical projects that involve physical simulations, digital signal processing, real-world data analysis and modeling using machine learning.

# Team Project: TBA

# Elective Courses

Students may enrich their competences by taking additional courses that are not directly related to data science but could give them a competitive edge in their future careers, providing them, for example, with unique skills in optimization or algorithms. While students may freely choose any master-level courses taught at the department, the list below contains courses that are particularly suited to data science students.

• Algorithmic game theory. Jarosław Byrka

• Algorithms on strings. Paweł Gawrychowski

• Artificial intelligence in games. Jan Kowalski

• Approximation algorithms. Katarzyna Paluch

• Category theory. Maciej Piróg

• Combinatorial optimization. Katarzyna Paluch

• Combinatorics. Grzegorz Stachowiak

• Computational complexity. Krzysztof Loryś

• Computational geometry. Paweł Gawrychowski

• Cryptography. Grzegorz Stachowiak

• Data compression. Tomasz Jurdziński

• Distributed algorithms. Tomasz Jurdziński

• Interactive theorem proving in Coq. Małgorzata Biernacka and Filip Sieczkowski

• Introduction to simulations and Monte Carlo methods. Tomasz Rolski, Paweł Lorek

• Online algorithms. Marcin Bieńkowski

• Photorealistic computer graphics. Andrzej Łukaszewski

• Program analysis. Witold Charatonik

• Randomized algorithms. Marek Piotrów

• Semantics of programming languages. Dariusz Biernacki and Filip Sieczkowski

• Theory of linear and integer programming. Jarosław Byrka

• UNIX kernel structure. Krystian Bacławski

• Verification of programs. Małgorzata Biernacka and Witold Charatonik

• Word equations. Artur Jeż

# Contact

## Department of Mathematics and Computer Science, University of Wrocław.

ul. Joliot-Curie 15

50-384 Wrocław

Program related questions: datascience@uwr.edu.pl