Data Science
Faculty of Mathematics and Computer Science
Courses offered in 2024/2025, semester I
Last updated: 01.10.2024
INTRODUCTORY MEETING: 2024, October 1 (Tuesday), at 11:00 in room 140 (Institute of Computer Science)
INFORMATION FOR DATA SCIENCE STUDENTS:
2024_information_for_students.pdf
Mandatory Courses:
- Machine Learning (ECTS: 6), Marek Adamczyk
Elective Core Courses:
Mathematical Institute:
- Linear models (Modele liniowe) (ECTS: 6), Liudmyla Zaitseva
- Numerical methods (ECTS: 8), Michael Hecht
- Theoretical foundations of the analysis of large data sets (ECTS: 6), Liudmyla Zaitseva
- Simulations and algorithmic applications of Markov chains (Symulacje i algorytmiczne zastosowania łańcuchów Markowa) (ECTS: 6), Paweł Lorek
- Introduction to simulations and Monte Carlo methods (Wprowadzenie do symulacji i metod Monte Carlo) (ECTS: 6), Paweł Lorek
Institute of Computer Science:
- Evolutionary algorithms (Algorytmy ewolucyjne) (ECTS: 6), Piotr Wnuk-Lipiński
- Deep Learning laboratory (Laboratorium Deep Learningu) (ECTS: 6), Marek Adamczyk
- Language models (Modele językowe) (ECTS: 6), Paweł Rychlikowski
Elective Courses:
Mathematical Institute:
- ALL lectures available at "Plan zajęć" (schedule) which are of type M (Advanced courses) belong to Elective Courses. Some of the most relevant:
- Statistics (Statystyka) (ECTS: 7), Grzegorz Wyłupek
- Time series (Szeregi czasowe) (ECTS: 6), Wojciech Cygan
- (*) E-learning: Bayesian data analysis (ECTS: 3) (Hasselt)
- (*) E-learning: Computer intensive methods (ECTS: 3) (Hasselt)
- (*) E-learning: Semiparametric regression (ECTS: 4), Jarosław Harezlak
- REMARK (*): a) Only students who meet the entry requirements can enrol; b) These courses are recommended only to second-year DS students; c) Registration for all e-learning courses is open for a very short time slot (the first 1-3 days of classes).
Institute of Computer Science:
- ALL lectures available at "Zapisy" (enrolment) which are of type I.2 or K2 belong to Elective Courses.
Seminars:
Mathematical Institute:
- Graphical models and causal inference (Modele grafowe i wnioskowanie przyczynowe) (ECTS: 4), Małgorzata Bogdan
Institute of Computer Science:
- Seminar: Graph Neural Networks and Applications (ECTS: 3), Piotr Wnuk-Lipiński
- Seminar: Generative AI (ECTS: 3), Rafał Nowak
- Seminar: Explainable AI for board games (Seminarium: Explainable AI dla gier planszowych) (ECTS: 3), Jakub Kowalski
Mandatory Core Courses
These three core courses are mandatory for all students. Their role is to give a basic toolbox for future data scientists and provide solid mathematical foundations that enable you to take more advanced and applied courses. We expect students to take them in the first two semesters.
Numerical Optimization
This course is a detailed survey of optimization from both a computational and a theoretical perspective. Theoretical topics include convex sets, convex functions, optimization problems, least-squares, linear and quadratic programs, optimality conditions, and duality theory. Special emphasis is put on scalable numerical methods for analyzing and solving linear programs (e.g. simplex), general smooth unconstrained problems (e.g. first-order and second-order methods), quadratic programs (e.g. linear least squares), general smooth constrained problems (e.g. interior-point methods), as well as a family of non-smooth problems (e.g. ADMM). Applications in data science, such as machine learning, model fitting, and image processing, will be discussed. The computational part covers the following algorithms: gradient method, quasi-Newton methods, proximal gradient method, Nesterov's accelerated gradient method, augmented Lagrangian method, alternating direction method of multipliers, block coordinate descent method and stochastic gradient descent method. Students complete hands-on exercises using high-level numerical software.
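As an illustration of the gradient method applied to a least-squares problem, here is a minimal sketch in plain Python. It is not course material: the function name, the step size, the iteration count and the toy data are all assumptions chosen so that the iteration converges on this particular example.

```python
def gradient_descent_least_squares(xs, ys, step=0.05, iterations=5000):
    """Fit y ~ a*x + b by minimising the mean squared error with the plain gradient method.

    The step size and iteration count are illustrative assumptions, not tuned values.
    """
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Residuals of the current fit.
        residuals = [a * x + b - y for x, y in zip(xs, ys)]
        # Gradient of (1/n) * sum(residual^2) with respect to a and b.
        grad_a = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_b = 2.0 / n * sum(residuals)
        # Move against the gradient.
        a -= step * grad_a
        b -= step * grad_b
    return a, b


# Toy data generated from y = 2x + 1 (hypothetical example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = gradient_descent_least_squares(xs, ys)
```

On this noise-free data the iterates approach the exact coefficients a = 2 and b = 1. A step size that is too large for the data would make the iteration diverge.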
Machine Learning
This course provides the fundamentals of designing programs whose behavior is data-driven rather than hand-implemented. The course provides a gentle introduction to the topic, but strives to provide enough details and intuitions to explain state-of-the-art ML approaches: ensembles of Decision Trees (Boosted Trees, Random Forests) and Neural Networks. Starting with simple linear and Bayesian models, we proceed to learn the concepts of trainable models, selecting the best model based on data, practical and theoretical ways of estimating model performance on new data, and the difference between discriminative and generative training. The course introduces mainstream algorithms for classification and regression including linear models, Naive Bayes, trees, ensembles, and matrix factorizations for recommendation systems. Practical sessions provide hands-on experience with the methods.
Statistical Learning
This course is mainly devoted to the analysis of "fat" data sets with a large number of variables. In this situation, effective analysis requires dimensionality reduction techniques. We discuss classical and modern methods of dimensionality reduction in the context of supervised and unsupervised learning. Specifically, we consider principal component analysis, subspace clustering and Gaussian graphical models (unsupervised learning) and different penalized methods for building predictive models (supervised learning) including ridge regression, LASSO and SLOPE. The emphasis is placed on understanding the statistical properties of the discussed methodology through theoretical results, simulation studies and analysis of real data.
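To illustrate how two of the penalized methods named above differ, the sketch below compares ridge and LASSO in the simplest possible setting. It assumes an orthonormal design, so that each coefficient can be shrunk separately starting from its ordinary least-squares estimate. The function names, the penalty value and the input numbers are hypothetical.

```python
def ridge_shrink(ols_estimate, penalty):
    """Minimiser of 0.5*(ols_estimate - beta)**2 + 0.5*penalty*beta**2."""
    return ols_estimate / (1.0 + penalty)


def lasso_soft_threshold(ols_estimate, penalty):
    """Minimiser of 0.5*(ols_estimate - beta)**2 + penalty*abs(beta)."""
    magnitude = max(abs(ols_estimate) - penalty, 0.0)
    return magnitude if ols_estimate >= 0 else -magnitude


# Hypothetical least-squares estimates under an (assumed) orthonormal design.
ols_estimates = [3.0, -0.4, 0.1, -2.0]
penalty = 0.5
ridge = [ridge_shrink(b, penalty) for b in ols_estimates]
lasso = [lasso_soft_threshold(b, penalty) for b in ols_estimates]
```

Ridge rescales every coefficient and never sets one exactly to zero. The LASSO soft-threshold zeroes out the two estimates whose magnitude is below the penalty, which is why it also acts as a variable selection method.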
Elective Core Courses
To gain more specialized knowledge in different areas of data science, students have to take at least four of the following fundamental elective courses. Each course is taught at least once every two years (the semester of the next edition is given in the course description). Topics are subject to slight changes and updates, which reflect the evolution of the data science field and varying requirements of the job market.
Methods of classification and dimensionality reduction
The course provides a survey of dimensionality reduction (feature extraction) and classification methods. Dimensionality reduction improves the performance of computer vision and machine learning approaches, represents data more efficiently, and makes it possible to visualise high-dimensional data. Among others, we study principal component analysis (PCA), non-negative matrix factorization (NMF), independent component analysis (ICA), and t-distributed Stochastic Neighbour Embedding (t-SNE). Concerning classification, we study many classical "shallow-learning" classifiers, e.g., nearest neighbours, naive Bayes, support vector machines (SVM), linear and quadratic discriminant analysis (LDA and QDA), and decision trees. Though all details are provided for most methods, we put a strong emphasis on intuition and practical applications. In lab classes we apply the acquired knowledge to various practical problems, e.g., classification of multidimensional data (including time series, images and texts), image compression, topic recovery, and recommendation systems.
Simulations and algorithmic applications of Markov chains
The course is devoted to discrete-time Markov chains with a finite state space. We start gently with the fundamentals (stationary distributions, transition-matrix-based simulations, reversibility) and go through: Markov chain Monte Carlo methods (MCMC, a class of algorithms providing one of the currently most popular methods for simulating complicated stochastic systems); rate of convergence methods ("how many times should we shuffle a deck of cards?"; we study coupling methods, strong stationary times, strong stationary duality, and the Cheeger and Poincaré inequalities for bounding the second-largest eigenvalue of a transition matrix); the coupling from the past (CFTP) algorithm (an improvement on standard MCMC that yields an unbiased sample from a given distribution on a huge state space, e.g., the Ising model); estimating winning probabilities in gambler's-ruin-like problems (first step analysis and Siegmund duality); simulated annealing (a widely used randomized algorithm for various optimization problems); the basics of hidden Markov models (HMM, a popular machine learning model, e.g., for speech recognition); and randomized polynomial-time approximation schemes (MCMC-based algorithms for approximating "the answer" to NP-hard-related problems, e.g., graph coloring).
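As a small illustration of the fundamentals mentioned above, the following sketch approximates the stationary distribution of a finite chain by repeatedly applying the transition matrix. It assumes the chain is irreducible and aperiodic, so that the iteration converges. The two-state matrix, the function name and the iteration count are hypothetical choices made for the example.

```python
def stationary_distribution(transition, iterations=200):
    """Approximate pi with pi = pi * P by iterating dist <- dist * P.

    Assumes an irreducible, aperiodic chain on a finite state space.
    """
    n = len(transition)
    dist = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        # One step of the chain: new_dist[j] = sum_i dist[i] * P[i][j].
        dist = [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return dist


# Hypothetical two-state chain; each row sums to 1.
transition = [[0.9, 0.1],
              [0.5, 0.5]]
pi = stationary_distribution(transition)
```

For this matrix the exact stationary distribution is (5/6, 1/6). The second-largest eigenvalue of the matrix is 0.4, which governs how quickly the iterates approach that limit.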
Natural language processing
The aim of the course is to discuss the methods used in the analysis and processing of texts in natural languages, with particular emphasis on results that can be translated into effective implementations. We consider both classical methods of language modelling (Hidden Markov Models, (Probabilistic) Context Free Grammars, Finite State Transducers) and modern, neural-network-based approaches: RNN, LSTM, Convolutional Neural Networks and Transformer. We show several applications of these methods, including POS-tagging, dependency parsing, Named Entity Recognition, Machine Translation, and Natural Language Generation.
Text mining
In this course, we cover basic and more advanced information retrieval techniques. We also discuss data mining methods applied to texts. We discuss how to implement, from scratch, efficient systems that gather information from large text corpora. We analyze in detail several variants of word embeddings and their use in computing text similarity. We discuss text classification methods, flat and hierarchical clustering, automatic summarisation, text comprehension methods and question answering.
Advanced Data Mining
This course focuses on advanced data mining algorithms for processing big, complex and unstructured data. It mainly concerns recommendation systems, dimensionality reduction with neighborhood embedding, temporal data mining and decision support systems. In recommendation systems, various approaches from simple collaborative filtering to advanced matrix factorization are presented and discussed in the context of their practical relevance, concerning not only the popular MSE or MAE measures, but also the coverage, diversity, and novelty of recommendations. In temporal data mining, besides the analysis of regular time series with machine learning methods, such as Support Vector Regression and Neural Networks, unstructured temporal data are studied. Student projects concern unstructured datasets, such as irregular multidimensional time series, GPS tracks or medical images.
Tools and methods in big data processing
This course covers the technical background useful in processing large amounts of data in a distributed environment. We discuss cloud computing basics and then introduce the Hadoop Distributed File System (HDFS) architecture and the MapReduce programming paradigm. The course studies in depth Apache Spark and its ecosystem, including Spark programming with Scala, Spark SQL, the Spark graph processing framework GraphX, as well as TinkerPop and Gremlin traversals. Finally, some time is spent on stream processing technologies such as Apache Kafka and Spark Streaming.
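The MapReduce paradigm mentioned above can be sketched on a single machine in plain Python. This is only an illustration of the map, shuffle and reduce phases, not Hadoop or Spark code. The function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values for every key."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input; in a distributed setting each document could live on a different node.
documents = ["big data tools", "big data processing", "data streams"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
```

Because the map step treats each document independently and the reduce step treats each key independently, both phases can in principle be spread over many machines.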
Theory of Analysis of Large Data Sets
In this course we will provide the mathematical theory that explains statistical problems in the analysis of high-dimensional data. In the first part of the lecture we will concentrate on the generic problem of estimating the vector of means of a multivariate normal distribution. We will discuss testing the global hypothesis that this vector is equal to zero and show a variety of methods which are optimal under different scenarios concerning the sparsity of this vector. We will also discuss the detectability curve, which provides the relationship between the sparsity and the magnitude of the elements of the vector of means, so that the signal can be identified with testing procedures. The course will also cover different procedures for multiple testing, the Stein identity and the James-Stein estimator of the vector of means. Finally, we will discuss the topic of selective inference and the application of the above ideas in the context of high-dimensional multiple regression.
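For orientation, the James-Stein estimator named above can be written in a few lines. The sketch assumes a single observation vector with identity covariance (unit variance in every coordinate) and shrinkage towards zero. The function name and the input vector are hypothetical.

```python
def james_stein(observations):
    """James-Stein estimate of the mean vector from one draw X ~ N(theta, I_p).

    Multiplies X by 1 - (p - 2) / ||X||^2; the classical result needs p >= 3.
    """
    p = len(observations)
    if p < 3:
        raise ValueError("the James-Stein estimator needs at least 3 coordinates")
    squared_norm = sum(x * x for x in observations)
    factor = 1.0 - (p - 2) / squared_norm
    return [factor * x for x in observations]


# Hypothetical observation with p = 4 and squared norm 25.
observations = [3.0, 4.0, 0.0, 0.0]
estimate = james_stein(observations)
```

Here the shrinkage factor is 1 - 2/25 = 0.92, so every coordinate is pulled slightly towards zero. The factor can become negative when the squared norm is small. The positive-part variant, which truncates the factor at zero, is a common remedy that this sketch does not implement.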
Neural Networks
The Neural Networks course provides an in-depth understanding of Artificial Neural Networks and Deep Learning methods, from both a theoretical and a practical standpoint. The course will cover neural network fundamentals and provide details on the most commonly used models: convolutional architectures used for image processing, recurrent networks for sequential data, and attention-based models used in language processing. We will also study modern data generation methods, such as variational auto-encoders and generative adversarial networks, concentrating on learning features that are useful in downstream data processing tasks. Accompanying lab sessions will provide hands-on experience with the material.
Numerical programming tools and methods
Numerical programming tools and methods is an introduction to programming languages and frameworks used in data science and deep learning. Concentrating on the Python ecosystem, the course teaches numpy internals, data wrangling with pandas, and efficient usage of hardware accelerators through deep learning frameworks. Techniques are illustrated using practical projects that involve physical simulations, digital signal processing, real-world data analysis and modeling using machine learning.
Analysis of Complex Data. Jarosław Harezlak
Semiparametric regression. Jarosław Harezlak
Seminar: probabilistic graphical models
Team Project: TBA
Additional courses
Students may broaden their competences by taking additional courses that are not directly related to data science but could give them a competitive edge in their future careers, for example by providing unique skills in optimization or algorithms. While students may freely choose any master-level courses taught at the department, the list below contains courses that are particularly suited to data science students.
- Algorithmic game theory. Jarosław Byrka
- Algorithms on strings. Paweł Gawrychowski
- Artificial intelligence in games. Jakub Kowalski
- Approximation algorithms. Katarzyna Paluch
- Category theory. Maciej Piróg
- Combinatorial optimization. Katarzyna Paluch
- Combinatorics. Grzegorz Stachowiak
- Computational complexity. Krzysztof Loryś
- Computational geometry. Paweł Gawrychowski
- Cryptography. Grzegorz Stachowiak
- Data compression. Tomasz Jurdziński
- Distributed algorithms. Tomasz Jurdziński
- Interactive theorem proving in Coq. Małgorzata Biernacka and Filip Sieczkowski
- Introduction to simulations and Monte Carlo methods. Tomasz Rolski, Paweł Lorek
- Online algorithms. Marcin Bieńkowski
- Photorealistic computer graphics. Andrzej Łukaszewski
- Program analysis. Witold Charatonik
- Randomized algorithms. Marek Piotrów
- Semantics of programming languages. Dariusz Biernacki and Filip Sieczkowski
- Theory of linear and integer programming. Jarosław Byrka
- UNIX kernel structure. Krystian Bacławski
- Verification of programs. Małgorzata Biernacka and Witold Charatonik
- Word equations. Artur Jeż
Contact
Faculty of Mathematics and Computer Science, University of Wrocław.
ul. Joliot-Curie 15
50-384 Wrocław
Program related questions: datascience@uwr.edu.pl
Administration/recruitment procedure related questions: international@uwr.edu.pl