Courses
The courses will be given on the first day (May 8, 2023) of the EMR Conference. They will open when ten or more participants are enrolled. Each course is limited to a maximum of fifteen participants. The available courses are listed below. Click the course for details.
-
> Course 1: Introduction to explainable machine learning with examples in healthcare
Instructor(s):
Przemyslaw Biecek's goal is to support humans' effectiveness through safe, ethical, effective, and automated predictions. He pursues this by developing processes, methods, tools, and software for responsible machine learning. He defended his PhD in mathematical statistics in 2007 and gained the full professor title in computer science in 2023. During this time he has worked in various positions: as an academic researcher at the Warsaw University of Technology and the University of Warsaw, as an AI expert at the OECD and GPAI.AI, and as an ML specialist at Samsung, IBM, Netezza and Disney. As an entrepreneur, he founded the MI2.AI RedTeam, which offers services for training and auditing predictive models from the perspective of transparency, robustness and fairness.
He has always been fascinated by data visualisation. He now uses this interest to work on the visualisation of predictive models. This is the topic of his latest book "Explanatory Model Analysis" https://ema.drwhy.ai/.
In his free time, he writes stories and comics in the Beta and Bit series, which introduce data literacy to high school students https://www.mi2.ai/beta-bit.html.
The aim of the workshop is to introduce participants to explainable artificial intelligence (XAI) methods that can be used to build predictive models and extract knowledge from them. The workshop will combine discussion of the theoretical basis with code examples that participants can run themselves. We will use real-world data for either a COVID-19 mortality prediction problem or a heart disease classification problem.
The discussed methods are available in many programming languages and various libraries, but the workshop will be based on examples in R using the DALEX library. The scope of the workshop coincides with that of the book Explanatory Model Analysis https://ema.drwhy.ai/.
The first part of the workshop is dedicated to exploratory data analysis tools and preparing for modelling. The second part of the workshop is focused on tools for developing predictive models. For the purposes of the example, we will discuss decision trees, random forests and techniques for automatic tuning of random forests. The third part will focus on local model explanation techniques. We will discuss SHAP (Shapley values), break-down and LIME, the most popular methods for local exploration of models. The fourth part will be devoted to global model explanation techniques. We will discuss the permutation importance technique for variables and the Partial Dependence technique. The workshop will be based on material from https://github.com/BetaAndBit/RML
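As a rough illustration of the permutation importance technique named above, here is a minimal sketch. It is not workshop material: the workshop uses R and DALEX, while this sketch is plain Python, and the toy model, the synthetic data and all function names are assumptions made for the example. The idea is to shuffle one variable at a time and measure how much the model's loss increases.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE after shuffling each column of X.

    predict: function mapping one row (a list) to a prediction.
    Returns one importance value per column.
    """
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, which breaks its link to the target.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted_loss = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted_loss - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical toy data: the target depends strongly on the first variable,
# weakly on the second, and not at all on the third.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 1) for _ in range(3)] for _ in range(200)]
y = [3.0 * a + 0.5 * b for a, b, _ in X]


def toy_model(row):
    """Stand-in for a fitted black-box model (an assumption of this sketch)."""
    return 3.0 * row[0] + 0.5 * row[1]


importances = permutation_importance(toy_model, X, y)
for name, value in zip(["x1", "x2", "x3"], importances):
    print(name, round(value, 4))
```

In this sketch, a variable the model ignores gets an importance of zero, and a variable the model relies on more heavily gets a larger value.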
Why?
Complex machine learning models are frequently used in predictive modeling. There are many examples of random forest-like or boosting-like models in medicine, finance, agriculture, etc. But who trusts black boxes? In this workshop we will show why and how one would analyse the structure of a black-box model. This will be a hands-on workshop. In each part there will be a short lecture and then time for practice and discussion. Using the analysis of a specific dataset as an example, we will show the basics of modelling with tree models. We will then show how to evaluate and analyse such models using XAI techniques. We will work with the packages randomForest, party, mlr3, DALEX, modelStudio and arenar.
-
> Course 2: Removing unwanted variation from large-scale RNA sequencing data with PRPS
Instructor(s):
Ramyar Molania is a computational cancer biologist at the Walter and Eliza Hall Institute of Medical Research (WEHI) in Australia. He received a PhD in bioinformatics from The University of Melbourne in 2018 and then joined Professor Tony Papenfuss's laboratory as a research fellow. His research interests mainly involve developing statistical and computational tools for the removal of unwanted variation from large and complicated single-cell and bulk gene expression data to achieve reliable and accurate results. He is also working on the integration of multimodal single-cell data to better understand the treatment responses of cancer patients. He is a PCF Young Investigator, and he works with Associate Professor Shahneen Sandhu on integrative analysis of multimodal single-cell data of metastatic castration-resistant prostate cancer in response to new treatments.
Marie Trussart is a senior post-doctoral researcher in the Department of Bioinformatics at the Walter and Eliza Hall Institute in Australia. She obtained her PhD in System Biology and Structural Biology Modeling from the EMBL Centre for Genomic Regulation in Barcelona, Spain, working on Hi-C data and 3D modelling of chromosome structure. She joined Terry Speed’s lab as a postdoctoral fellow about 6 years ago and focuses on the development of new statistical methods for large bulk RNA-seq datasets and single-cell protein data in cancer research.
Large-scale datasets generated by different omics technologies present unique challenges in terms of normalization and integration. This course focuses on expanding biostatistical and bioinformatics methods for such challenges. We will focus on the RUV normalization methods, which have shown great promise in dealing with the challenges presented by large-scale datasets from TCGA. RUV-PRPS, a novel strategy (Molania et al, 2023, Nat. Biotech, https://www.nature.com/articles/s41587-022-01440-w#code-availability), uses pseudo-replicates of pseudo-samples (PRPS) to normalize RNA-seq data in situations where technical replicates are not available. In this course we will present the new RUV-PRPS package we have been developing, a user-friendly R package that enables researchers to run the RUV-PRPS method and to visualize diagnostic plots before and after normalization to assess the quality and consistency of their data.
Session 1: Introduction to large-scale RNA sequencing and RUV methods - Theoretical session
- Introduction to Removing Unwanted Variation (RUV) methods and model
- Pseudo-replicates and pseudo-samples approach (Molania et al, Nature Biotech, 2023, https://www.nature.com/articles/s41587-022-01440-w#code-availability)
Session 2: Identification of unwanted variation in RNA-seq data - Hands-on session
- RNA-seq data from The Cancer Genome Atlas (TCGA) and their provided normalisations
- RUV-PRPS package with statistical methods to identify unwanted variation:
  - Functions to identify variation in categorical variables: PCA, silhouette coefficient, ARI, ANOVA, vector correlation.
  - Functions to identify variation in continuous variables: Linear regression, correlation.
Session 3: How to apply RUV-PRPS - Hands-on session
- Selection of negative control genes
- Function from the RUV-PRPS package to create pseudo-replicates of pseudo-samples (PRPS) to correct for library size, batch effects and tumour purity.
- Function from the RUV-PRPS package to run the RUV-PRPS method.
Session 4: Normalisation performance assessment - Hands-on session
- RUV-PRPS package with statistical methods to assess the performance of the normalisation method:
  - Functions to identify variation in categorical variables: PCA, silhouette coefficient, ARI, ANOVA, vector correlation.
  - Functions to identify variation in continuous variables: Linear regression, correlation.
- How unwanted variation can influence downstream analysis, including gene-gene correlation and survival analysis
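To make one of the assessment metrics listed above concrete, here is a minimal sketch of the silhouette coefficient computed against a categorical variable such as batch. It is not the RUV-PRPS package's code or API: the course uses R, while this sketch is plain Python, and the synthetic two-dimensional data (standing in for the first principal components) and all names are assumptions. A high silhouette for batch labels suggests that samples cluster by batch, meaning unwanted variation is present. A value near zero suggests the batches are well mixed.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_silhouette(points, labels):
    """Average silhouette coefficient of points grouped by labels."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in groups[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for a single-member group
            continue
        # a: mean distance to the other members of the same group
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other group
        b = min(
            sum(euclidean(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in groups.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical data: 60 samples in two batches. "clean" has no batch effect,
# "shifted" adds an offset to batch B on the first coordinate.
rng = random.Random(1)
batch = ["A"] * 30 + ["B"] * 30
clean = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in batch]
shifted = [[x + (5.0 if b == "B" else 0.0), y] for (x, y), b in zip(clean, batch)]

s_with_batch_effect = mean_silhouette(shifted, batch)
s_without_batch_effect = mean_silhouette(clean, batch)
print("silhouette with batch effect:", round(s_with_batch_effect, 3))
print("silhouette without batch effect:", round(s_without_batch_effect, 3))
```

Comparing the value before and after normalisation, as in the two calls above, is one simple way to judge whether a batch effect has been reduced.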
-
> Course 3: How to use cloud technologies for reliable and responsible Data Science projects?
Instructor(s):
Erdal Coşgun received his B.Sc. from the Department of Statistics, Faculty of Science at Hacettepe University in 2007. He started his Ph.D. research as a research assistant at the Department of Biostatistics, Faculty of Medicine at Hacettepe University the same year, and completed his Ph.D. thesis, titled “New Approach to Unsupervised Based Classification on Microarray Data”, in 2013. He worked in the Section on Statistical Genetics, Department of Biostatistics at the University of Alabama at Birmingham (UAB) for 6 months in 2009. After that, he completed a 3-month bioinformatics training at the Research and Development Campus of Pfizer Inc. in Groton, MA, on a full scholarship from the company in 2011. After finishing his Ph.D., he worked as an Asst. Prof. at Acıbadem University, Faculty of Medicine, Department of Biostatistics and Medical Informatics (Head of Department, 2014-2015). He joined Microsoft in 2015 as a Global Black Belt, Technology Solutions Professional on Advanced Analytics. He was responsible for Azure Machine Learning, R Server and Stream Analytics across the MEA region. He has been working at Microsoft Genomics since December 2016 as a Senior Data and Applied Scientist.
Vincent Carey is Associate Professor of Medicine (Biostatistics) in the Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School. As a Fulbright Specialist and as an invited lecturer, he has given short courses in statistical genomics on four continents. He was an inaugural faculty member in the Cold Spring Harbor Laboratory Summer Course on statistical analysis of genome-scale data, and is former Editor-in-Chief of The R Journal. He is Scientific Director of Bioinformatics in the National Institute of Allergy and Infectious Diseases Immune Tolerance Network, and is a member of the Scientific Advisory Board of the Vaccine and Immunology Statistical Center of the Collaboration for AIDS Vaccine Discovery. Vince is a co-founder of the Bioconductor project.
Dr. Deniz İlhan Topcu is a medical doctor and biochemistry specialist with a BSc in computer engineering. With extensive training and experience in both medicine and computer science, Topcu has developed expertise in using R for clinical laboratory-related analysis. As an experienced R user, Topcu is particularly interested in the application of artificial intelligence, machine learning, and data analytics in a clinical laboratory setting. Topcu is passionate about leveraging these tools to improve patient outcomes and optimize healthcare delivery.
Researchers are using cloud environments to share biomedical data, run analysis tools, and collaborate. In this hands-on course, we will cover the following topics:
- Create a reliable and secure cloud environment. Data sharing and auto-scaling of your compute solutions – 60 mins
- Deploy and use Jupyter Lab, VS Code Server on terra.bio – 30 mins
- Selected topics in genomic visualization and analysis with Bioconductor on cloud – 75 mins
- Responsible Data Science use-cases – 15 mins
Requirements:
- Mid-level (200) Linux OS experience
- Mid-Level (200) R programming experience
- Mid-Level (300) Python programming experience
- Experienced in ‘Jupyter Notebook/Lab OR Hub’ usage with different kernel types (R, Python, Julia, Spark etc.)
- Virtual Machines will be provided in the course, but participants SHOULD BRING THEIR LAPTOPS OR PC.
Quota : Max. 10 participants.
Time : 3 hours.