Publications

ORCID | Google Scholar




Interactive Visualization of Metric Distortion in Nonlinear Data Embeddings using the distortions Package

Kris Sankaran, Shuzhen Zhang, Chenab, Marina Meila

Abstract: Nonlinear dimensionality reduction methods like UMAP and t-SNE can help to organize high-dimensional genomics data into manageable low-dimensional representations, like cell types or differentiation trajectories. Such reductions can be powerful, but inevitably introduce distortion. A growing body of work has demonstrated that this distortion can have serious consequences for downstream interpretation, for example, suggesting clusters that do not exist in the original data. Motivated by these developments, we implemented a software package, distortions, which builds on state-of-the-art methods for measuring local distortion and displays them in an intuitive and interactive way. Through case studies on simulated and real data, we find that the visualizations can help flag fragmented neighborhoods, support hyperparameter tuning, and enable method selection. We believe that this extra layer of information will help practitioners use nonlinear dimensionality reduction methods more confidently. The package documentation and notebooks reproducing all case studies are available online at https://krisrs1128.github.io/distortions/site/.

Paper: Under Review | PDF

Code:

Semisynthetic Simulation for Microbiome Data Analysis

Kris Sankaran, Saritha Kodikara, Jingyi Jessica Li, Kim-Anh Lê Cao

Abstract: High-throughput sequencing data lie at the heart of modern microbiome research. Effective analysis of these data requires careful preprocessing, modeling, and interpretation to detect subtle signals and avoid spurious associations. In this review, we discuss how simulation can serve as a sandbox to test candidate approaches, creating a setting that mimics real data while providing ground truth. This is particularly valuable for power analysis, methods benchmarking, and reliability analysis. We explain the probability, multivariate analysis, and regression concepts behind modern simulators and how different implementations make trade-offs between generality, faithfulness, and controllability. Recognizing that all simulators only approximate reality, we review methods to evaluate how accurately they reflect key properties. We also present case studies demonstrating the value of simulation in differential abundance testing, dimensionality reduction, network analysis, and data integration. Code for these examples is available in an online tutorial (https://go.wisc.edu/8994yz) that can be easily adapted to new problem settings.

Paper: Briefings in Bioinformatics | PDF

Code:

multimedia: Multimodal Mediation Analysis of Microbiome Data

Hanying Jiang, Xinran Miao, Margaret W. Thairu, Mara Beebe, Dan W. Grupe, Richard J. Davidson, Jo Handelsman, and Kris Sankaran

Abstract: Mediation analysis has emerged as a versatile tool for answering mechanistic questions in microbiome research because it provides a statistical framework for attributing treatment effects to alternative causal pathways. Using a series of linked regression models, this analysis quantifies how complementary data modalities relate to one another and respond to treatments. Despite these advances, the rigid modeling assumptions of existing software often result in users viewing mediation analysis as a black box, not something that can be inspected, critiqued, and refined. We designed the multimedia R package to make advanced mediation analysis techniques accessible to a wide audience, ensuring that all statistical components are easily interpretable and adaptable to specific problem contexts. The package provides a uniform interface to direct and indirect effect estimation, synthetic null hypothesis testing, and bootstrap confidence interval construction. We illustrate the package through two case studies. The first re-analyzes a study of the microbiome and metabolome of Inflammatory Bowel Disease patients, uncovering potential mechanistic interactions between the microbiome and disease-associated metabolites that were not found in the original study. The second analyzes new data about the influence of mindfulness practice on the microbiome. The mediation analysis identifies a direct effect between a randomized mindfulness intervention and microbiome composition, highlighting shifts in taxa previously associated with depression that cannot be explained by diet or sleep behaviors alone. A gallery of examples and further documentation can be found at https://go.wisc.edu/830110.
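For readers unfamiliar with the "linked regression models" idea, the following is a minimal sketch of classical linear mediation with a single mediator, written in plain Python. It is not the multimedia package's interface (which is in R); the function names and the toy data are assumptions made purely for illustration. One regression relates the mediator to the treatment, a second relates the outcome to both, and the indirect effect is the product of the two path coefficients.

```python
def centered_cross(x, y):
    """Sum of cross-products of x and y after centering each at its mean."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))


def linear_mediation(treatment, mediator, outcome):
    """Direct, indirect, and total effects under a linear one-mediator model."""
    s_tt = centered_cross(treatment, treatment)
    s_mm = centered_cross(mediator, mediator)
    s_tm = centered_cross(treatment, mediator)
    s_ty = centered_cross(treatment, outcome)
    s_my = centered_cross(mediator, outcome)

    # Regression 1: mediator ~ treatment (slope a).
    a = s_tm / s_tt

    # Regression 2: outcome ~ treatment + mediator (closed-form OLS slopes).
    det = s_tt * s_mm - s_tm ** 2
    direct = (s_mm * s_ty - s_tm * s_my) / det
    b = (s_tt * s_my - s_tm * s_ty) / det

    # Total effect: outcome ~ treatment alone.
    total = s_ty / s_tt
    return {"direct": direct, "indirect": a * b, "total": total}


# Hypothetical toy data: binary treatment, one mediator, one outcome.
treatment = [0, 0, 0, 0, 1, 1, 1, 1]
mediator = [1.0, 1.2, 0.8, 1.1, 2.0, 2.3, 1.9, 2.2]
outcome = [2.0, 2.5, 1.7, 2.1, 4.1, 4.6, 3.8, 4.4]

effects = linear_mediation(treatment, mediator, outcome)
print(effects)
```

In this linear setting the direct and indirect effects add up to the total effect exactly, which gives a quick sanity check on any implementation.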

Paper: Microbiology Spectrum | PDF

Code:

Data Science Principles for Interpretable and Explainable AI

Kris Sankaran

Abstract: Society's capacity for algorithmic problem-solving has never been greater. Artificial Intelligence is now applied across more domains than ever, a consequence of powerful abstractions, abundant data, and accessible software. As capabilities have expanded, so have risks, with models often deployed without fully understanding their potential impacts. Interpretable and interactive machine learning aims to make complex models more transparent and controllable, enhancing user agency. This review synthesizes key principles from the growing literature in this field. We first introduce precise vocabulary for discussing interpretability, like the distinction between glass box and explainable algorithms. We then explore connections to classical statistical and design principles, like parsimony and the gulfs of interaction. Basic explainability techniques (including learned embeddings, integrated gradients, and concept bottlenecks) are illustrated with a simple case study. We also review criteria for objectively evaluating interpretability approaches. Throughout, we underscore the importance of considering audience goals when designing interactive algorithmic systems. Finally, we outline open challenges and discuss the potential role of data science in addressing them. Code to reproduce all examples can be found at https://go.wisc.edu/3k1ewe.

Paper: Journal of Data Science | PDF

Code:

MolPad: An R-Shiny Package for Cluster Co-Expression Analysis in Longitudinal Microbiomics

Kaiyan Ma, Margaret W. Thairu, and Kris Sankaran

Abstract: The R-Shiny package MolPad provides an interactive dashboard for understanding the dynamics of longitudinal molecular co-expression in microbiomics. The main idea is to first use a network to give an overview of major patterns among the molecules' predictive relationships and then zoom into specific clusters of interest. It is designed with a focus-plus-context analysis strategy and automatically generates links to online curated annotations. The dashboard consists of a cluster-level network, a bar plot of taxonomic composition, a line plot of data modalities, and a table for each pathway. Further, the package includes functions that handle the data processing for creating the dashboard. This makes it beginner-friendly for users with less R programming experience. We illustrate these methods with a case study on a longitudinal metagenomics analysis of the cheese microbiome. A demo is available at https://connect.doit.wisc.edu/molpad-demo/.

Paper: Under Review | PDF

Code:

mbtransfer: Microbiome Intervention Analysis using Transfer Functions and Mirror Statistics

Kris Sankaran and Pratheepa Jeganathan

Abstract: Microbiome interventions provide valuable data about microbial ecosystem structure and dynamics. Despite their ubiquity in microbiome research, few rigorous data analysis approaches are available. In this study, we extend transfer function-based intervention analysis to the microbiome setting, drawing from advances in statistical learning and selective inference. Our proposal supports the simulation of hypothetical intervention trajectories and False Discovery Rate-guaranteed selection of significantly perturbed taxa. We explore the properties of our approach through simulation and re-analyze three contrasting microbiome studies. An R package, mbtransfer, is available at https://go.wisc.edu/crj6k6. Notebooks to reproduce the simulation and case studies can be found at https://go.wisc.edu/dxuibh and https://go.wisc.edu/emxv33.
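As a rough illustration of how mirror statistics can be turned into a selection with a target False Discovery Rate, the sketch below applies the data-driven cutoff commonly described in the mirror statistics literature: large positive values suggest signal, and the count of large negative values estimates the number of false positives. This is an assumption about the general technique, not the mbtransfer implementation; the function names and the example values are hypothetical.

```python
def mirror_threshold(mirror, q):
    """Smallest cutoff t whose estimated false discovery proportion is <= q.

    The estimate is #{M_j < -t} / max(#{M_j > t}, 1): null features are
    assumed to have mirror statistics roughly symmetric around zero.
    """
    candidates = sorted({abs(m) for m in mirror if m != 0})
    for t in candidates:
        negatives = sum(1 for m in mirror if m < -t)
        positives = sum(1 for m in mirror if m > t)
        if negatives / max(positives, 1) <= q:
            return t
    return float("inf")  # no nonzero statistics, so nothing is selected


def select_features(mirror, q):
    """Indices of features whose mirror statistic exceeds the cutoff."""
    t = mirror_threshold(mirror, q)
    return t, [j for j, m in enumerate(mirror) if m > t]


# Hypothetical mirror statistics for ten taxa.
mirror = [5.0, 4.2, 3.9, 3.1, 2.8, 0.4, -0.3, 0.2, -0.5, 0.1]
tau, selected = select_features(mirror, q=0.2)
print(tau, selected)
```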

Paper: PLOS Computational Biology | PDF

Code:

Microbiome composition modulates secondary metabolism in a multispecies bacterial community

Marc G. Chevrette, Chris S. Thomas, Amanda Hurley, Natalia Rosario-Meléndez, Kris Sankaran, Yixing Tu, Austin Hall, Shruthi Magesh, and Jo Handelsman


Paper: Proceedings of the National Academy of Sciences | PDF

Code:

Generative Models: An Interdisciplinary Perspective

Kris Sankaran, Susan P. Holmes

Abstract: By linking conceptual theories with observed data, generative models can support reasoning in complex situations. They have come to play a central role both within and beyond statistics, providing the basis for power analysis in molecular biology, theory building in particle physics, and resource allocation in epidemiology, for example. We introduce the probabilistic and computational concepts underlying modern generative models and then analyze how they can be used to inform experimental design, iterative model refinement, goodness-of-fit evaluation, and agent-based simulation. We emphasize a modular view of generative mechanisms and discuss how they can be flexibly recombined in new problem contexts. We provide practical illustrations throughout, and code for reproducing all examples is available at https://github.com/krisrs1128/generative_review. Finally, we observe how research in generative models is currently split across several islands of activity, and we highlight opportunities lying at disciplinary intersections.

Paper: Annual Review of Statistics and Its Application | PDF

Code:

Spatial Transcriptomics Dimensionality Reduction using Wavelet Bases

Zhuoyan Xu and Kris Sankaran

Abstract: Spatially resolved transcriptomics (ST) measures gene expression along with the spatial coordinates of the measurements. The analysis of ST data involves significant computational complexity. In this work, we propose a gene expression dimensionality reduction algorithm that retains spatial structure. We combine the wavelet transformation with matrix factorization to select spatially-varying genes and extract a low-dimensional representation of these genes. We consider an Empirical Bayes setting, imposing regularization through the prior distribution of factor genes. Additionally, we provide visualizations of the extracted representation genes that capture the global spatial pattern. We illustrate the performance of our methods through spatial structure recovery and gene expression reconstruction in simulation. In real data experiments, our method identifies the spatial structure of gene factors and outperforms regular decomposition in terms of reconstruction error. We found a connection between the fluctuation of gene patterns and the wavelet technique, which provides smoother visualization. We develop a package and share a workflow that generates reproducible quantitative results and gene visualizations. The package is available at https://github.com/OliverXUZY/waveST.

Paper: Under Review | PDF

Code:

Artificial Intelligence for Climate Change Adaptation

So-Min Cheong, Kris Sankaran, Hamsa Bastani

Abstract: Although artificial intelligence (AI; inclusive of machine learning) is gaining traction in supporting climate change projections and impacts, limited work has used AI to address climate change adaptation. We identify this gap and highlight the value of AI, especially in supporting complex adaptation choices and implementation. We illustrate how AI can effectively leverage precise, real-time information in data-scarce settings. We focus on supervised learning, transfer learning, reinforcement learning, and multimodal learning to illustrate how innovative AI methods can enable better-informed choices, tailor adaptation measures to heterogeneous groups, and generate effective synergies and trade-offs.

Paper: WIRES Data Mining and Knowledge Discovery. April 2022. https://doi.org/10.1002/widm.1459 | PDF


Source Data Selection for Out-of-Domain Generalization

Xinran Miao and Kris Sankaran

Abstract: Models that perform out-of-domain generalization borrow knowledge from heterogeneous source data and apply it to a related but distinct target task. Transfer learning has proven effective for accomplishing this generalization in many applications. However, poor selection of a source dataset can lead to poor performance on the target, a phenomenon called negative transfer. In order to take full advantage of available source data, this work studies source data selection with respect to a target task. We propose two source selection methods that are based on the multi-bandit theory and random search, respectively. We conduct a thorough empirical evaluation on both simulated and real data. Our proposals can also be viewed as diagnostics for the existence of reweighted source subsamples that perform better than a random selection of available samples.

Paper: Under Review | PDF

Code:

Estimating Glacial Lake Trends using Historically Guided Segmentation Models

Weiyushi Tian, Anthony Ortiz, Tenzing C. Sherpa, Finu Shresta, Mir Matin, Rahul Dodhia, Juan M. Lavista Ferres, and Kris Sankaran

Abstract: We compare several approaches to segmenting glacial lakes in the Hindu Kush Himalayas in order to support glacial lake area monitoring. More automatic monitoring could support risk assessments of Glacial Lake Outburst Floods (GLOF), a type of natural hazard that poses a risk to communities and infrastructure in valleys below glacial lakes. We evaluate several approaches to incorporate labels from a 2015 survey using Landsat 7 imagery to guide segmentation on newer, higher-resolution satellite images like Sentinel 2 and Bing Maps imagery, comparing them also to approaches that do not use this form of weak prior. We find that a guided version of U-Net and a properly initialized form of morphological snakes are most effective for these two datasets, respectively, each providing an 8-10% IoU improvement over a standard U-Net. An error analysis highlights the strengths and limitations of each approach. We design visualizations to support discovery of lakes of potential concern, including an interactive exploratory interface. All code supporting our study is released in public repositories.
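The IoU (Intersection over Union) figure quoted in the abstract measures overlap between a predicted mask and a reference mask. The sketch below shows the standard definition on small binary masks; it is an illustration only, not code from the study, and the toy masks and function name are assumptions.

```python
def iou(pred, truth):
    """Intersection over Union for two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; treat their IoU as 1.0 by convention.
    return intersection / union if union else 1.0


# Hypothetical 3 x 4 lake masks: 1 marks a lake pixel, 0 marks background.
truth = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
pred = [
    [0, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

score = iou(pred, truth)
print(score)  # 3 shared pixels out of 5 pixels covered by either mask
```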

Paper: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | PDF

Code:

Interactive Visualization and Representation Analysis Applied to Glacier Segmentation

Minxing Zheng, Xinran Miao, Kris Sankaran

Abstract: Interpretability has attracted increasing attention in earth observation problems. We apply interactive visualization and representation analysis to guide the interpretation of glacier segmentation models. We visualize the activations from a U-Net to understand and evaluate the model performance. We built an online interface using the Shiny R package to provide comprehensive error analysis of the predictions. Users can interact with the panels and discover model failure modes. We illustrate an example of how our interface could help guide decisions for improving model performance. Further, we discuss how visualization can provide sanity checks during data preprocessing and model training. By closely examining the problem of glacier segmentation, we are able to discuss how visualization strategies can support the modeling process and the interpretation of prediction results from geospatial deep learning. The app can be viewed at https://bruce-zheng.shinyapps.io/glacier_segmententation/.

Paper: ISPRS International Journal of Geo-Information. 11(8). | PDF

Code:

Interactive Visualization of Spatial Omics Neighborhoods

Tinghui Xu, Kris Sankaran

Abstract: Dimensionality reduction of spatial omic data can reveal shared, spatially structured patterns of expression across a collection of genomic features. We study strategies for discovering and interactively visualizing low-dimensional structure in spatial omic data based on the construction of neighborhood features. We design quantile and network-based spatial features that result in spatially consistent embeddings. A simulation compares embeddings made with and without neighborhood-based featurization, and a re-analysis of [Keren et al., 2019] illustrates the overall workflow. We provide an R package, NBFvis, to support computation and interactive visualization for the proposed dimensionality reduction approach. Code and data for reproducing experiments and analysis are available at https://github.com/XTH1114/NBFvis.

Paper: F1000 Research | PDF

Code:

Multiscale Analysis of Count Data through Topic Alignment

Julia Fukuyama, Kris Sankaran, Laura Symul

Abstract: Topic modeling is a popular method used to describe biological count data. With topic models, the user must specify the number of topics K. Since there is no definitive way to choose K and since a true value might not exist, we develop techniques to study the relationships across models with different K. This can show how many topics are consistently present across different models, if a topic is only transiently present, or if a topic splits in two when K increases. This strategy gives more insight into the process generating the data than choosing a single value of K would. We design a visual representation of these cross-model relationships, which we call a topic alignment, and present three diagnostics based on it. We show the effectiveness of these tools for interpreting the topics on simulated and real data, and we release an accompanying R package, alto.

Paper: Biostatistics | PDF

Code:

Bootstrap Confidence Regions for Learned Feature Embeddings

Kris Sankaran

Abstract: Algorithmic feature learners provide high-dimensional vector representations for non-matrix structured signals, like images, audio, text, and graphs. Low-dimensional projections derived from these representations can be used to explore variation across collections of these data. However, it is not clear how to assess the uncertainty associated with these projections. We adapt methods developed for bootstrapping principal components analysis to the setting where features are learned from non-matrix data. We empirically compare the derived confidence regions in simulations, varying factors that influence both feature learning and the bootstrap. Approaches are illustrated on spatial proteomic data. Code, data, and trained models are released as an R compendium.
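To make the bootstrap idea concrete, here is a minimal sketch of a percentile bootstrap interval for a one-dimensional statistic. The paper concerns confidence regions for learned embeddings, which is a harder setting; this simplified example only illustrates the resample-and-recompute principle, and the function name, defaults, and data are assumptions.

```python
import random
import statistics


def bootstrap_percentile_interval(data, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for statistic(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    replicates = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    alpha = (1 - level) / 2
    lower = replicates[int(alpha * n_boot)]
    upper = replicates[int((1 - alpha) * n_boot) - 1]
    return lower, upper


# Hypothetical measurements.
data = [2.1, 3.4, 1.9, 4.8, 3.3, 2.7, 5.1, 3.9, 2.2, 4.0]
lower, upper = bootstrap_percentile_interval(data, statistics.mean)
print(lower, upper)
```

The same loop applies to any statistic that can be recomputed on a resample, which is what makes the approach attractive when analytical uncertainty formulas are unavailable.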

Paper: Journal of Computational and Graphical Statistics | PDF

Code:

Machine Learning for Glacier Monitoring in the Hindu Kush Himalaya

Shimaa Baraka, Benjamin Akera, Bibek Aryal, Tenzing Sherpa, Finu Shresta, Anthony Ortiz, Kris Sankaran, Juan Lavista Ferres, Mir Matin, Yoshua Bengio

Abstract: Glacier mapping is key to ecological monitoring in the Hindu Kush Himalaya (HKH) region. Climate change poses a risk to individuals whose livelihoods depend on the health of glacier ecosystems. In this work, we present a machine learning based approach to support ecological monitoring, with a focus on glaciers. Our approach is based on semi-automated mapping from satellite images. We utilize readily available remote sensing data to create a model to identify and outline both clean ice and debris-covered glaciers from satellite imagery. We also release data and develop a web tool that allows experts to visualize and correct model predictions, with the ultimate aim of accelerating the glacier mapping process.

Paper: Climate Change AI Workshop (Spotlight). December 9, 2020. | PDF

Code:

Tackling Climate Change with Machine Learning

David Rolnick, Priya L Donti, Lynn H Kaack, Kelly Kochanski, Alexandre Lacoste, Kris Sankaran, Andrew Slavin Ross, Nikola Milojevic-Dupont, Natasha Jaques, Anna Waldman-Brown, Alexandra Luccioni, Tegan Maharaj, Evan D Sherwin, S Karthik Mukkavilli, Konrad P Kording, Carla Gomes, Andrew Y Ng, Demis Hassabis, John C Platt, Felix Creutzig, Jennifer Chayes, Yoshua Bengio

Abstract: Climate change is one of the greatest challenges facing humanity, and we, as machine learning experts, may wonder how we can help. Here we describe how machine learning can be a powerful tool in reducing greenhouse gas emissions and helping society adapt to a changing climate. From smart grids to disaster management, we identify high impact problems where existing gaps can be filled by machine learning, in collaboration with other fields. Our recommendations encompass exciting research questions as well as promising business opportunities. We call on the machine learning community to join the global effort against climate change.

Paper: ACM Computing Surveys | PDF


Latent Variable Modeling for the Microbiome

Kris Sankaran, Susan P Holmes

Abstract: The human microbiome is a complex ecological system, and describing its structure and function under different environmental conditions is important from both basic scientific and medical perspectives. Viewed through a biostatistical lens, many microbiome analysis goals can be formulated as latent variable modeling problems. However, although probabilistic latent variable models are a cornerstone of modern unsupervised learning, they are rarely applied in the context of microbiome data analysis, in spite of the evolutionary, temporal, and count structure that could be directly incorporated through such models. We explore the application of probabilistic latent variable models to microbiome data, with a focus on Latent Dirichlet allocation, Non-negative matrix factorization, and Dynamic Unigram models. To develop guidelines for when different methods are appropriate, we perform a simulation study. We further illustrate and compare these techniques using the data of Dethlefsen and Relman (2011), a study on the effects of antibiotics on bacterial community composition. Code and data for all simulations and case studies are available publicly.

Paper: Biostatistics. October 2019. https://doi.org/10.1093/biostatistics/kxy018 | PDF

Code:

Multitable Methods for Microbiome Data Integration

Kris Sankaran, Susan P Holmes

Abstract: The simultaneous study of multiple measurement types is a frequently encountered problem in practical data analysis. It is especially common in microbiome research, where several sources of data (for example, 16S rRNA, metagenomic, metabolomic, or transcriptomic data) can be collected on the same physical samples. There has been a proliferation of proposals for analyzing such multitable microbiome data, as is often the case when new data sources become more readily available, facilitating inquiry into new types of scientific questions. However, stepping back from the rush for new methods for multitable analysis in the microbiome literature, it is worthwhile to recognize the broader landscape of multitable methods, as they have been relevant in problem domains ranging across economics, robotics, genomics, chemometrics, and neuroscience. In different contexts, these techniques are called data integration, multi-omic, and multitask methods, for example. Of course, there is no unique optimal algorithm to use across domains; different instances of the multitable problem possess specific structure or variation that are worth incorporating in methodology. Our purpose here is not to develop new algorithms, but rather to 1) distill relevant themes across different analysis approaches and 2) provide concrete workflows for approaching analysis, as a function of ultimate analysis goals and data characteristics (heterogeneity, dimensionality, sparsity). Towards the second goal, we have made code for all analysis and figures available online at https://github.com/krisrs1128/multitable_review.

Paper: Frontiers in Genetics. August 28, 2019. https://doi.org/10.3389/fgene.2019.00627 | PDF

Code:

Sex-specific Association between Gut Microbiome and Fat Distribution

Yan Min, Xiaoguang Ma, Kris Sankaran, Yuan Ru, Lijin Chen, Mike Baiocchi, Shankuan Zhu

Abstract: The gut microbiome has been linked to host obesity; however, sex-specific associations between microbiome and fat distribution are not well understood. Here we show sex-specific microbiome signatures contributing to obesity despite both sexes having similar gut microbiome characteristics, including overall abundance and diversity. Our comparisons of the taxa associated with the android fat ratio in men and women found that there is no widespread species-level overlap. We did observe overlap between the sexes at the genus and family levels in the gut microbiome, such as Holdemanella and Gemmiger; however, they had opposite correlations with fat distribution in men and women. Our findings support a role for fat distribution in sex-specific relationships with the composition of the microbiome. Our results suggest that studies of the gut microbiome and abdominal obesity-related disease outcomes should account for sex-specific differences.

Paper: Nature Communications. June 3, 2019. https://doi.org/10.1038/s41467-019-10440-5 | PDF

Code:

Interactive Visualization of Hierarchically Structured Data

Kris Sankaran, Susan P Holmes

Abstract: We introduce methods for visualization of data structured along trees, especially hierarchically structured collections of time series. To this end, we identify questions that often emerge when working with hierarchical data and provide an R package to simplify their investigation. Our key contribution is the adaptation of the visualization principles of focus-plus-context and linking to the study of tree-structured data. Our motivating application is to the analysis of bacterial time series, where an evolutionary tree relating bacteria is available a priori. However, we have identified common problem types where, if a tree is not directly available, it can be constructed from data and then studied using our techniques. We perform detailed case studies to describe the alternative use cases, interpretations, and utility of the proposed visualization methods.

Paper: Journal of Computational and Graphical Statistics. | PDF

Code:

Bioconductor Workflow for Microbiome Data Analysis - from raw reads to community analyses

Ben Callahan, Kris Sankaran, Julia Fukuyama, Paul Joey McMurdie, Susan P Holmes

Abstract: High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or OTU composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, whether parametric or nonparametric. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2 and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests and nonparametric testing using community networks and the ggnetwork package.

Paper: F1000 Research. | PDF

Code: