Statistics & Data Science Seminar
Day & Time: Thursdays, 2:00 pm - 3:00 pm
Themes: Statistics & Data Science: their theory, methodology, and applications
Modality: Participants and speakers join via Zoom using the link: https://auburn.zoom.us/j/93758346031
Fall 2020 Schedule
Our Statistics & Data Science Seminar has ended for Fall 2020.
Upcoming Seminars
Our Statistics & Data Science Seminar will resume in Spring 2021.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Past Seminars
September 10: Roberto Molinari (Mathematics and
Statistics, Auburn University)
Title: SWAG: A Sparse Wrapper Algorithm with Applications in Gene Selection
Abstract: In genomics, an important goal is to achieve high predictive (classification)
power using as few gene expressions as possible to facilitate replicability
and interpretability for research and diagnostic purposes. Given these
goals, reliance on a single model/learner, possibly produced by a sparse
learning method, is in many cases unlikely to meet these needs. We put
forward a heuristic greedy algorithm that finds a set/library of extremely
sparse and highly predictive learners and discuss
how this procedure could be modified to select learners based on other
(non-convex) utility functions.
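As a concrete picture of the wrapper idea, the sketch below greedily screens
and grows very sparse variable subsets by cross-validation. It assumes
scikit-learn and a generic logistic-regression learner; it illustrates the
general scheme only and is not the authors' SWAG implementation.

```python
# A rough sketch of a greedy sparse-wrapper search, assuming scikit-learn
# and a gene-expression matrix X (samples x genes) with class labels y.
# Illustration of the general scheme only, not the authors' SWAG code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_score(X, y, subset):
    """Cross-validated accuracy of a learner restricted to `subset`."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, list(subset)], y, cv=5).mean()

def greedy_sparse_wrapper(X, y, max_size=3, keep=20):
    """Grow a library of very sparse, highly predictive gene subsets."""
    n_genes = X.shape[1]
    # Screen all single-gene learners and keep the best `keep` of them.
    library = sorted(((cv_score(X, y, (j,)), (j,)) for j in range(n_genes)),
                     reverse=True)[:keep]
    # Repeatedly extend each retained subset by one extra gene.
    for _ in range(max_size - 1):
        candidates = {tuple(sorted(s + (j,)))
                      for _, s in library for j in range(n_genes) if j not in s}
        library = sorted(((cv_score(X, y, s), s) for s in candidates),
                         reverse=True)[:keep]
    return library  # list of (cv_score, gene_subset) pairs
```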
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
September 17: Luke Oeding (Mathematics and Statistics, Auburn University)
Title: Stochastic Alternating Least Squares for Tensor Decomposition
Abstract: Least Squares is the standard method for approximating the solution to
overdetermined systems of linear equations. It is known to converge quickly
to the optimal solution. A solved linear system can be seen as a
diagonalized system. For many applications data are multilinear, and we want
to use that structure. A multilinear analogue to a diagonalized system is a
rank decomposition. Alternating Least Squares is one standard method for
computing a rank decomposition of a tensor by reducing the problem to a
sequence of least squares optimizations. While
this method can be effective in some situations, it is limited because
it doesn’t always converge, and it can be computationally
expensive. However, when tensor data arrive sample by sample, we can
use stochastic methods to attempt to decompose a model tensor from
its samples. We show that under mild regularity and boundedness assumptions,
the Stochastic Alternating Least Squares (SALS) method converges. Even
though tensor problems often have high complexity, the tradeoff for using
sampling in place of exactness can lead to large savings in time and
resources. I’ll describe the SALS algorithm and its advantages, and I’ll
give some hints as to why (and when) it converges.
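For context, classical (non-stochastic) Alternating Least Squares for a
rank-R CP decomposition can be sketched in a few lines of NumPy, with each
update an ordinary least squares solve. This is a generic textbook sketch,
not the speaker's SALS code; SALS replaces these full solves with
sample-driven stochastic updates.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (m*n) x R from (m x R) and (n x R)."""
    R = U.shape[1]
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, R)

def cp_als(T, R, n_iter=50, seed=0):
    """Classical CP-ALS for a 3-way tensor T: cycle through the factor
    matrices, updating each one by an ordinary least squares solve."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
    for _ in range(n_iter):
        A = np.linalg.lstsq(khatri_rao(B, C),
                            T.reshape(I, -1).T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C),
                            np.moveaxis(T, 1, 0).reshape(J, -1).T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B),
                            np.moveaxis(T, 2, 0).reshape(K, -1).T, rcond=None)[0].T
    return A, B, C  # T[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r]
```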
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
September 24: Artur Manukyan (University of
Massachusetts Medical School & Broad Institute of MIT and Harvard)
Title: Graph-based Learning for
Class Cover Problem and Adaptive Clustering Algorithms using Statistical
Tests of Spatial Data Analysis
Abstract: In statistical learning, numerous methods are based on graphs.
One type of graph, called a proximity graph, offers solutions to many
challenges in supervised and unsupervised learning. Class cover catch
digraphs (CCCDs) are proximity digraphs introduced to investigate the class
cover problem (CCP). The goal of CCP is to find a set
of hyperballs such that their union encapsulates, or covers, a subset of the
training data from the class of interest. CCP is closely related to
statistical classification, and CCCDs achieve relatively good performance in
many tasks of statistical classification and clustering, such as imbalanced
learning and hot spot detection. We discuss the advantages of CCCDs in
statistical learning, focusing primarily on clustering algorithms based on
recently developed unsupervised versions of CCCDs, called cluster catch
digraphs (CCDs). These digraphs are used to devise
clustering methods that are hybrids of density-based and graph-based
methods. CCDs are appealing digraphs for partitioning and clustering of data
sets since they estimate the number of clusters without validation indices;
however, CCDs, and density-based methods in general, require parameters
representing the spatial intensity of the clusters assumed to exist in the
data set. We offer parameter-free versions of the CCD algorithm that do not
require specifying the spatial intensity parameter, whose choice is often
critical for finding an optimal partitioning of the data set. We approach the
problem of estimating the number of clusters by borrowing a tool from
spatial data analysis, namely Ripley's K function. We call our new digraphs
based on the K function R-CCDs. We show that the domination number of
R-CCDs locates and separates the clusters from the noise clusters in data
sets, and hence allows estimation of the true number of clusters. Our
parameter-free clustering algorithms are composed of methods that estimate
both the number of clusters and the spatial intensity parameter, making them
completely parameter-free. We conduct Monte Carlo simulations and real-life
experiments to compare R-CCDs with some commonly used density-based and
prototype-based clustering methods.
This is joint work with Elvan Ceyhan (Mathematics and
Statistics, Auburn University).
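For readers unfamiliar with Ripley's K function, a minimal
(edge-correction-free) estimate is easy to compute: under complete spatial
randomness, K(r) is approximately pi*r^2, and departures from this signal
clustering, which is what the R-CCD construction exploits. The snippet below
is a generic illustration assuming NumPy/SciPy, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import pdist

def ripley_k(points, radii, area=1.0):
    """Naive Ripley's K estimate (no edge correction) for n points."""
    n = len(points)
    d = pdist(points)  # all pairwise distances
    # For each radius, count ordered pairs closer than r and normalize.
    return np.array([area * 2.0 * np.sum(d <= r) / n**2 for r in radii])

rng = np.random.default_rng(1)
pts = rng.random((200, 2))            # CSR sample in the unit square
radii = np.linspace(0.02, 0.2, 10)
# Under complete spatial randomness K(r) ~ pi r^2, so ratios hover near 1;
# clustered data push these ratios well above 1 at small r.
print(ripley_k(pts, radii) / (np.pi * radii**2))
```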
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
October 1: Todd Steury (Wildlife Ecology, Auburn University)
Title: Confounding effects: the most devilish problem in the sciences
Abstract: Collinearity and confounding effects are pervasive problems in many scientific fields, especially those in which manipulative experimental studies are not possible. Yet many scientists do not fully understand how collinearity and confounding effects influence their results or what to do about them. In this talk, I explain what collinearity and confounding effects are, how they influence your statistical analyses and scientific conclusions, and what can be done to address problems with confounding effects and collinearity. I use simple statistical examples and simulations to explain collinearity and confounding effects in terms that even non-statisticians should be able to understand. Finally, I give several examples from natural resource fields (my area of expertise) of how confounding effects have resulted in erroneous scientific conclusions in the past.
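A minimal simulation of the kind mentioned in the abstract makes the problem
concrete: below, a confounder z drives both x and y, so a regression of y on
x alone reports a strong effect even though x has none. The snippet assumes
NumPy and statsmodels and is only an illustration, not the speaker's example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)             # confounder
x = 0.8 * z + rng.standard_normal(n)   # x is correlated with z ...
y = 2.0 * z + rng.standard_normal(n)   # ... but y depends only on z

naive = sm.OLS(y, sm.add_constant(x)).fit()
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
print(naive.params[1])     # strongly "significant" effect of x (spurious)
print(adjusted.params[1])  # near zero once the confounder is included
```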
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
October 8: Alex
Vinel (Industrial and Systems Engineering, Auburn University)
Title: Explaining and predicting unsafe driving events among commercial
truck drivers: lessons learned from observing 20 million driving miles using
IoT sensors
Abstract: Highway transportation safety
is one of the most pressing global public health issues. With the emergence
of a multitude of sources of relevant information (such as real-time
weather, traffic and vehicle kinematic data) there is a potential for
employing advanced data analytics techniques to help address this problem.
In this talk we will discuss a set of studies that was based on a large data
set concerning commercial trucking operations in the US. We will review
approaches to characterizing the relationship between the risk factors and
incident risk, first with the goal of statistically explaining this
relationship, and then for the purpose of forecasting. In the former, we
model the underlying time series as a Bayesian hierarchical non-homogeneous
Poisson process, showing that the intensity of incidents (significantly)
increases with more driving and (significantly) decreases after rest breaks.
In the latter, we employ machine learning methods to evaluate the extent to
which it is possible to predict the probability of traffic incidents (and hence
evaluate the risk associated with a particular route).
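To make the modeling idea concrete, a (non-Bayesian, non-hierarchical)
stand-in for the first analysis is a Poisson regression on discretized
driving intervals with an exposure offset, letting the log-intensity depend
on cumulative driving time and rest breaks. The snippet below uses synthetic
data and hypothetical column names; it is not the authors' model.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# One row per (driver, half-hour interval); all columns are hypothetical.
intervals = pd.DataFrame({
    "n_events":     rng.poisson(0.05, 1000),    # incidents in the interval
    "hours_driven": rng.uniform(0, 11, 1000),   # cumulative driving time
    "post_rest":    rng.binomial(1, 0.3, 1000), # interval follows a rest break
    "exposure":     np.full(1000, 0.5),         # interval length in hours
})
# Poisson model for the event counts with log-exposure offset, i.e. a
# piecewise-constant intensity depending on driving time and rest breaks.
fit = smf.glm("n_events ~ hours_driven + post_rest", data=intervals,
              family=sm.families.Poisson(),
              offset=np.log(intervals["exposure"])).fit()
print(fit.summary())  # in real data: positive hours_driven, negative post_rest
```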
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
October 15: Santu Karmaker (Computer Science and Software Engineering, Auburn University)
Title: Data Science for "All"
Abstract: Big data is ubiquitous across domains, and more
and more stakeholders are choosing to use machine learning to get the most
out of their data. This pressing need has inspired researchers to develop
tools for Automated machine learning (AutoML), which is essentially
automating the process of applying machine learning to real-world problems.
But although automation and efficiency are among AutoML's main selling
points, the process still requires a surprising level of human involvement
from a data scientist and is still far from a “truly automatic system”. As a
result, AutoML tools are not yet directly usable by domain experts such as
doctors, business professionals, and social scientists, who have little or
no knowledge of machine learning or data science. In summary, data science
is not yet open to “all”. How can we make data science more accessible to
the general public? This talk will focus on this big question while
discussing three independent yet related lines of work. The first direction
is about helping general users annotate large corpora of text data with
minimal guidance. The second direction is about enabling end users to
perform interesting semantic analysis of unstructured data without worrying
about the underlying intricate details of machine learning. The third and
final direction lays out a vision for a Virtual Interactive Data Scientist,
a natural-dialog-based intelligent agent, a future Siri or Alexa of sorts,
that can assist users in solving real-life data science problems.
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
October 22: Whitney Huang (School
of Mathematical and Statistical Sciences, Clemson University)
Title: Airflow Recovery from Thoracic and Abdominal
Movements using Synchrosqueezing Transform and Locally Stationary Gaussian
Process Regression
Abstract: Airflow
signal encodes rich information about the respiratory system. While the gold
standard for measuring airflow is to use a spirometer with an occlusive
seal, this is not practical for ambulatory monitoring of patients. Advances
in sensor technology have made measurement of motion of the thorax and
abdomen feasible with small inexpensive devices, but estimation of airflow
from these time series is challenging. We propose to use the
synchrosqueezing transform, a nonlinear-type time-frequency analysis tool,
to properly
represent the thoracic and abdominal movement signals as the features, which
are used to recover the airflow by a locally stationary Gaussian process. We
show that, using a dataset that contains respiratory signals under normal
sleep conditions, an accurate prediction can be achieved by fitting the
proposed model in the feature space both in the intra- and inter-subject
setups. We also apply our method to a more challenging case, where subjects
under general anesthesia underwent transitions from pressure support to
unassisted ventilation to further demonstrate the utility of the proposed
method.
This is joint work with Yu-Min Chung (Math and Stat,
UNC-Greensboro), Yu-Bo Wang (SMSS, Clemson), Jeff Mandel (Anesthesiology &
Critical Care, UPenn), and Hau-Tieng Wu (Math and Stat, Duke).
The preprint of this work can be found at: https://arxiv.org/pdf/2008.04473.pdf
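As a rough illustration of the regression step, the snippet below fits an
ordinary (stationary) Gaussian process from movement-derived features to
airflow using scikit-learn. The paper's locally stationary GP and
synchrosqueezing-based features are more sophisticated; the arrays here are
synthetic placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# Placeholder feature matrix (e.g., instantaneous amplitude/phase extracted
# from the thoracic/abdominal signals) and a placeholder airflow target.
features = rng.standard_normal((300, 4))
airflow = features @ np.array([1.0, -0.5, 0.3, 0.0]) \
          + 0.1 * rng.standard_normal(300)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(features[:200], airflow[:200])                    # train on one segment
pred, sd = gp.predict(features[200:], return_std=True)  # predict + uncertainty
```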
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
October 29: Hyungsuk (Tak) Tak (Statistics & Astronomy and
Astrophysics, Penn State)
Title: Time Delay
Cosmography Towards The Hubble Constant
Abstract: The Hubble constant is a core cosmological parameter that represents the
current expansion rate of the Universe. However, estimates for this quantity
have been inconsistent. Astronomers have been concerned about this
inconsistency, developing various methods to estimate the Hubble constant
independently. One such independent method is time delay cosmography.
This method is based on strong gravitational lensing, an effect in which
multiple images of the same astronomical object appear in the sky because
the light paths (from the object to the Earth) are bent by the strong
gravitational field of an intervening galaxy. This strong gravitational
lensing produces two types of data: multiple time series of brightness,
and imaging data of the lensing galaxy and the lensed source. We use the
time series data to infer time delays between the arrival times of the
multiply-lensed images, and the imaging data to estimate the gravitational
potential of the lensing galaxy. The Hubble constant can be estimated from
these quantities. In this talk, I explain the relationship among these three
components, i.e., time delays, gravitational potential, and the Hubble
constant, introducing data analytic challenges and our collaborative efforts
to overcome these challenges.
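For reference, the standard textbook relations behind this method (generic
notation, not necessarily the talk's) are:

```latex
% Standard time-delay cosmography relations: for a lens at redshift z_d,
\Delta t_{ij} = \frac{D_{\Delta t}}{c}\,\Delta\phi_{ij},
\qquad
D_{\Delta t} = (1 + z_d)\,\frac{D_d D_s}{D_{ds}} \propto \frac{1}{H_0},
% where \Delta\phi_{ij} is the Fermat potential difference between images
% i and j (from the lens model) and the D's are angular diameter distances.
% Measuring \Delta t_{ij} and modeling \Delta\phi_{ij} thus yields H_0.
```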
Recording: click to access part 1 and part 2.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
November 5: Yang Zhou (Computer Science and Software Engineering, Auburn University)
Title: Adversarial Machine Learning for Robust Prediction
Abstract: With
continued advances in science and technology, digital data have grown at an
astonishing rate in various domains and forms, such as business, geography,
health, multimedia, network, text, and web data. Designing and developing
machine learning algorithms that are robust against missing data, incomplete
observation, or errors in data collection is essential in many real-world
applications. Despite achieving remarkable performance, machine learning
models, especially deep learning models, are vulnerable to small
adversarial perturbations injected by malicious parties and users.
Given the need to understand the vulnerability and resilience of machine
learning, two questions arise: (1) How to develop effective modification
'attack' strategies to tamper with intrinsic characteristics of data by
injecting fake information? and (2) How to develop defense strategies to
offer sufficient protection to machine learning models against adversarial
attacks?
In this talk, I will introduce problems, challenges, and
solutions for characterizing, understanding, and learning the vulnerability
and resilience of machine learning under adversarial attacks. I will also
discuss our recent work on adversarial learning over network and text data.
I will conclude the talk by sketching interesting future directions for
adversarial machine learning.
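As a concrete example of the kind of perturbation attack in question, the
textbook fast gradient sign method (FGSM) can be written in a few lines of
PyTorch; the talk's attacks on network and text data are more involved.
Here `model` is assumed to be a trained classifier on inputs in [0, 1].

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """One-step attack: move x by eps in the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # classification loss at x
    loss.backward()
    # Small, worst-case-direction perturbation, clipped to valid pixel range.
    return (x + eps * x.grad.sign()).detach().clamp(0.0, 1.0)
```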
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
November 12: Ernest Fokoué (School of Mathematical Sciences, Rochester Institute of Technology)
Title: On the Ubiquity of Kernels in Statistical Machine Learning
Abstract: In this lecture, I will present a general tour of some of the most commonly
used kernel methods in statistical machine learning and data mining. I will
touch on elements of artificial neural networks and then highlight their
intricate connections to some general-purpose kernel methods like Gaussian
process learning machines. I will also resurrect the famous universal
approximation theorem and will most likely ignite a [controversial] debate
around the theme: could it be that [shallow] networks like radial basis
function networks or Gaussian processes are all we need for well-behaved
functions? Do we really need many hidden layers, as the hype around Deep
Neural Network architectures seems to suggest, or should we heed Ockham’s
principle of parsimony, namely “Entities should not be multiplied beyond
necessity.” (“Entia non sunt multiplicanda praeter necessitatem.”)
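One way to see the kernel connection the lecture highlights: kernel ridge
regression with an RBF kernel yields exactly the posterior mean of a
Gaussian process with the same kernel and matching noise variance. A small
NumPy illustration (a generic sketch, not the speaker's material):

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    """Squared-exponential kernel matrix between rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / ell ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))                   # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
Xs = np.linspace(-3, 3, 100)[:, None]             # test inputs

noise = 0.01                                      # noise variance sigma^2
# Kernel ridge weights (ridge parameter = noise variance) ...
alpha = np.linalg.solve(rbf_kernel(X, X) + noise * np.eye(50), y)
# ... give the GP posterior mean at the test points.
gp_mean = rbf_kernel(Xs, X) @ alpha
```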
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
November 19: Mikael Kuusela
(Statistics and Data Science, Carnegie Mellon University)
Title: Objective Frequentist Uncertainty Quantification for Atmospheric Carbon Dioxide Retrievals
Abstract: The steadily increasing
amount of atmospheric carbon dioxide is having an unprecedented impact on
the global climate system. In order to better understand the sources and
sinks of CO2, NASA operates the Orbiting Carbon Observatory-2 & 3
instruments to monitor CO2 from space. These instruments measure the
radiance of the sunlight reflected off the Earth's surface, which is then
inverted to obtain CO2 estimates. In this work, we first analyze the current
operational retrieval procedure, which uses a prior distribution to
regularize the underlying ill-posed inverse problem, and demonstrate that
the resulting uncertainties might be poorly calibrated both at individual
locations and over a spatial region. To alleviate these issues, we propose a
new method that uses known physical constraints and direct inversion of
functionals of the CO2 profile to construct well-calibrated frequentist
confidence intervals based on convex programming. Furthermore, we study the
influence of individual nuisance variables on the length of the intervals
and identify certain key variables that can greatly reduce the final
uncertainty given additional deterministic or probabilistic constraints.
This is joint work with Pratik Patil (Carnegie Mellon University) and
Jonathan Hobbs (Jet Propulsion Laboratory).
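A toy sketch of the interval-construction idea, assuming cvxpy: bound a
linear functional of the unknown state over all states that satisfy the
physical constraints and fit the data to within a noise tolerance. All
quantities below are synthetic placeholders, and the actual method's
calibration is considerably more careful than this ad hoc tolerance.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))          # stand-in forward (radiance) operator
x_true = np.abs(rng.standard_normal(20))   # nonnegative "CO2 profile"
y = A @ x_true + 0.1 * rng.standard_normal(40)
h = np.ones(20) / 20                       # functional of interest, e.g. mean CO2

x = cp.Variable(20)
# Feasible set: physically admissible states consistent with the data.
constraints = [cp.norm(y - A @ x) <= 0.15 * np.sqrt(40), x >= 0]
lo = cp.Problem(cp.Minimize(h @ x), constraints).solve()
hi = cp.Problem(cp.Maximize(h @ x), constraints).solve()
print(lo, hi)                              # interval endpoints for h @ x_true
```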
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------