Statistics & Data Science Seminar
Day & Time: Thursdays, 2:00 pm - 3:00 pm
Themes: Statistics & Data Science: their theory, methodology, and applications
Modality: Participants and speakers join via Zoom using the link: https://auburn.zoom.us/j/93758346031
Fall 2020 Schedule
Our Statistics & Data Science Seminar has ended for Fall 2020.
Upcoming Seminars
Our Statistics & Data Science Seminar will resume in Spring 2021.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Past Seminars
September 10: Roberto Molinari (Mathematics and
Statistics, Auburn University)
Title: SWAG: A Sparse Wrapper Algorithm with Applications in Gene Selection
Abstract: In genomics, an important goal is to achieve high predictive (classification)
power using as few gene expressions as possible to facilitate replicability
and interpretability for research and diagnostic purposes. Given these
goals, reliance on a single model/learner, possibly produced by a sparse
learning method, is in many cases unlikely to meet these needs. We put
forward a heuristic greedy algorithm that finds a set/library of extremely
sparse and highly predictive learners and discuss
how this procedure could be modified to select learners based on other
(non-convex) utility functions.
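As a concrete picture of the wrapper idea, the sketch below greedily screens
and grows very sparse variable subsets by cross-validation. It assumes
scikit-learn and a generic logistic-regression learner; it illustrates the
general scheme only and is not the authors' SWAG implementation.

```python
# A rough sketch of a greedy sparse-wrapper search, assuming scikit-learn
# and a gene-expression matrix X (samples x genes) with class labels y.
# Illustration of the general scheme only, not the authors' SWAG code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def cv_score(X, y, subset):
    """Cross-validated accuracy of a learner restricted to `subset`."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, list(subset)], y, cv=5).mean()

def greedy_sparse_wrapper(X, y, max_size=3, keep=20):
    """Grow a library of very sparse, highly predictive gene subsets."""
    n_genes = X.shape[1]
    # Screen all single-gene learners and keep the best `keep` of them.
    library = sorted(((cv_score(X, y, (j,)), (j,)) for j in range(n_genes)),
                     reverse=True)[:keep]
    # Repeatedly extend each retained subset by one extra gene.
    for _ in range(max_size - 1):
        candidates = {tuple(sorted(s + (j,)))
                      for _, s in library for j in range(n_genes) if j not in s}
        library = sorted(((cv_score(X, y, s), s) for s in candidates),
                         reverse=True)[:keep]
    return library  # list of (cv_score, gene_subset) pairs
```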
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
September 17: Luke Oeding (Mathematics and Statistics, Auburn University)
Title: Stochastic Alternating Least Squares for Tensor Decomposition
Abstract: Least Squares is the standard method for approximating the solution to
overdetermined systems of linear equations. It is known to converge quickly
to the optimal solution. A solved linear system can be seen as a
diagonalized system. For many applications data are multilinear, and we want
to use that structure. A multilinear analogue to a diagonalized system is a
rank decomposition. Alternating Least Squares is one standard method for
computing a rank decomposition of a tensor by reducing the problem to a
sequence of least squares optimizations. While
this method can be effective in some situations, it is limited because
it doesn’t always converge, and it can be computationally
expensive. However, when tensor data arrive sample by sample, we can
use stochastic methods to attempt to decompose a model tensor from
its samples. We show that under mild regularity and boundedness assumptions,
the Stochastic Alternating Least Squares (SALS) method converges. Even
though tensor problems often have high complexity, the tradeoff for using
sampling in place of exactness can lead to large savings in time and
resources. I’ll describe the SALS algorithm and its advantages, and I’ll
give some hints as to why (and when) it converges.
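For context, classical (non-stochastic) Alternating Least Squares for a
rank-R CP decomposition can be sketched in a few lines of NumPy, with each
update an ordinary least squares solve. This is a generic textbook sketch,
not the speaker's SALS code; SALS replaces these full solves with
sample-driven stochastic updates.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Kronecker product: (m*n) x R from (m x R) and (n x R)."""
    R = U.shape[1]
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, R)

def cp_als(T, R, n_iter=50, seed=0):
    """Classical CP-ALS for a 3-way tensor T: cycle through the factor
    matrices, updating each one by an ordinary least squares solve."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A, B, C = (rng.standard_normal((n, R)) for n in (I, J, K))
    for _ in range(n_iter):
        A = np.linalg.lstsq(khatri_rao(B, C),
                            T.reshape(I, -1).T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C),
                            np.moveaxis(T, 1, 0).reshape(J, -1).T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B),
                            np.moveaxis(T, 2, 0).reshape(K, -1).T, rcond=None)[0].T
    return A, B, C  # T[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r]
```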
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
September 24: Artur Manukyan (University of
Massachusetts Medical School & Broad Institute of MIT and Harvard)
Title: Graph-based Learning for
Class Cover Problem and Adaptive Clustering Algorithms using Statistical
Tests of Spatial Data Analysis
Abstract: In statistical learning, numerous methods are based on graphs.
One type of graph, called a proximity graph, offers solutions to many
challenges in supervised and unsupervised learning. Class cover catch
digraphs (CCCDs) are proximity digraphs introduced to investigate the class
cover problem (CCP). The goal of CCP is to find a set
of hyperballs such that their union encapsulates, or covers, a subset of the
training data from the class of interest. CCP is closely related to
statistical classification, and CCCDs achieve relatively good performance in
many tasks of statistical classification and clustering, such as imbalanced
learning and hot spot detection. We discuss the advantages of CCCDs in
statistical learning, focusing primarily on clustering algorithms based on
recently developed unsupervised versions of CCCDs, called cluster catch
digraphs (CCDs). These digraphs are used to devise
clustering methods that are hybrids of density-based and graph-based
methods. CCDs are appealing digraphs for partitioning and clustering of data
sets since they estimate the number of clusters without validation indices;
however, CCDs, and density-based methods in general, require parameters
representing the spatial intensity of the clusters assumed to exist in the
data set. We offer parameter-free versions of the CCD algorithm that do not
require specifying the spatial intensity parameter, whose choice is often
critical for finding an optimal partitioning of the data set. We approach the
problem of estimating the number of clusters by borrowing a tool from
spatial data analysis, namely Ripley's K function. We call our new digraphs
based on the K function R-CCDs. We show that the domination number of
R-CCDs locates and separates the clusters from the noise clusters in data
sets, and hence allows estimation of the true number of clusters. Our
parameter-free clustering algorithms are composed of methods that estimate
both the number of clusters and the spatial intensity parameter, making them
completely parameter-free. We conduct Monte Carlo simulations and real-life
experiments to compare R-CCDs with some commonly used density-based and
prototype-based clustering methods.
This is joint work with Elvan Ceyhan (Mathematics and
Statistics, Auburn University).
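For readers unfamiliar with Ripley's K function, a minimal
(edge-correction-free) estimate is easy to compute: under complete spatial
randomness, K(r) is approximately pi*r^2, and departures from this signal
clustering, which is what the R-CCD construction exploits. The snippet below
is a generic illustration assuming NumPy/SciPy, not the authors' code.

```python
import numpy as np
from scipy.spatial.distance import pdist

def ripley_k(points, radii, area=1.0):
    """Naive Ripley's K estimate (no edge correction) for n points."""
    n = len(points)
    d = pdist(points)  # all pairwise distances
    # For each radius, count ordered pairs closer than r and normalize.
    return np.array([area * 2.0 * np.sum(d <= r) / n**2 for r in radii])

rng = np.random.default_rng(1)
pts = rng.random((200, 2))            # CSR sample in the unit square
radii = np.linspace(0.02, 0.2, 10)
# Under complete spatial randomness K(r) ~ pi r^2, so ratios hover near 1;
# clustered data push these ratios well above 1 at small r.
print(ripley_k(pts, radii) / (np.pi * radii**2))
```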
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
October 1: Todd Steury (Wildlife Ecology, Auburn University)
Title: Confounding effects: the most devilish problem in the sciences
Abstract: Collinearity and confounding effects are pervasive problems in many scientific fields, especially those in which manipulative experimental studies are not possible. Yet many scientists do not fully understand how collinearity and confounding effects influence their results or what to do about them. In this talk, I explain what collinearity and confounding effects are, how they influence your statistical analyses and scientific conclusions, and what can be done to address problems with confounding effects and collinearity. I use simple statistical examples and simulations to explain collinearity and confounding effects in terms that even non-statisticians should be able to understand. Finally, I give several examples from natural resource fields (my area of expertise) of how confounding effects have resulted in erroneous scientific conclusions in the past.
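A minimal simulation of the kind mentioned in the abstract makes the problem
concrete: below, a confounder z drives both x and y, so a regression of y on
x alone reports a strong effect even though x has none. The snippet assumes
NumPy and statsmodels and is only an illustration, not the speaker's example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)             # confounder
x = 0.8 * z + rng.standard_normal(n)   # x is correlated with z ...
y = 2.0 * z + rng.standard_normal(n)   # ... but y depends only on z

naive = sm.OLS(y, sm.add_constant(x)).fit()
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()
print(naive.params[1])     # strongly "significant" effect of x (spurious)
print(adjusted.params[1])  # near zero once the confounder is included
```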
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
October 8: Alex
Vinel (Industrial and Systems Engineering, Auburn University)
Title: Explaining and predicting unsafe driving events among commercial
truck drivers: lessons learned from observing 20 million driving miles using
IoT sensors
Abstract: Highway transportation safety
is one of the most pressing global public health issues. With the emergence
of a multitude of sources of relevant information (such as real-time
weather, traffic and vehicle kinematic data) there is a potential for
employing advanced data analytics techniques to help address this problem.
In this talk we will discuss a set of studies that was based on a large data
set concerning commercial trucking operations in the US. We will review
approaches to characterizing the relationship between the risk factors and
incident risk, first with the goal of statistically explaining this
relationship, and then for the purpose of forecasting. In the former, we
model the underlying time series as a Bayesian hierarchical non-homogeneous
Poisson process, showing that the intensity of incidents (significantly)
increases with more driving and (significantly) decreases after rest breaks.
In the latter, we employ machine learning methods to evaluate the extent to
which it is possible to predict the probability of traffic incidents (and hence
evaluate the risk associated with a particular route).
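To make the modeling idea concrete, a (non-Bayesian, non-hierarchical)
stand-in for the first analysis is a Poisson regression on discretized
driving intervals with an exposure offset, letting the log-intensity depend
on cumulative driving time and rest breaks. The snippet below uses synthetic
data and hypothetical column names; it is not the authors' model.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# One row per (driver, half-hour interval); all columns are hypothetical.
intervals = pd.DataFrame({
    "n_events":     rng.poisson(0.05, 1000),    # incidents in the interval
    "hours_driven": rng.uniform(0, 11, 1000),   # cumulative driving time
    "post_rest":    rng.binomial(1, 0.3, 1000), # interval follows a rest break
    "exposure":     np.full(1000, 0.5),         # interval length in hours
})
# Poisson model for the event counts with log-exposure offset, i.e. a
# piecewise-constant intensity depending on driving time and rest breaks.
fit = smf.glm("n_events ~ hours_driven + post_rest", data=intervals,
              family=sm.families.Poisson(),
              offset=np.log(intervals["exposure"])).fit()
print(fit.summary())  # in real data: positive hours_driven, negative post_rest
```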
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
October 15: Santu Karmaker (Computer Science and Software Engineering, Auburn University)
Title: Data Science for "All"
Abstract: Big data is ubiquitous across domains, and more
and more stakeholders are choosing to use machine learning to get the most
out of their data. This pressing need has inspired researchers to develop
tools for Automated machine learning (AutoML), which is essentially
automating the process of applying machine learning to real-world problems.
But although automation and efficiency are among AutoML's main selling
points, the process still requires a surprising level of human involvement
from a data scientist and is still far from a “truly automatic system”. As a
result, AutoML tools are not yet directly usable by domain experts such as
doctors, business professionals, and social scientists, who have little or
no knowledge of machine learning or data science. In summary, data science
is not yet open to “all”. How can we make data science more accessible to
the general public? This talk will focus on this big question while
discussing three independent yet related lines of work. The first direction
is about helping general users annotate large corpora of text data with
minimal guidance. The second direction is about enabling end users to
perform interesting semantic analysis of unstructured data without worrying
about the underlying intricate details of machine learning. The third and
final direction lays out a vision for a Virtual Interactive Data Scientist,
a natural-dialog-based intelligent agent, a future Siri or Alexa of sorts,
that can assist users in solving real-life data science problems.
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
October 22: Whitney Huang (School
of Mathematical and Statistical Sciences, Clemson University)
Title: Airflow Recovery from Thoracic and Abdominal
Movements using Synchrosqueezing Transform and Locally Stationary Gaussian
Process Regression
Abstract: Airflow
signal encodes rich information about the respiratory system. While the gold
standard for measuring airflow is to use a spirometer with an occlusive
seal, this is not practical for ambulatory monitoring of patients. Advances
in sensor technology have made measurement of motion of the thorax and
abdomen feasible with small inexpensive devices, but estimation of airflow
from these time series is challenging. We propose to use the
synchrosqueezing transform, a nonlinear-type time-frequency analysis tool,
to properly
represent the thoracic and abdominal movement signals as the features, which
are used to recover the airflow by a locally stationary Gaussian process. We
show that, using a dataset that contains respiratory signals under normal
sleep conditions, an accurate prediction can be achieved by fitting the
proposed model in the feature space both in the intra- and inter-subject
setups. We also apply our method to a more challenging case, where subjects
under general anesthesia underwent transitions from pressure support to
unassisted ventilation to further demonstrate the utility of the proposed
method.
This is joint work with Yu-Min Chung (Math and Stat,
UNC-Greensboro), Yu-Bo Wang (SMSS, Clemson), Jeff Mandel (Anesthesiology &
Critical Care, UPenn), and Hau-Tieng Wu (Math and Stat, Duke).
The preprint of this work can be found at: https://arxiv.org/pdf/2008.04473.pdf
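As a rough illustration of the regression step, the snippet below fits an
ordinary (stationary) Gaussian process from movement-derived features to
airflow using scikit-learn. The paper's locally stationary GP and
synchrosqueezing-based features are more sophisticated; the arrays here are
synthetic placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
# Placeholder feature matrix (e.g., instantaneous amplitude/phase extracted
# from the thoracic/abdominal signals) and a placeholder airflow target.
features = rng.standard_normal((300, 4))
airflow = features @ np.array([1.0, -0.5, 0.3, 0.0]) \
          + 0.1 * rng.standard_normal(300)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(features[:200], airflow[:200])                    # train on one segment
pred, sd = gp.predict(features[200:], return_std=True)  # predict + uncertainty
```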
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
October 29: Hyungsuk (Tak) Tak (Statistics & Astronomy and
Astrophysics, Penn State)
Title: Time Delay
Cosmography Towards The Hubble Constant
Abstract: The Hubble constant is a core cosmological parameter that represents the
current expansion rate of the Universe. However, estimates for this quantity
have been inconsistent. Astronomers have been concerned about this
inconsistency, developing various methods to estimate the Hubble constant
independently. One such independent method is time delay cosmography.
This method is based on strong gravitational lensing, an effect in which
multiple images of the same astronomical object appear in the sky because
the light paths (from the object to the Earth) are bent by the strong
gravitational field of an intervening galaxy. This strong gravitational
lensing produces two types of data: multiple time series of brightness,
and imaging data of the lensing galaxy and the lensed source. We use the
time series data to infer time delays between the arrival times of the
multiply-lensed images, and the imaging data to estimate the gravitational
potential of the lensing galaxy. The Hubble constant can be estimated from
these quantities. In this talk, I explain the relationship among these three
components, i.e., time delays, gravitational potential, and the Hubble
constant, introducing data analytic challenges and our collaborative efforts
to overcome these challenges.
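For reference, the standard textbook relations behind this method (generic
notation, not necessarily the talk's) are:

```latex
% Standard time-delay cosmography relations: for a lens at redshift z_d,
\Delta t_{ij} = \frac{D_{\Delta t}}{c}\,\Delta\phi_{ij},
\qquad
D_{\Delta t} = (1 + z_d)\,\frac{D_d D_s}{D_{ds}} \propto \frac{1}{H_0},
% where \Delta\phi_{ij} is the Fermat potential difference between images
% i and j (from the lens model) and the D's are angular diameter distances.
% Measuring \Delta t_{ij} and modeling \Delta\phi_{ij} thus yields H_0.
```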
Recording: click to access part 1 and part 2.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
November 5: Yang Zhou (Computer Science and Software Engineering, Auburn University)
Title: Adversarial Machine Learning for Robust Prediction
Abstract: With
continued advances in science and technology, digital data have grown at an
astonishing rate in various domains and forms, such as business, geography,
health, multimedia, network, text, and web data. Designing and developing
machine learning algorithms that are robust against missing data, incomplete
observation, or errors in data collection is essential in many real-world
applications. Despite achieving remarkable performance, machine learning
models, especially deep learning models, are vulnerable to small
adversarial perturbations injected by malicious parties and users.
Given the need to understand the vulnerability and resilience of machine
learning, two questions arise: (1) How to develop effective modification
'attack' strategies to tamper with intrinsic characteristics of data by
injecting fake information? and (2) How to develop defense strategies to
offer sufficient protection to machine learning models against adversarial
attacks?
In this talk, I will introduce problems, challenges, and
solutions for characterizing, understanding, and learning the vulnerability
and resilience of machine learning under adversarial attacks. I will also
discuss our recent work on adversarial learning over network and text data.
I will conclude the talk by sketching interesting future directions for
adversarial machine learning.
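As a concrete example of the kind of perturbation attack in question, the
textbook fast gradient sign method (FGSM) can be written in a few lines of
PyTorch; the talk's attacks on network and text data are more involved.
Here `model` is assumed to be a trained classifier on inputs in [0, 1].

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """One-step attack: move x by eps in the sign of the loss gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # classification loss at x
    loss.backward()
    # Small, worst-case-direction perturbation, clipped to valid pixel range.
    return (x + eps * x.grad.sign()).detach().clamp(0.0, 1.0)
```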
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
November 12: Ernest Fokoué (School of Mathematical Sciences, Rochester Institute of Technology)
Title: On the Ubiquity of Kernels in Statistical Machine Learning
Abstract: In this lecture, I will present a general tour of some of the most commonly
used kernel methods in statistical machine learning and data mining. I will
touch on elements of artificial neural networks and then highlight their
intricate connections to some general-purpose kernel methods like Gaussian
process learning machines. I will also resurrect the famous universal
approximation theorem and will most likely ignite a [controversial] debate
around the theme: could it be that [shallow] networks like radial basis
function networks or Gaussian processes are all we need for well-behaved
functions? Do we really need many hidden layers, as the hype around Deep
Neural Network architectures seems to suggest, or should we heed Ockham’s
principle of parsimony, namely “Entities should not be multiplied beyond
necessity.” (“Entia non sunt multiplicanda praeter necessitatem.”)
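One way to see the kernel connection the lecture highlights: kernel ridge
regression with an RBF kernel yields exactly the posterior mean of a
Gaussian process with the same kernel and matching noise variance. A small
NumPy illustration (a generic sketch, not the speaker's material):

```python
import numpy as np

def rbf_kernel(A, B, ell=1.0):
    """Squared-exponential kernel matrix between rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq / ell ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))                   # training inputs
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
Xs = np.linspace(-3, 3, 100)[:, None]             # test inputs

noise = 0.01                                      # noise variance sigma^2
# Kernel ridge weights (ridge parameter = noise variance) ...
alpha = np.linalg.solve(rbf_kernel(X, X) + noise * np.eye(50), y)
# ... give the GP posterior mean at the test points.
gp_mean = rbf_kernel(Xs, X) @ alpha
```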
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
November 19: Mikael Kuusela
(Statistics and Data Science, Carnegie Mellon University)
Title: Objective Frequentist Uncertainty Quantification for Atmospheric Carbon Dioxide Retrievals
Abstract: The steadily increasing
amount of atmospheric carbon dioxide is having an unprecedented impact on
the global climate system. In order to better understand the sources and
sinks of CO2, NASA operates the Orbiting Carbon Observatory-2 & 3
instruments to monitor CO2 from space. These instruments measure the
radiance of the sunlight reflected off the Earth's surface, which is then
inverted to obtain CO2 estimates. In this work, we first analyze the current
operational retrieval procedure, which uses a prior distribution to
regularize the underlying ill-posed inverse problem, and demonstrate that
the resulting uncertainties might be poorly calibrated both at individual
locations and over a spatial region. To alleviate these issues, we propose a
new method that uses known physical constraints and direct inversion of
functionals of the CO2 profile to construct well-calibrated frequentist
confidence intervals based on convex programming. Furthermore, we study the
influence of individual nuisance variables on the length of the intervals
and identify certain key variables that can greatly reduce the final
uncertainty given additional deterministic or probabilistic constraints.
This is joint work with Pratik Patil (Carnegie Mellon University) and
Jonathan Hobbs (Jet Propulsion Laboratory).
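A toy sketch of the interval-construction idea, assuming cvxpy: bound a
linear functional of the unknown state over all states that satisfy the
physical constraints and fit the data to within a noise tolerance. All
quantities below are synthetic placeholders, and the actual method's
calibration is considerably more careful than this ad hoc tolerance.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 20))          # stand-in forward (radiance) operator
x_true = np.abs(rng.standard_normal(20))   # nonnegative "CO2 profile"
y = A @ x_true + 0.1 * rng.standard_normal(40)
h = np.ones(20) / 20                       # functional of interest, e.g. mean CO2

x = cp.Variable(20)
# Feasible set: physically admissible states consistent with the data.
constraints = [cp.norm(y - A @ x) <= 0.15 * np.sqrt(40), x >= 0]
lo = cp.Problem(cp.Minimize(h @ x), constraints).solve()
hi = cp.Problem(cp.Maximize(h @ x), constraints).solve()
print(lo, hi)                              # interval endpoints for h @ x_true
```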
Recording: click here to access.
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------