About me

I am a PhD candidate at the University of Southern California, advised by Prof. Aram Galstyan and Prof. Greg Ver Steeg. I do both applied and theoretical research on aspects of deep learning, often taking an information-theoretic perspective. My main research directions are (a) studying information stored in neural network weights or activations and its connections to generalization, memorization, stability, and learning dynamics; and (b) representation learning with the goal of enriching the learned representation with useful properties, such as minimality, disentanglement, modularity, and reduced synergy. More broadly, I am interested in generalization under domain shifts, unsupervised/self-supervised learning, the generalization phenomenon of deep neural networks, and estimation/approximation of information-theoretic quantities or their alternatives.


  • [May 16, 2022] Started a summer internship at Google Research, New York. Will be working with Ankit Singh Rawat and Aditya Menon.
  • [March 2, 2022] Our work “Failure Modes of Domain Generalization Algorithms” was accepted to CVPR 2022.
  • [Sept. 28, 2021] Our work “Information-theoretic generalization bounds for black-box learning algorithms” was accepted to NeurIPS 2021.
  • [May 17, 2021] Started a summer internship at AWS Custom Labels team. Will be working with Alessandro Achille and Avinash Ravichandran.
  • [Jan. 12, 2021] Our work “Estimating informativeness of samples with Smooth Unique Information” got accepted to ICLR 2021.
  • [Oct. 20, 2020] Received a free NeurIPS 2020 registration by making it to the list of the top 10% of high-scoring reviewers.
  • [June 3, 2020] Our work “Improving generalization by controlling label-noise information in neural network weights” got accepted to ICML 2020.
  • [May 18, 2020] Starting a summer internship at AWS Custom Labels team. Going to work with Alessandro Achille, Avinash Ravichandran, and Orchid Majumder!

Publications and preprints

Domain generalization failure modes
Tigran Galstyan, Hrayr Harutyunyan, Hrant Khachatrian, Greg Ver Steeg, Aram Galstyan
Failure Modes of Domain Generalization Algorithms
CVPR 2022 [arXiv, code 1 2, bibTeX]

We propose an evaluation framework for domain generalization algorithms that allows decomposition of the test error into components capturing distinct aspects of generalization. We show that the largest contributor to the generalization error varies across methods, datasets, regularization strengths and even training lengths. We observe two problems associated with the strategy of learning domain-invariant representations. On Colored MNIST, most domain generalization algorithms fail because they reach domain-invariance only on the training domains. On Camelyon-17, domain-invariance degrades the quality of representations on unseen domains. We hypothesize that focusing instead on tuning the classifier on top of a rich representation can be a promising direction.

f-CMI bounds
Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, Aram Galstyan
Information-theoretic generalization bounds for black-box learning algorithms
NeurIPS 2021 [arXiv, code, bibTeX]

We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.

sample information
Hrayr Harutyunyan, Alessandro Achille, Giovanni Paolini, Orchid Majumder, Avinash Ravichandran, Rahul Bhotika, Stefano Soatto
Estimating informativeness of samples with smooth unique information
ICLR 2021 [arXiv, code, bibTeX]

We define a notion of information that an individual sample provides to the training of a neural network, and we specialize it to measure both how much a sample informs the final weights and how much it informs the function computed by the weights. Though related, we show that these quantities have a qualitatively different behavior. We give efficient approximations of these quantities using a linearized network and demonstrate empirically that the approximation is accurate for real-world architectures, such as pre-trained ResNets. We apply these measures to several problems, such as dataset summarization, analysis of under-sampled classes, comparison of informativeness of different data sources, and detection of adversarial and corrupted examples.

avoiding memorization
Hrayr Harutyunyan, Kyle Reing, Greg Ver Steeg, Aram Galstyan
Improving generalization by controlling label-noise information in neural network weights
ICML 2020 [arXiv, code, bibTeX]

In the presence of noisy or incorrect labels, neural networks have the undesirable tendency to memorize information about the noise. We show that one can prevent memorization and improve generalization by controlling the Shannon mutual information between weights and the vector of all training labels given inputs, I(w : y ∣ x). To minimize this information, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. Our approach yields drastic improvements over standard training algorithms (such as training with cross-entropy loss) and outperforms competitive approaches designed for learning with noisy labels.

modular latent factor model
Greg Ver Steeg, Hrayr Harutyunyan, Daniel Moyer, Aram Galstyan
Fast structure learning with modular regularization
NeurIPS 2019 [arXiv, code, bibTeX]

We introduce a method, called linear CorEx, for learning latent factors such that the joint probability distribution becomes close to being a modular latent factor model (shown in the picture). The method has linear complexity w.r.t. the number of observed variables and works well in high-dimensional undersampled regimes. Furthermore, when the data comes from an approximately modular Gaussian latent factor model, linear CorEx exhibits a blessing of dimensionality!

temporal CorEx
Hrayr Harutyunyan, Daniel Moyer, Hrant Khachatrian, Greg Ver Steeg, Aram Galstyan
Efficient Covariance Estimation from Temporal Data
arXiv preprint [arXiv, code, bibTeX]

In this work, we extend linear CorEx to work with temporal data. The main method, T-CorEx, takes a multivariate time series divided into time periods and models the data of each time period with an instance of linear CorEx, such that the models vary smoothly over time. The method can be used for estimating the covariance matrix of observed variables at each time period, clustering time series, detecting change points, and extracting useful information. All these analyses can be done in less than an hour even when the data is truly high-dimensional (like an fMRI instance with 10^5 variables and 20 time periods).

mixhop architecture
Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard, Kristina Lerman, Greg Ver Steeg, Aram Galstyan
MixHop: Higher-order graph convolutional architectures via sparsified neighborhood mixing
ICML 2019 [arXiv, code, bibTeX]

This paper proposes a new graph convolutional network (GCN), called MixHop, that, in contrast to the vanilla GCN, is able to learn a general class of neighborhood mixing relationships. MixHop requires no additional memory or computational complexity. Additionally, the paper proposes a sparsity regularization that allows us to visualize how the network prioritizes neighborhood information across different graph datasets.

mimic benchmarking
Hrayr Harutyunyan, Hrant Khachatrian, David Kale, Greg Ver Steeg, Aram Galstyan
Multitask learning and benchmarking with clinical time series data
Nature Scientific Data 6 (1), 96 [arXiv, code, bibTeX]

The progress in machine learning for healthcare research has been difficult to measure because of the absence of publicly available benchmark data sets. To address this problem, we propose four clinical prediction benchmarks using data derived from the publicly available MIMIC-III database. These tasks cover a range of clinical problems including modeling risk of mortality, forecasting length of stay, detecting physiologic decline, and phenotype classification. We propose strong linear and neural baselines for all four tasks and evaluate the effect of deep supervision, multitask training and data-specific architectural modifications on the performance of neural models.

disentangling via synergy minimization
Greg Ver Steeg, Rob Brekelmans, Hrayr Harutyunyan, Aram Galstyan
Disentangled representations via synergy minimization
Allerton 2017 [arXiv, bibTeX]

If the factors comprising a representation allow us to make accurate predictions about our system, but obscuring any subset of the factors destroys our ability to make predictions, we say that the representation exhibits informational synergy. We argue that synergy is an undesirable feature in learned representations and that explicitly minimizing synergy can help disentangle the true factors of variation underlying data. We explore different ways of quantifying synergy, deriving new closed-form expressions in some cases, and then show how to modify learning to produce representations that are minimally synergistic. We introduce a benchmark task to disentangle separate characters from images of words. We demonstrate that Minimally Synergistic (MinSyn) representations correctly disentangle characters while methods relying on statistical independence fail.