About me

I am a Research Scientist at Google DeepMind (GenAI), advancing the frontiers of large language models through novel architectures and efficient pretraining strategies. My work at Google has included extending model context length by over 30x, contributing to the Gemini Nano models, and enhancing YouTube search with state-of-the-art AI. More broadly, I am interested in in-context learning, long context, length generalization, and reasoning capabilities of LLMs.

I obtained my Ph.D. in Computer Science from the University of Southern California, where I was fortunate to be advised by Aram Galstyan and Greg Ver Steeg. My thesis was on information stored in neural network weights or activations, and its connections to generalization, memorization, stability and learning dynamics. Prior to USC, I received my M.S. and B.S. degrees in Applied Mathematics and Computer Science from Yerevan State University. When I’m not training models, I enjoy reading, cinema, playing pool, and exploring NYC.

Recent Highlights

Old Highlights

Publications and preprints

MoR
Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
NeurIPS 2025, [paper, bibTeX]

Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
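
To make the routing idea concrete, here is a minimal PyTorch sketch of a recursive block with per-token depth routing (my own illustration under simplifying assumptions, not the paper's implementation; the real model also restricts attention and KV caching to the tokens still active at each depth, which this toy version only approximates by masking the update).

```python
# Illustrative sketch of Mixture-of-Recursions-style routing; all names are invented.
import torch
import torch.nn as nn

class MoRSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, max_recursions=3):
        super().__init__()
        self.max_recursions = max_recursions
        # One shared stack of layers reused at every recursion step (parameter sharing).
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Lightweight router: assigns each token its own recursion depth.
        self.router = nn.Linear(d_model, max_recursions)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        depths = self.router(x).argmax(-1) + 1              # per-token depth in {1, ..., max_recursions}
        h = x
        for step in range(self.max_recursions):
            active = (depths > step).unsqueeze(-1)          # tokens that keep recursing at this step
            h = torch.where(active, self.shared_block(h), h)
        return h

out = MoRSketch()(torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```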

cot2
Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak
Continuous Chain of Thought Enables Parallel Exploration and Reasoning
ICML 2025 TokShop Workshop, [paper, bibTeX]

Modern language models generate chain-of-thought traces by autoregressively sampling tokens from a finite vocabulary. While this discrete sampling has achieved remarkable success, conducting chain-of-thought with continuously-valued tokens (CoT2) offers a richer and more expressive alternative. Our work provides new theoretical guarantees and algorithms for CoT2, motivated by logical reasoning tasks that inherently require search capabilities. Theoretically, we establish how CoT2 facilitates the model to track multiple discrete traces in parallel; and quantify the level of achievable parallelism and its benefits for inference efficiency. We also provide a CoT2-based one-layer transformer construction that solves the combinatorial "subset sum problem" given a sufficient embedding dimension. These insights arise from a novel and effective supervision strategy where we match the language model outputs to the empirical token distributions of a set of target traces. Complementing this, we introduce sampling strategies that unlock policy optimization methods for CoT2. Our primary strategy samples and composes K discrete tokens at each decoding step to control the level of parallelism. Experiments confirm that (i) the optimal level of parallelism is governed by the embedding dimension, (ii) our continuous supervision strategy can outperform alternative methods, and (iii) policy optimization with CoT2 indeed improves the performance of the model beyond its initial discrete or continuous supervision.
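
The composition step is the part a short sketch clarifies best. Below is my own toy rendering of one CoT2-style decoding step (the paper's exact supervision and sampling strategies differ): sample K discrete candidate tokens and feed back a probability-weighted mixture of their embeddings as a single continuous token.

```python
# Toy sketch of composing K sampled tokens into one continuous token; illustrative only.
import torch

def cot2_step(logits, embedding_table, k=4):
    """logits: (vocab,) next-token logits; embedding_table: (vocab, d) token embeddings."""
    probs = torch.softmax(logits, dim=-1)
    idx = torch.multinomial(probs, num_samples=k, replacement=False)  # K discrete candidates
    weights = probs[idx] / probs[idx].sum()                           # renormalize over the sample
    return (weights.unsqueeze(-1) * embedding_table[idx]).sum(0)      # continuous composed token, shape (d,)

vocab, d = 1000, 64
token = cot2_step(torch.randn(vocab), torch.randn(vocab, d), k=4)
print(token.shape)  # torch.Size([64])
```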

RRT
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster
Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
ICLR 2025, [paper, bibTeX]

Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as a form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller "Recursive Transformers" that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our recursive models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge distillation baselines---and can even recover most of the performance of the original "full-size" model (e.g., Gemma 2B with no shared parameters). Finally, we propose Continuous Depth-wise Batching, a promising new inference paradigm enabled by the Recursive Transformer when paired with early exiting. In a theoretical analysis, we show that this has the potential to lead to significant (2-3x) gains in inference throughput.
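
As a rough picture of "layer tying plus depth-wise LoRA", here is a hypothetical sketch (not the paper's code): a single shared projection is reused at every recursion depth, while each depth carries its own low-rank delta that relaxes the strict tying. Zero-initializing the LoRA B matrices makes the model start out exactly tied.

```python
# Illustrative Relaxed-Recursive-style layer: shared weights plus per-depth LoRA deltas.
import torch
import torch.nn as nn

class DepthwiseLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, num_depths, rank=8):
        super().__init__()
        self.shared = nn.Linear(d_in, d_out)                          # tied across all depths
        self.A = nn.Parameter(torch.randn(num_depths, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_depths, d_out, rank))   # zero init => starts fully tied

    def forward(self, x, depth):                                      # x: (..., d_in)
        delta = self.B[depth] @ self.A[depth]                         # (d_out, d_in), low rank
        return self.shared(x) + x @ delta.T

layer = DepthwiseLoRALinear(256, 256, num_depths=3)
x = torch.randn(2, 16, 256)
for depth in range(3):                                                # same block looped, LoRA varies by depth
    x = torch.relu(layer(x, depth))
print(x.shape)  # torch.Size([2, 16, 256])
```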

SALT
Ankit Singh Rawat, Veeranjaneyulu Sadhanala, Afshin Rostamizadeh, Ayan Chakrabarti, Wittawat Jitkrittum, Vladimir Feinberg, Seungyeon Kim, Hrayr Harutyunyan, Nikunj Saunshi, Zachary Nado, Rakesh Shivanna, Sashank J. Reddi, Aditya Krishna Menon, Rohan Anil, Sanjiv Kumar
A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs
arXiv preprint, [paper, bibTeX]

A primary challenge in large language model (LLM) development is their onerous pre-training cost. Typically, such pre-training involves optimizing a self-supervised objective (such as next-token prediction) over a large corpus. This paper explores a promising paradigm to improve LLM pre-training efficiency and quality by suitably leveraging a small language model (SLM). In particular, this paradigm relies on an SLM to both (1) provide soft labels as additional training supervision, and (2) select a small subset of valuable ("informative" and "hard") training examples. Put together, this enables an effective transfer of the SLM's predictive distribution to the LLM, while prioritizing specific regions of the training data distribution. Empirically, this leads to reduced LLM training time compared to standard training, while improving the overall quality. Theoretically, we develop a statistical framework to systematically study the utility of SLMs in enabling efficient training of high-quality LLMs. In particular, our framework characterizes how the SLM's seemingly low-quality supervision can enhance the training of a much more capable LLM. Furthermore, it also highlights the need for an adaptive utilization of such supervision, by striking a balance between the bias and variance introduced by the SLM-provided soft labels. We corroborate our theoretical framework by improving the pre-training of an LLM with 2.8B parameters by utilizing a smaller LM with 1.5B parameters on the Pile dataset.
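
The two roles of the small model are easy to sketch in code. The snippet below is a simplified, hypothetical rendering of the paradigm (not the paper's training recipe): the SLM's soft next-token distribution is distilled into the LLM, and its per-example losses are used to keep the harder, more informative examples.

```python
# Hedged sketch: distillation from SLM soft labels plus SLM-based example selection.
import torch
import torch.nn.functional as F

def salt_style_loss(llm_logits, slm_logits, targets, alpha=0.5, temperature=1.0):
    """Mix hard-label cross-entropy with distillation from the SLM's soft labels."""
    ce = F.cross_entropy(llm_logits, targets)
    soft_teacher = F.softmax(slm_logits / temperature, dim=-1)
    kd = F.kl_div(F.log_softmax(llm_logits / temperature, dim=-1),
                  soft_teacher, reduction="batchmean") * temperature ** 2
    return (1 - alpha) * ce + alpha * kd

def select_examples(slm_example_losses, keep_fraction=0.5):
    """One plausible selection rule: keep the examples the SLM finds hardest."""
    k = int(keep_fraction * len(slm_example_losses))
    return torch.topk(slm_example_losses, k).indices

batch, vocab = 8, 1000
loss = salt_style_loss(torch.randn(batch, vocab), torch.randn(batch, vocab),
                       torch.randint(vocab, (batch,)))
print(loss.item())
```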

Mimetic initialization helps state space models learn to recall
Asher Trockman, Hrayr Harutyunyan, J. Zico Kolter, Sanjiv Kumar, Srinadh Bhojanapalli
Mimetic Initialization Helps State Space Models Learn to Recall
ICLR 2025 Workshop on Weight Space Learning, [paper, bibTeX]

Recent work has shown that state space models such as Mamba are significantly worse than Transformers on recall-based tasks because their state size is constant with respect to their input sequence length. But in practice, state space models have fairly large state sizes, and we conjecture that they should be able to perform much better at these tasks than previously reported. We investigate whether their poor copying and recall performance could be due in part to training difficulties rather than fundamental capacity constraints. Based on observations of their "attention" maps, we propose a structured initialization technique that allows state space layers to more readily mimic attention. Across a variety of architecture settings, our initialization makes it substantially easier for Mamba to learn to copy and do associative recall from scratch.

In-context Learning in Presence of Spurious Correlations
Hrayr Harutyunyan, Rafayel Darbinyan, Samvel Karapetyan, Hrant Khachatrian
In-context Learning in Presence of Spurious Correlations
Under review at TMLR, [paper, bibTeX]

Large language models exhibit a remarkable capacity for in-context learning, where they learn to solve tasks given a few examples. Recent work has shown that transformers can be trained to perform simple regression tasks in-context. This work explores the possibility of training an in-context learner for classification tasks involving spurious features. We find that the conventional approach of training in-context learners is susceptible to spurious features. Moreover, when the meta-training dataset includes instances of only one task, the conventional approach leads to task memorization and fails to produce a model that leverages context for predictions. Based on these observations, we propose a novel technique to train such a learner for a given classification task. Remarkably, this in-context learner matches and sometimes outperforms strong methods like ERM and GroupDRO. However, unlike these algorithms, it does not generalize well to other tasks. We show that it is possible to obtain an in-context learner that generalizes to unseen tasks by training on a diverse dataset of synthetic in-context learning instances.
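
To illustrate the setting (this is my own toy construction, not the paper's data pipeline), an in-context learning instance can be built so that one feature correlates with the label in most context examples even though the true labeling rule ignores it; a robust in-context learner should not latch onto that feature.

```python
# Toy in-context classification instance with a spurious feature; illustrative only.
import numpy as np

def make_instance(n_context=32, d_core=8, spurious_corr=0.9, rng=np.random.default_rng(0)):
    w = rng.normal(size=d_core)                          # task-specific labeling rule over core features
    x_core = rng.normal(size=(n_context + 1, d_core))
    y = (x_core @ w > 0).astype(int)
    flip = rng.random(n_context + 1) > spurious_corr     # spurious feature agrees with y most of the time
    spurious = np.where(flip, 1 - y, y).astype(float)
    x = np.concatenate([x_core, spurious[:, None]], axis=1)
    # Context pairs (x_1, y_1), ..., (x_n, y_n) and a query x_{n+1} whose label must be predicted.
    return (x[:-1], y[:-1]), (x[-1], y[-1])

(context_x, context_y), (query_x, query_y) = make_instance()
print(context_x.shape, query_x.shape)  # (32, 9) (9,)
```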

Supervision complexity plot
Hrayr Harutyunyan, Ankit Singh Rawat, Aditya Krishna Menon, Seungyeon Kim, Sanjiv Kumar
Supervision Complexity and its Role in Knowledge Distillation
ICLR 2023, [paper, bibTeX]

Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. In order to study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions. Specifically, it provides a rigorous justification for the utility of various techniques that are prevalent in the context of distillation, such as early stopping and temperature scaling. Our analysis further suggests the use of online distillation, where a student receives increasingly more complex supervision from teachers in different stages of their training. We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
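
As a rough numerical illustration of the central quantity (my paraphrase; the paper's definition involves the student's NTK and the margin with respect to the teacher), a complexity of the form sqrt(y^T K^{-1} y / n) is small when the teacher's targets lie along the top eigen-directions of the student's kernel, i.e. when the supervision is easy for the student to fit.

```python
# Sketch: supervision aligned with the kernel has low complexity, noisy supervision does not.
import numpy as np

def supervision_complexity(K, y_teacher, ridge=1e-6):
    n = len(y_teacher)
    K_reg = K + ridge * np.eye(n)                        # small ridge for numerical stability
    return np.sqrt(y_teacher @ np.linalg.solve(K_reg, y_teacher) / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
K = X @ X.T                                              # stand-in for the student's (fixed) kernel
aligned_targets = X @ rng.normal(size=5)                 # supervision consistent with the kernel
noisy_targets = rng.normal(size=100)                     # supervision ignoring the kernel
print(supervision_complexity(K, aligned_targets))        # comparatively small
print(supervision_complexity(K, noisy_targets))          # much larger
```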

Limitations of sample-wise IT gen. bounds
Hrayr Harutyunyan, Greg Ver Steeg, Aram Galstyan
Formal limitations of sample-wise information-theoretic generalization bounds
IEEE Information Theory Workshop 2022 [arXiv, bibTeX]

Some of the tightest information-theoretic generalization bounds depend on the average information between the learned hypothesis and a single training example. However, these sample-wise bounds were derived only for the expected generalization gap. We show that even for the expected squared generalization gap, no such sample-wise information-theoretic bounds exist. The same is true for PAC-Bayes and single-draw bounds. Remarkably, PAC-Bayes, single-draw, and expected squared generalization gap bounds that depend on information in pairs of examples do exist.
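
For reference, the sample-wise bounds referred to above (for the expected generalization gap, assuming a σ-sub-Gaussian loss) have roughly the following form; the result in this paper is that no bound of this type can hold once the gap is squared inside the expectation.

```latex
% Schematic form of a sample-wise information-theoretic bound on the *expected*
% generalization gap; no analogue exists for the expected *squared* gap.
\[
  \bigl|\mathbb{E}\,[\mathrm{gen}(W, S)]\bigr|
  \;\le\; \frac{1}{n} \sum_{i=1}^{n} \sqrt{2\sigma^{2}\, I(W; Z_i)}
\]
```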

Domain generalization failure modes
Tigran Galstyan, Hrayr Harutyunyan, Hrant Khachatrian, Greg Ver Steeg, Aram Galstyan
Failure Modes of Domain Generalization Algorithms
CVPR 2022 [arXiv, code (1, 2), bibTeX]

We propose an evaluation framework for domain generalization algorithms that allows decomposition of the test error into components capturing distinct aspects of generalization. We show that the largest contributor to the generalization error varies across methods, datasets, regularization strengths and even training lengths. We observe two problems associated with the strategy of learning domain-invariant representations. On Colored MNIST, most domain generalization algorithms fail because they reach domain-invariance only on the training domains. On Camelyon-17, domain-invariance degrades the quality of representations on unseen domains. We hypothesize that focusing instead on tuning the classifier on top of a rich representation can be a promising direction.

f-CMI bounds
Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, Aram Galstyan
Information-theoretic generalization bounds for black-box learning algorithms
NeurIPS 2021 [arXiv, code, bibTeX]

We derive information-theoretic generalization bounds for supervised learning algorithms based on the information contained in predictions rather than in the output of the training algorithm. These bounds improve over the existing information-theoretic bounds, are applicable to a wider range of algorithms, and solve two key challenges: (a) they give meaningful results for deterministic algorithms and (b) they are significantly easier to estimate. We show experimentally that the proposed bounds closely follow the generalization gap in practical scenarios for deep learning.
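
Schematically (constants omitted, hence the "≲"), the prediction-based bounds measure information between the model's predictions on a supersample and the train/test selector variables, rather than between the weights and the data, which is what makes them easy to estimate for deterministic deep networks.

```latex
% Schematic shape of a prediction-based (f-CMI) generalization bound; constants omitted.
% \tilde{Z} is a supersample of paired examples and U_i selects which copy was trained on.
\[
  \mathbb{E}\,[\mathrm{gen}]
  \;\lesssim\; \frac{1}{n} \sum_{i=1}^{n}
  \sqrt{\, I\bigl( f(W, \tilde{Z})_i ;\, U_i \bigr) }
\]
```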

sample information
Hrayr Harutyunyan, Alessandro Achille, Giovanni Paolini, Orchid Majumder, Avinash Ravichandran, Rahul Bhotika, Stefano Soatto
Estimating informativeness of samples with smooth unique information
ICLR 2021 [arXiv, code, bibTeX]

We define a notion of information that an individual sample provides to the training of a neural network, and we specialize it to measure both how much a sample informs the final weights and how much it informs the function computed by the weights. Though related, we show that these quantities have a qualitatively different behavior. We give efficient approximations of these quantities using a linearized network and demonstrate empirically that the approximation is accurate for real-world architectures, such as pre-trained ResNets. We apply these measures to several problems, such as dataset summarization, analysis of under-sampled classes, comparison of informativeness of different data sources, and detection of adversarial and corrupted examples.
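
For a linear (or linearized) model the basic idea can be illustrated in a few lines (a toy leave-one-out comparison of my own; the smooth unique information in the paper is defined more carefully and computed with a linearized network rather than plain ridge regression).

```python
# Toy illustration: how much do the learned weights and predictions change without sample i?
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 5)), rng.normal(size=50)

def ridge_fit(X, y, lam=1e-2):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

w_full = ridge_fit(X, y)
for i in range(3):
    mask = np.arange(len(y)) != i
    w_loo = ridge_fit(X[mask], y[mask])                  # retrain with sample i left out
    print(f"sample {i}: weight shift {np.linalg.norm(w_full - w_loo):.4f}, "
          f"prediction shift {abs(X[i] @ (w_full - w_loo)):.4f}")
```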

avoiding memorization
Hrayr Harutyunyan, Kyle Reing, Greg Ver Steeg, Aram Galstyan
Improving generalization by controlling label-noise information in neural network weights
ICML 2020 [arXiv, code, bibTeX]

In the presence of noisy or incorrect labels, neural networks have the undesirable tendency to memorize information about the noise. We show that one can prevent memorization and improve generalization by controlling the Shannon mutual information between weights and the vector of all training labels given inputs, I(w ; y ∣ x). To minimize this information, we propose training algorithms that employ an auxiliary network that predicts gradients in the final layers of a classifier without accessing labels. Our approach yields drastic improvements over standard training algorithms (like cross-entropy loss), and outperforms competitive approaches designed for learning with noisy labels.
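
A heavily simplified sketch of the mechanism (my illustration, not the paper's algorithm; in particular, how the auxiliary network itself is trained is omitted): gradients for the classifier head are not computed from the possibly noisy labels, but predicted by an auxiliary network that never sees them, which limits how much label information can leak into the weights.

```python
# Sketch: backpropagate a label-free predicted gradient into the classifier head.
import torch
import torch.nn as nn

d, k = 32, 10
features = nn.Linear(d, 64)
classifier = nn.Linear(64, k)
grad_predictor = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, k))  # sees no labels

x = torch.randn(8, d)
z = torch.relu(features(x))
logits = classifier(z)

# The standard gradient w.r.t. the logits would be softmax(logits) - onehot(y), which
# depends on y; here a surrogate gradient is predicted from label-free information.
predicted_grad = grad_predictor(z.detach())
surrogate_loss = (logits * predicted_grad.detach()).sum()  # injects predicted_grad as d(loss)/d(logits)
surrogate_loss.backward()
print(classifier.weight.grad.shape)  # torch.Size([10, 64])
```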

modular latent factor model
Greg Ver Steeg, Hrayr Harutyunyan, Daniel Moyer, Aram Galstyan
Fast structure learning with modular regularization
NeurIPS 2019 [arXiv, code, bibTeX]

We introduce a method, called linear CorEx, for learning latent factors such that the joint probability distribution becomes close to being a modular latent factor model (shown in the picture). The method has linear complexity w.r.t. the number of observed variables and works well in high-dimensional undersampled regimes. Furthermore, when the data comes from an approximately modular Gaussian latent factor model, linear CorEx exhibits a blessing of dimensionality!
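
A modular latent factor model is simple to write down generatively (toy sketch below; the actual linear CorEx objective is based on total correlation rather than this explicit construction): every observed variable depends on exactly one independent latent factor, so the latents fully explain the dependence among the observed variables.

```python
# Toy generative sketch of a modular latent factor model; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_latent, per_factor = 500, 5, 20
Z = rng.normal(size=(n_samples, n_latent))              # independent latent factors
parent = np.repeat(np.arange(n_latent), per_factor)     # each observed variable has a single latent parent
weights = rng.normal(size=n_latent * per_factor)
noise = 0.3 * rng.normal(size=(n_samples, n_latent * per_factor))
X = Z[:, parent] * weights + noise                      # observed variables, dependent only through their parent
print(X.shape)  # (500, 100)
```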

temporal CorEx
Hrayr Harutyunyan, Daniel Moyer, Hrant Khachatrian, Greg Ver Steeg, Aram Galstyan
Efficient Covariance Estimation from Temporal Data
arXiv preprint [arXiv, code, bibTeX]

In this work we extend linear CorEx to work with temporal data. The main method -- T-CorEx -- takes a multivariate time series divided into time periods and models the data of each period with an instance of linear CorEx, such that the models vary smoothly over time. The method can be used for estimating the covariance matrix of the observed variables at each time period, clustering time series, detecting change points, and extracting useful information. All of these analyses can be done in less than an hour even when the data is truly high-dimensional (like an fMRI instance with 10^5 variables and 20 time periods).

mixhop architecture
Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard, Kristina Lerman, Greg Ver Steeg, Aram Galstyan
Mixhop: Higher-order graph convolution architectures via sparsified neighborhood mixing
ICML 2019 [arXiv, code, bibTeX]

This paper proposes a new graph convolutional network (GCN), called MixHop, that, in contrast to the vanilla GCN, is able to learn a general class of neighborhood mixing relationships. MixHop requires no additional memory or computational complexity. Additionally, the paper proposes a sparsity regularization that allows us to visualize how the network prioritizes neighborhood information across different graph datasets.
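
The core operation is mixing different powers of the normalized adjacency matrix in one layer; a minimal NumPy sketch (my illustration, not the authors' implementation, and with a crude stand-in for the usual symmetric normalization) is shown below.

```python
# Illustrative MixHop-style layer: concatenate features propagated over different hop counts.
import numpy as np

def mixhop_layer(A_hat, H, weights, powers=(0, 1, 2)):
    """A_hat: normalized adjacency (n, n); H: node features (n, d); weights: dict power -> (d, d_out)."""
    outputs = []
    for j in powers:
        propagated = np.linalg.matrix_power(A_hat, j) @ H        # j-hop neighborhood aggregation
        outputs.append(np.maximum(propagated @ weights[j], 0))   # ReLU
    return np.concatenate(outputs, axis=1)                       # column-wise concatenation

n, d, d_out = 6, 8, 4
rng = np.random.default_rng(0)
A = (rng.random((n, n)) < 0.4).astype(float)
A_hat = (A + A.T + np.eye(n)) / n                                # crude stand-in for D^-1/2 (A + I) D^-1/2
H = rng.normal(size=(n, d))
W = {j: rng.normal(size=(d, d_out)) for j in (0, 1, 2)}
print(mixhop_layer(A_hat, H, W).shape)  # (6, 12)
```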

mimic benchmarking
Hrayr Harutyunyan, Hrant Khachatrian, David Kale, Greg Ver Steeg, Aram Galstyan
Multitask learning and benchmarking with clinical time series data
Scientific Data 6 (1), 96 [arXiv, code, bibTeX]

The progress in machine learning for healthcare research has been difficult to measure because of the absence of publicly available benchmark data sets. To address this problem, we propose four clinical prediction benchmarks using data derived from the publicly available MIMIC-III database. These tasks cover a range of clinical problems including modeling risk of mortality, forecasting length of stay, detecting physiologic decline, and phenotype classification. We propose strong linear and neural baselines for all four tasks and evaluate the effect of deep supervision, multitask training and data-specific architectural modifications on the performance of neural models.

disentangling via synergy minimization
Greg Ver Steeg, Rob Brekelmans, Hrayr Harutyunyan, Aram Galstyan
Disentangled representations via synergy minimization
Allerton 2017 [arXiv, bibTeX]

If the factors comprising a representation allow us to make accurate predictions about our system, but obscuring any subset of the factors destroys our ability to make predictions, we say that the representation exhibits informational synergy. We argue that synergy is an undesirable feature in learned representations and that explicitly minimizing synergy can help disentangle the true factors of variation underlying data. We explore different ways of quantifying synergy, deriving new closed-form expressions in some cases, and then show how to modify learning to produce representations that are minimally synergistic. We introduce a benchmark task to disentangle separate characters from images of words. We demonstrate that Minimally Synergistic (MinSyn) representations correctly disentangle characters while methods relying on statistical independence fail.