Reinforcement learning

continuous state-action space

Taming "data-hungry" reinforcement learning? Stability in continuous state-action spaces

Duan, Yaqi, and Wainwright, Martin J.

Advances in Neural Information Processing Systems (NeurIPS) 2024

Abs arXiv

We introduce a novel framework for analyzing reinforcement learning (RL) in continuous state-action spaces, and use it to prove fast rates of convergence in both off-line and on-line settings. Our analysis highlights two key stability properties, relating to how changes in value functions and/or policies affect the Bellman operator and occupation measures. We argue that these properties are satisfied in many continuous state-action Markov decision processes, and demonstrate how they arise naturally when using linear function approximation methods. Our analysis offers fresh perspectives on the roles of pessimism and optimism in off-line and on-line RL, and highlights the connection between off-line RL and transfer learning.

policy evaluation

Optimal policy evaluation using kernel-based temporal difference methods

Duan, Yaqi, Wang, Mengdi, and Wainwright, Martin J.

The Annals of Statistics 2024

Abs arXiv

We study non-parametric methods for estimating the value function of an infinite-horizon discounted Markov reward process (MRP). We establish non-asymptotic bounds on the statistical error of a kernel-based least-squares temporal difference (LSTD) estimate, which can be understood either as a non-parametric instrumental variables method, or as a projected approximation to the Bellman fixed point equation. Our analysis imposes no assumptions on the transition operator of the Markov chain, but rather only conditions on the reward function and population-level kernel LSTD solutions. Using empirical process theory and concentration inequalities, we establish a non-asymptotic upper bound on the error with explicit dependence on the effective horizon H = 1/(1-γ) of the Markov reward process, the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of MRPs, which shows that our rate is optimal in terms of the sample size n and the effective horizon H. Whereas existing worst-case theory predicts cubic scaling (H^3) in the effective horizon, our theory reveals that there is in fact a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman residual error. Notably, it is only parametric and near-parametric problems that can ever achieve the worst-case cubic scaling.
Policy evaluation from a single path: Multi-step methods, mixing and mis-specification

Duan, Yaqi, and Wainwright, Martin J.

arXiv Preprint 2022

Abs arXiv

We study non-parametric estimation of the value function of an infinite-horizon γ-discounted Markov reward process (MRP) using observations from a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including canonical K-step look-ahead TD for K = 1, 2, ... and the TD(λ) family for λ∈[0,1) as special cases. Our bounds capture its dependence on Bellman fluctuations, mixing time of the Markov chain, any mis-specification in the model, as well as the choice of weight function defining the estimator itself, and reveal some delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, its statistical error under trajectory data is similar to that of i.i.d. sample transition pairs, whereas under mis-specification, temporal dependence in data inflates the statistical error. However, any such deterioration can be mitigated by increased look-ahead. We complement our upper bounds by proving minimax lower bounds that establish optimality of TD-based methods with appropriately chosen look-ahead and weighting, and reveal some fundamental differences between value function estimation and ordinary non-parametric regression.
Minimax-optimal off-policy evaluation with linear function approximation

Duan, Yaqi, and Wang, Mengdi

International Conference on Machine Learning (ICML) 2020

Abs arXiv

This paper studies the statistical theory of batch data reinforcement learning with function approximation. Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history generated by unknown behavioral policies. We study a regression-based fitted Q iteration method, and show that it is equivalent to a model-based method that estimates a conditional mean embedding of the transition operator. We prove that this method is information-theoretically optimal and has nearly minimal estimation error. In particular, by leveraging contraction property of Markov processes and martingale concentration, we establish a finite-sample instance-dependent error upper bound and a nearly-matching minimax lower bound. The policy evaluation error depends sharply on a restricted χ^2-divergence over the function class between the long-term distribution of the target policy and the distribution of past data. This restricted χ^2-divergence is both instance-dependent and function-class-dependent. It characterizes the statistical limit of off-policy evaluation. Further, we provide an easily computable confidence bound for the policy evaluator, which may be useful for optimistic planning and safe policy improvement.
A finite-sample analysis of multi-step temporal difference estimates

Duan, Yaqi, and Wainwright, Martin J.

Learning for Dynamics and Control Conference (L4DC) 2023

Abs HTML

We consider the problem of estimating the value function of an infinite-horizon γ-discounted Markov reward process (MRP). We establish non-asymptotic guarantees for a general family of multi-step temporal difference (TD) estimates, including canonical K-step look-ahead TD for K=1,2,... and the TD(λ) family for λ∈[0,1) as special cases. Our bounds capture the dependence of these estimates on both the variance as defined by Bellman fluctuations, and the bias arising from possible model mis-specification. Our results reveal that the variance component shows limited sensitivity to the choice of look-ahead defining the estimator itself, while increasing the look-ahead can reduce the bias term. This highlights the benefit of using a larger look-ahead: it reduces bias but need not increase the variance.
Bootstrapping fitted Q-evaluation for off-policy inference

Hao, B., Ji, X., Duan, Y., Lu, H., Szepesvári, C., and Wang, M.

International Conference on Machine Learning (ICML) 2021

Abs arXiv

Bootstrapping provides a flexible and effective approach for assessing the quality of batch reinforcement learning, yet its theoretical property is less understood. In this paper, we study the use of bootstrapping in off-policy evaluation (OPE), and in particular, we focus on the fitted Q-evaluation (FQE) that is known to be minimax-optimal in the tabular and linear-model cases. We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is asymptotically efficient and distributionally consistent for off-policy statistical inference. To overcome the computation limit of bootstrapping, we further adapt a subsampling procedure that improves the runtime by an order of magnitude. We numerically evaluate the bootrapping method in classical RL environments for confidence interval estimation, estimating the variance of off-policy evaluator, and estimating the correlation between multiple off-policy evaluators.

off-line reinforcement learning

Risk bounds and Rademacher complexity in batch reinforcement learning

Duan, Yaqi, Jin, Chi, and Li, Zhiyuan (α-β)

International Conference on Machine Learning (ICML) 2021

Abs arXiv

This paper considers batch reinforcement learning (RL) with general value function approximation. Our study investigates the minimal assumptions to reliably estimate/minimize Bellman error, and characterizes the generalization performance by (local) Rademacher complexities of general function classes, which makes initial steps in bridging the gap between statistical learning theory and batch RL. Concretely, we view the Bellman error as a surrogate loss for the optimality gap, and prove the followings: (1) In double sampling regime, the excess risk of Empirical Risk Minimizer (ERM) is bounded by the Rademacher complexity of the function class. (2) In the single sampling regime, sample-efficient risk minimization is not possible without further assumptions, regardless of algorithms. However, with completeness assumptions, the excess risk of FQI and a minimax style algorithm can be again bounded by the Rademacher complexity of the corresponding function classes. (3) Fast statistical rates can be achieved by using tools of local Rademacher complexity. Our analysis covers a wide range of function classes, including finite classes, linear spaces, kernel spaces, sparse linear features, etc.
Near-optimal offline reinforcement learning with linear representation: leveraging variance information with pessimism

Yin, Ming, Duan, Yaqi, Wang, Mengdi, and Wang, Yu-Xiang

International Conference on Learning Representations (ICLR) 2022

Abs HTML

Offline reinforcement learning, which seeks to utilize offline/historical data to optimize sequential decision-making strategies, has gained surging prominence in recent studies. Due to the advantage that appropriate function approximators can help mitigate the sample complexity burden in modern reinforcement learning problems, existing endeavors usually enforce powerful function representation models (e.g. neural networks) to learn the optimal policies. However, a precise understanding of the statistical limits with function representations, remains elusive, even when such a representation is linear. Towards this goal, we study the statistical limits of offline reinforcement learning with linear model representations. To derive the tight offline learning bound, we design the variance-aware pessimistic value iteration (VAPVI), which adopts the conditional variance information of the value function for time-inhomogeneous episodic linear Markov decision processes (MDPs). VAPVI uses the estimated variances of the value functions to reweight the Bellman residuals in the least-square pessimistic value iteration and provides improved offline learning bounds over the existing best-known results (whereas the Bellman residuals are equally weighted). More importantly, our learning bounds are expressed in terms of system quantities, which provide natural instance-dependent characterizations that previous results are short of. We hope our results draw a clearer picture of what offline learning should look like when linear representations are provided.
Sparse feature selection makes batch reinforcement learning more sample efficient

Hao, B., Duan, Y., Lattimore, T., Szepesvári, C., and Wang, M.

International Conference on Machine Learning (ICML) 2021

Abs arXiv

This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation. When there is a large number of candidate features, our result sheds light on the fact that sparsity-aware methods can make batch RL more sample efficient. We first consider the off-policy policy evaluation problem. To evaluate a new target policy, we analyze a Lasso fitted Q-evaluation method and establish a finite-sample error bound that has no polynomial dependence on the ambient dimension. To reduce the Lasso bias, we further propose a post model-selection estimator that applies fitted Q-evaluation to the features selected via group Lasso. Under an additional signal strength assumption, we derive a sharper instance-dependent error bound that depends on a divergence function measuring the distribution mismatch between the data distribution and occupancy measure of the target policy. Further, we study the Lasso fitted Q-iteration for batch policy optimization and establish a finite-sample error bound depending on the ratio between the number of relevant features and restricted minimal eigenvalue of the data’s covariance. In the end, we complement the results with minimax lower bounds for batch-data policy evaluation/optimization that nearly match our upper bounds. The results suggest that having well-conditioned data is crucial for sparse batch policy learning.

Multi-task learning

multi-task learning

Adaptive and robust multi-task learning

Duan, Yaqi, and Wang, Kaizheng (α-β)

The Annals of Statistics 2023

Abs arXiv

We study the multi-task learning problem that aims to simultaneously analyze multiple datasets collected from different sources and learn one model for each of them. We propose a family of adaptive methods to automatically utilize possible similarities among those tasks while carefully handling their differences. We derive optimal statistical guarantees for the methods and prove their robustness against outlier tasks. Numerical experiments on synthetic and real datasets demonstrate the efficacies of our new methods.

Dimensionality reduction

state aggregation

State aggregation learning from Markov transition data

Duan, Yaqi, Ke, Zheng (Tracy), and Wang, Mengdi

Advances in Neural Information Processing Systems (NeurIPS) 2019

Abs arXiv

State aggregation is a popular model reduction method rooted in optimal control. It reduces the complexity of engineering systems by mapping the system’s states into a small number of meta-states. The choice of aggregation map often depends on the data analysts’ knowledge and is largely ad hoc. In this paper, we propose a tractable algorithm that estimates the probabilistic aggregation map from the system’s trajectory. We adopt a soft-aggregation model, where each meta-state has a signature raw state, called an anchor state. This model includes several common state aggregation models as special cases. Our proposed method is a simple two-step algorithm: The first step is spectral decomposition of empirical transition matrix, and the second step conducts a linear transformation of singular vectors to find their approximate convex hull. It outputs the aggregation distributions and disaggregation distributions for each meta-state in explicit forms, which are not obtainable by classical spectral methods. On the theoretical side, we prove sharp error bounds for estimating the aggregation and disaggregation distributions and for identifying anchor states. The analysis relies on a new entry-wise deviation bound for singular vectors of the empirical transition matrix of a Markov process, which is of independent interest and cannot be deduced from existing literature. The application of our method to Manhattan traffic data successfully generates a data-driven state aggregation map with nice interpretations.
Adaptive low-nonnegative-rank approximation for state aggregation of Markov chains

Duan, Yaqi, Wang, Mengdi, Wen, Zaiwen, and Yuan, Yaxiang

SIAM Journal on Matrix Analysis and Applications 2020

Abs arXiv

This paper develops a low-nonnegative-rank approximation method to identify the state aggregation structure of a finite-state Markov chain under an assumption that the state space can be mapped into a handful of meta-states. The number of meta-states is characterized by the nonnegative rank of the Markov transition matrix. Motivated by the success of the nuclear norm relaxation in low rank minimization problems, we propose an atomic regularizer as a convex surrogate for the nonnegative rank and formulate a convex optimization problem. Because the atomic regularizer itself is not computationally tractable, we instead solve a sequence of problems involving a nonnegative factorization of the Markov transition matrices by using the proximal alternating linearized minimization method. Two methods for adjusting the rank of factorization are developed so that local minima are escaped. One is to append an additional column to the factorized matrices, which can be interpreted as an approximation of a negative subgradient step. The other is to reduce redundant dimensions by means of linear combinations. Overall, the proposed algorithm very likely converges to the global solution. The efficiency and statistical properties of our approach are illustrated on synthetic data. We also apply our state aggregation algorithm on a Manhattan transportation data set and make extensive comparisons with an existing method.

state embedding

Learning low-dimensional state embeddings and metastable clusters from time series data

Sun, Yifan, Duan, Yaqi, Gong, Hao, and Wang, Mengdi

Advances in Neural Information Processing Systems (NeurIPS) 2019

Abs arXiv

This paper studies how to find compact state embeddings from high-dimensional Markov state trajectories, where the transition kernel has a small intrinsic rank. In the spirit of diffusion map, we propose an efficient method for learning a low-dimensional state embedding and capturing the process’s dynamics. This idea also leads to a kernel reshaping method for more accurate nonparametric estimation of the transition function. State embedding can be used to cluster states into metastable sets, thereby identifying the slow dynamics. Sharp statistical error bounds and misclassification rate are proved. Experiment on a simulated dynamical system shows that the state clustering method indeed reveals metastable structures. We also experiment with time series generated by layers of a Deep-Q-Network when playing an Atari game. The embedding method identifies game states to be similar if they share similar future events, even though their raw data are far different.
Learning good state and action representations for Markov decision process via tensor decomposition

Ni, Chengzhuo, Duan, Yaqi, Dahleh, Munther, Wang, Mengdi, and Zhang, Anru R

Journal of Machine Learning Research 2023

Abs HTML

The transition kernel of a continuous-state-action Markov decision process (MDP) admits a natural tensor structure. This paper proposes a tensor-inspired unsupervised learning method to identify meaningful low-dimensional state and action representations from empirical trajectories. The method exploits the MDP’s tensor structure by kernelization, importance sampling and low-Tucker-rank approximation. This method can be further used to cluster states and actions respectively and find the best discrete MDP abstraction. We provide sharp statistical error bounds for tensor concentration and the preservation of diffusion distance after embedding. We further prove that the learned state/action abstractions provide accurate approximations to latent block structures if they exist, enabling function approximation in downstream tasks such as policy evaluation.
Learning good state and action representations via tensor decomposition

Ni, Chengzhuo, Zhang, Anru, Duan, Yaqi, and Wang, Mengdi

IEEE International Symposium on Information Theory (ISIT) 2021

Abs arXiv

The transition kernel of a continuous-state-action Markov decision process (MDP) admits a natural tensor structure. This paper proposes a tensor-inspired unsupervised learning method to identify meaningful low-dimensional state and action representations from empirical trajectories. The method exploits the MDP’s tensor structure by kernelization, importance sampling and low-Tucker-rank approximation. This method can be further used to cluster states and actions respectively and find the best discrete MDP abstraction. We provide sharp statistical error bounds for tensor concentration and the preservation of diffusion distance after embedding.