diff --git a/jump_lin_id/id_paper.tex b/jump_lin_id/id_paper.tex index 5ffbb9b2216a0eae4bd4120ae4fc30d09e73219f..227d7529c07563a10a7c225f6bd87e692ae63825 100644 --- a/jump_lin_id/id_paper.tex +++ b/jump_lin_id/id_paper.tex @@ -59,14 +59,14 @@ \definecolor{linecolor4}{rgb}{0.6,0,0.5} %% Include comments -%\newcommand{\recmt}[1]{{\color{red}{\textbf{Reviewers comment:} #1}}} +\newcommand{\recmt}[1]{{\color{red}{\textbf{Reviewers comment:} #1}}} \newcommand{\cmt}[1]{{\color{yellow}{\textbf{Comment:} #1}}} %\newcommand{\rcmt}[1]{Rolfs comment: {\color{red}{ #1}}} %\newcommand{\revised}[3]{{\color{gray} #1}{\color{green} \textbf{Added:} #2}{\color{yellow}\textbf{Bagge's comment:} #3} } %\usepackage[utf8]{inputenc} %% Exclude comments -\newcommand{\recmt}[1]{} +% \newcommand{\recmt}[1]{} %\newcommand{\cmt}[1]{} \newcommand{\rcmt}[1]{} \newcommand{\revised}[3]{#2} @@ -152,7 +152,7 @@ SE22100 Lund Sweden\protect\\ \pagestyle{empty} \begin{abstract} -We review some classical methods for offline identification of linear, time-varying (LTV) dynamical models and put them into a larger, more general context where we introduce a number of new techniques. Specifically, we introduce methods for identification of linear dynamical systems where the change in dynamics can be either discontinuous or smooth and where the number of changes in the dynamics parameters may be unknown or specified. We further discuss the introduction of priors on the model parameters for increased statistical power in situations where excitation is insufficient for identification. The identification problems are cast as convex optimization problems and are applicable to, e.g., ARX-models and state-space models with time-varying parameters. We illustrate usage of the methods in simulations and in an application of model-based reinforcement learning. +We review some classical methods for offline identification of linear, time-varying (LTV) dynamical models and put them into a larger, more general context where we introduce a number of new techniques. Specifically, we introduce methods for identification of linear dynamical systems where the change in dynamics can be either discontinuous or smooth and where the number of changes in the dynamics parameters may be unknown or specified. We further discuss the introduction of priors on the model parameters for increased statistical power in situations where excitation is insufficient for identification. The identification problems are cast as convex optimization problems and are applicable to, e.g., ARX-models and state-space models with time-varying parameters. We illustrate usage of the methods in simulations and in an application of model-based reinforcement learning. \cmt{We establish a connection between trend filtering and system identification which results in novel sys id methods.} \end{abstract} \begin{keywords} System Identification, Piecewise Linear Modeling, Trend Filtering, Kalman Smoothing, Time-Series Segmentation, Reinforcement Learning. @@ -168,9 +168,9 @@ The difficulty of the task of identifying time-varying dynamical models of syste Identification of systems with non-smooth dynamics evolution has been studied extensively. The book~\cite{costa2006discrete} treats the case where the dynamics are known, but the state sequence unknown. In~\cite{nagarajaiah2004time}, the authors examine the residuals from an initial constant dynamics fit to determine regions in time where improved fit is needed by the introduction of additional constant dynamics models. 
Results on identifiability and observability in jump-linear systems in the non-controlled setting are available in~\cite{vidal2002observability}. The main result on identifiability was a rank condition on a Hankel matrix constructed from the collected output data, similar to classical results on the least-squares identification of ARX-models which appear as rank constraints on the, typically Toeplitz or block-Toeplitz, regressor matrix. Identifiability of the methods proposed in this article is discussed in~\cref{sec:identifiability}. -An important class of identification methods that have been popularized lately are \emph{trend filtering} methods~\cite{kim2009ell_1, tibshirani2014adaptive}. Trend filtering methods work by specifying a \emph{fitness criterion} that determines the goodness of fit, as well as a \emph{regularization} term, often chosen with sparsity promoting qualities. As a simple example, consider the reconstruction $\hat y$ of a noisy signal $y$ with piecewise constant segments. To this end, we might formulate and solve the convex optimization problem +An important class of identification methods that have been popularized lately are \emph{trend filtering} methods~\cite{kim2009ell_1, tibshirani2014adaptive}. Trend filtering methods work by specifying a \emph{fitness criterion} that determines the goodness of fit, as well as a \emph{regularization} term, often chosen with sparsity promoting qualities. As a simple example, consider the reconstruction $\hat y$ of a noisy signal $y = \left\{y_t\inspace{}\right\}_{t=1}^T$ with piecewise constant segments. To this end, we might formulate and solve the convex optimization problem \begin{equation} \label{eq:tf} - \minimize{\hat{\y}} \normt{\y-\hat{\y}}^2 + \lambda\sum_t |\hat{\y}_{t+1} - \hat{\y}_t| + \minimize{\hat{y}} \normt{y-\hat{y}}^2 + \lambda\sum_t |\hat{y}_{t+1} - \hat{y}_t| \end{equation} The first term is the fitness criterion or \emph{loss function}, whereas the second term is a sparsity-promoting regularizer which promotes a solution with few changes between consecutive samples in the reconstructed signal. The sparsity promoting effect of the 1-norm regularizer is well known, and stems from the constant size of the gradient whenever the argument is non-zero~\cite{murphy2012machine}. Compare this to the squared difference, for which the gradient rapidly vanishes as the argument approaches zero. The squared difference will thus promote \emph{small} arguments, whereas the 1-norm promotes \emph{sparse} arguments. @@ -182,7 +182,7 @@ We start by considering the case of identification of the parameters in an LTI m \begin{equation} x_{t+1} = A x_t + B u_t + v_t, \quad t \in [1,T] \end{equation} -where $x$ and $u$ are the state and input respectively. If the state and input sequences are known, a plethora of methods for estimating the parameters exists. A common method for identification of systems that are linear in the parameters is the least-squares (LS) method, which in case of Gaussian noise, $v$, coincides with the maximum likelihood (ML) estimate. To facilitate estimation using the LS method, we write the model on the form $\y = \A\w$, and arrange the data according to +where $x\inspace{n}$ and $u\inspace{m}$ are the state and input respectively. If the state and input sequences are known, a plethora of methods for estimating the parameters exists.
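As a concrete illustration of the least-squares approach described next, the following minimal sketch recovers $A$ and $B$ from recorded state and input sequences; the dimensions, system matrices, and noise level are illustrative only and are not taken from the paper.
\begin{verbatim}
import numpy as np

# Illustrative LTI system (n = 2 states, m = 1 input); values are made up.
n, m, T = 2, 1, 100
rng = np.random.default_rng(0)
A_true = np.array([[0.9, 0.1], [0.0, 0.8]])
B_true = np.array([[0.0], [1.0]])

# Simulate x_{t+1} = A x_t + B u_t + v_t with small Gaussian noise v_t.
X = np.zeros((T + 1, n))
U = rng.standard_normal((T, m))
for t in range(T):
    X[t + 1] = A_true @ X[t] + B_true @ U[t] + 0.01 * rng.standard_normal(n)

# Each regressor row is [x_t, u_t]; the regression targets are x_{t+1}.
Phi = np.hstack([X[:-1], U])                       # shape (T, n + m)
Theta, *_ = np.linalg.lstsq(Phi, X[1:], rcond=None)
A_hat, B_hat = Theta[:n].T, Theta[n:].T            # LS estimates of A and B
\end{verbatim}
The stacked targets and regressors above play the role of $\y$ and $\A$ in the construction below, up to the vectorization used in the paper.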
A common method for identification of systems that are linear in the parameters is the least-squares (LS) method, which, in the case of Gaussian noise $v$, coincides with the maximum likelihood (ML) estimate. To facilitate estimation using the LS method, we write the model in the form $\y = \A\w$, and arrange the data according to \begin{align*} \y &= \begin{bmatrix} @@ -222,7 +222,8 @@ where the parameters $\w$ are assumed to evolve according to the dynamical syste y_t &= \big(I_n \otimes \bmatrixx{x\T & u\T}\big) \w_t \end{split} \end{equation} -where, if no prior knowledge is available, the dynamics matrix $H$ can be taken as the identity matrix. +\recmt{This eq not clear. Motivate noise terms better.} \cmt{Connect noise terms to MLE} +where, if no prior knowledge is available, the dynamics matrix $H_t$ can be taken as the identity matrix. The evolution of the dynamics is thus governed by the emission density function $p_v(x_{t+1} | x_t, u_t, k_t)$ and a state transition density function $p_w(k_{t+1}|k_t)$. The density $p_v$ determines the distribution of measurements, given the current state of the system, whereas $p_w$ governs the transition of the parameter vector between time steps. We emphasize here that the state in the parameter evolution model refers to the current parameters $k_t$ and not the system state $x_t$, hence, $p_v$ is called the emission density and not the transition density. Particular choices of $p_v$ and $p_w$ emit data likelihoods concave in the parameters and hence amenable to convex optimization. @@ -230,7 +231,7 @@ We emphasize here that the state in the parameter evolution model refers to the The following sections will introduce a number of optimization problems with different regularization functions and regularization arguments, corresponding to different choices of $p_w$, and discuss the quality of the resulting identification. -\subsection{Slow time evolution} +\subsection{Slow time evolution \cmt{Low frequency time evolution}} A slowly varying signal is characterized by \emph{small first-order time differences}. To identify slowly varying dynamics parameters, we thus penalize the squared 2-norm of the first-order time difference of the model parameters, and solve the optimization problem \begin{equation} \label{eq:slow} @@ -239,10 +240,11 @@ To identify slowly varying dynamics parameters, we thus penalize the squared 2-n where $\sum_t$ denotes the sum over meaningful indices $t$, in this case $t\in [1,T-1]$. This optimization problem has a closed-form solution\footnote{The computational complexity $\mathcal{O}\big((TK)^3\big)$ of computing $k^*$ using the closed-form solution as opposed to an iterative method becomes prohibitive for all but toy problems.} \begin{align}\label{eq:closedform} - \tilde{\w}^* &= (\tilde{\A}\T\tilde{\A} + \lambda D_1\T D_1)^{-1}\tilde{\A}\T \tilde{y}\\ + \tilde{\w}^* &= (\tilde{\A}\T\tilde{\A} + \lambda D_1\T D_1)^{-1}\tilde{\A}\T \tilde{Y}\\ \tilde{\w} &= \operatorname{vec}(\w_1, ...\,, \w_T)\nonumber \end{align} -where $\tilde{\A}$ and $\tilde{y}$ are appropriately constructed matrices and the first-order differentiation operator matrix $D_1$ is constructed such that $\lambda\normt{D_1 \tilde{\w}}^2$ equals the second term in~\labelcref{eq:slow}.
+\recmt{$\tilde{y}$ for a matrix is confusing} +where $\tilde{\A}$ and $\tilde{Y}$ are appropriately constructed matrices and the first-order differentiation operator matrix $D_1$ is constructed such that $\lambda\normt{D_1 \tilde{\w}}^2$ equals the second term in~\labelcref{eq:slow}. \Cref{eq:slow} is the negative data log-likelihood of a Brownian random-walk parameter model for which a Kalman smoother also returns the optimal solution, see~\cref{sec:kalmanmodel} for details and extensions. \subsection{Smooth time evolution} @@ -271,7 +273,7 @@ which results in a dynamics evolution with sparse changes in the coefficients, b Dynamics may change abruptly as a result of, e.g., system failure, change of operating mode, or when a sudden disturbance enters the system, such as a policy change affecting a market or a window opening affecting the indoor temperature. This identification method can be employed to identify when such changes occur, without specifying a priori how many changes are expected. \subsubsection{Implementation} -Due to the non-squared norm penalty $\sum_t \normt{ \w_{t+1} - \w_t}$, this problem is significantly harder to solve than \labelcref{eq:smooth}. An efficient implementation using the ADMM algorithm is made available in the accompanying repository. +Due to the non-squared norm penalty $\sum_t \normt{ \w_{t+1} - \w_t}$, problem \labelcref{eq:pwconstant} is significantly harder to solve than \labelcref{eq:smooth}. An efficient implementation using the ADMM algorithm is made available in the accompanying repository. \subsection{Piecewise constant time evolution with known number of steps} If the number of switches in dynamics parameters, $M$, is known in advance, the @@ -282,7 +284,7 @@ If the number of switches in dynamics parameters, $M$, is known in advance, the & \subjto & & \sum_t \textbf{1}\{ \w_{t+1} \neq \w_t\} \leq M \end{align} where $\textbf{1}\{\cdot\}$ is the indicator function. -This problem is non-convex, but admits an exact solution through dynamic programming (DP), with an extended version of \cite{bellman1961approximation}, an algorithm frequently referred to as segmented least-squares~\cite{bellman1969curve}. The extension lies in the association of each segment with a dynamics model, as opposed to a simple straight line.\footnote{Indeed, if a simple integrator is chosen as dynamics model and a constant input is assumed, the result of our extended algorithm reduces to the segmented least-squares solution.} Unfortunately, the computational complexity of the dynamic programming solution, $\mathcal{O}(T^2K^3)$, becomes prohibitive for large $T$.\footnote{For details regarding the DP algorithm and implementation, the reader is referred to the source-code repository accompanying this article.} +This problem is non-convex, but admits an exact solution through dynamic programming (DP), using an extended version of \cite{bellman1961approximation}, an algorithm frequently referred to as segmented least-squares~\cite{bellman1969curve}.
The extension lies in associating each segment with a full dynamics model, rather than with a simple straight line \recmt{sentence not clear}.\footnote{Indeed, if a simple integrator is chosen as dynamics model and a constant input is assumed, the result of our extended algorithm reduces to the segmented least-squares solution.} Unfortunately, the computational complexity of the dynamic programming solution, $\mathcal{O}(T^2K^3)$, becomes prohibitive for large $T$.\footnote{For details regarding the DP algorithm and implementation, the reader is referred to the source-code repository accompanying this article.} \subsection{Piecewise linear time evolution}\label{eq:pwlinear} A piecewise linear signal is characterized by a \emph{sparse second-order time difference}, i.e., it has a small number of changes in the slope. A piecewise linear time-evolution of the dynamics parameters is hence obtained if we solve the optimization problem. @@ -328,6 +330,7 @@ To identify the points at which the dynamics change, we observe the relevant sig where non-zero values of $a_t$ indicate change points. \section{Dynamics prior and Kalman filtering} +\recmt{Reviewer found this section rushed} The identifiability of the parameters in a dynamical model hinges on the observability of the dynamics system~\labelcref{eq:dynsys}, or more explicitly, only modes excited by the input $u$ will be satisfactorily identified. If the identification is part of an iterative learning and control scheme, e.g., ILC or reinforcement learning, it might be undesirable to introduce additional noise in the input to improve excitation for identification. This section will introduce prior information about the dynamics which mitigates the issue of poor excitation of the system modes. The prior information might come from, e.g., a nominal model known to be inaccurate, or an estimated global model such as a Gaussian mixture model (GMM). A statistical model of the joint density $p(x_{t+1},x_t,u_t)$ constructed from previously collected tuples $(x_{t+1},x_t,u_t)$ provides a dynamical model of the system through the conditional $p(x_{t+1}|x_t,u_t)$. We will see that for priors from certain families, the resulting optimization problem remains convex. For the special case of a Gaussian prior over the dynamics parameters or the output, the posterior mean of the parameter vector is conveniently obtained from a Kalman-smoothing algorithm, modified to include the prior. @@ -357,6 +360,7 @@ If all densities are Gaussian and $\w$ is modeled with the Brownian random walk &+ \sum_{t=1}^{T} \norm{\mu_0(z_t) - \w_{t}}^2_{\Sigma^{-1}_0(z_t)} \end{split} \end{equation} +\recmt{Meaning of \Sigma not clear} for some function $\mu_0(z_t)$ which produces the prior mean of $\w$ given $z_t$, and $\norm{x}_{\Sigma^{-1}}^2 = x\T\Sigma^{-1}x$. In this special case, we introduce a recursive solution given by a modified Kalman smoothing algorithm, where the conditional mean of the state is updated with the prior. Consider the standard Kalman filtering equations @@ -367,7 +371,7 @@ In this special case, we introduce a recursive solution given by a modified Kalm \hat{x}_{t|t} &= \hat{x}_{t|t-1} + K_t\big(y_t-C\hat{x}_{t|t-1}\big)\\ P_{t|t} &= P_{t|t-1} - K_t C P_{t|t-1} \end{align} - where $x$ is the state vector, with state-drift covariance $R_1$ and $C$ is a matrix that relates $x$ to a measurement $y$ with covariance $R_2$ according to $y=Cx$.
The two first equations constitute the \emph{prediction} step, and the last two equations incorporate the measurement $y_t$ in the \emph{correction} step. + where $x$ is the state vector, with state-drift covariance $R_1$ and $C$ is a matrix that relates $x$ to a measurement $y$ with covariance $R_2$ according to $y=Cx$. The first two equations constitute the \emph{prediction} step, and the last two equations incorporate the measurement $y_t$ in the \emph{correction} step. The modification required to incorporate a Gaussian prior on the state variable $p(x_t|v_t) = \N(\mu_0(v_t), \Sigma_0(v_t))$ involves a repeated correction step and takes the form \begin{align} \bar{K}_t &= P_{t|t}\big(P_{t|t}+\Sigma_0(v_t)\big)^{-1}\label{eq:postk}\\ @@ -426,20 +430,20 @@ and 1 of the differentiation operator $D$ respectively. The problem of well-posedness thus reduces to determining the rank of the Hessian matrix $\tilde{\A}\T \tilde{\A} + D\T D$, where -$\operatorname{rank}(D\T D) = K(T-2)$ respectively -$K(T-1)$. The matrix $\tilde{\A}\T \tilde{\A} + D\T D$ +$\operatorname{rank}(D_2\T D_2) = K(T-2)$ respectively +$\operatorname{rank}(D_1\T D_1) = K(T-1)$. The matrix $\tilde{\A}\T \tilde{\A} + D\T D$ will thus be full rank whenever $\tilde{\A}$ has rank $2K$ or more, with a basis spanning the rank-$2K$ null space of $D$. This will be the case for any non-zero -state and input sequences off sufficient length, provided that the input, $u$, +state and input sequences of sufficient length, provided that the input, $u$, is persistently exciting of sufficient order~\cite{johansson1993system}. Another strategy is to analyze the observability of the dynamical systems~\labelcref{eq:dynsys,eq:dynsys2}, -which will yield the desired results for problems~\labelcref{eq:slow,eq:smooth} +which will yield the desired results for problems~\labelcref{eq:slow,eq:smooth}. \section{Example} - +\recmt{Only linear systems considered (pend-cart actually not linear).} \cmt{Highlight that pendcart is nonlinear or introduce simple robotic example. Mention early that an LTV system along a trajectory does not generalize far for a nonlinear system, hence the use of KL constraint.} % \begin{figure} % \centering % \setlength{\figurewidth}{0.99\linewidth} % \input{figs/states.tex} @@ -476,7 +480,7 @@ to $$A_t = \left[ \right] $$ occurred at $t=200$. -The input was Gaussian noise of zero mean and unit variance, state transition noise and measuremet noise of zero mean and $\sigma = 0.2$ were added. The generated and filtered signals signals are shown in~\cref{fig:states}. +The input was Gaussian noise of zero mean and unit variance; state-transition noise and measurement noise of zero mean and $\sigma = 0.2$ were added. The generated and filtered signals are shown in~\cref{fig:states}. \Cref{fig:ss} depicts the estimated coefficients in the dynamics matrices for a value of $\lambda$ chosen using the L-curve method~\cite{hansen1994regularization}. \begin{figure} \centering @@ -493,13 +497,14 @@ The input was Gaussian noise of zero mean and unit variance, state transition no \setlength{\figureheight }{6cm} \input{figs/ss.tex} \figurecaptionreduction - \caption{Piecewise constant state-space dynamics. True values are shown with dashed, black lines. Gaussian state-transition and measurement noise with $\sigma = 0.2$ were added.} + \caption{Piecewise constant state-space dynamics. True values are shown with dashed, black lines. Gaussian state-transition and measurement noise with $\sigma = 0.2$ were added.
\recmt{Hard to read, maybe showing error is better.}} \label{fig:ss} \end{figure} \section{Reinforcement learning} +\recmt{More details needed} In this example, we use the proposed methods to identify LTV dynamics models for reinforcement learning. The task is to stabilize a pendulum attached to a moving cart by means of moving the cart, with bounds on the control signal and a quadratic @@ -553,7 +558,7 @@ the LTV model is fit using a prior (\cref{sec:kalmanmodel}), the learning speed \section{Discussion} - +\recmt{Some paragraphs hard to follow. Linear operator extension can be moved to theory section. } This article is written with a focus on state-space models, where the state sequence is assumed to be known. This may seem like an overly optimistic setting at first, but all methods and algorithms discussed apply equally well to the input-output @@ -596,7 +601,7 @@ This allows incorporation of different scales for different variables with littl The main benefit of the regularization is the prior belief over the nature of the change in dynamics parameters it represents. By postulating a prior belief that the dynamics parameters change in a certain way, less demand is put on the data required for identification. The identification process will thus not interfere with operation in the same way as if noise would be added to the input for identification purposes. This allows learning of flexible, over-parametrized models that fit available data well. This makes the proposed identification methods attractive in applications such as guided policy search (GPS)~\cite{levine2013guided, levine2015learning} and non-linear iterative learning control (ILC)~\cite{bristow2006survey}, where they can lead to dramatically decreased sample complexity. -The Kalman-filter based identification methods can be generalized to solving optimization problems where the argument in the regularizer appearing in \labelcref{eq:smooth} is replaced for an arbitrary linear operation on the parameter vector, $P(z)k$. the extension required for the is to factor out the polynomial $z-1$ from the polynomial $P(z)$, which for~\labelcref{eq:smooth} equals $z^2 - 2z + 1 = (z-1)^2$, and augmenting the dynamical model describing the parameter evolution with a state corresponding to the polynomial $Q(x) = (z-1)^{-1}P(z)$. +The Kalman-filter based identification methods can be generalized to solving optimization problems where the argument in the regularizer appearing in \labelcref{eq:smooth} is replaced by an arbitrary linear operation on the parameter vector, $P(z)k$. The extension required is to factor out the polynomial $z-1$ from the polynomial $P(z)$, which for~\labelcref{eq:smooth} equals $z^2 - 2z + 1 = (z-1)^2$, and to augment the dynamical model describing the parameter evolution with a state corresponding to the polynomial $Q(z) = (z-1)^{-1}P(z)$. \cmt{Expand and put in theory section.} The drawbacks of LTV models include situations where the observed time variation comes from non-linear effects. Non-linearities in the system may cause an apparent time-variation as the state varies. While an LTV-model can capture this variation, it will fail to generalize well to a subsequent experiment where the states take a significantly different trajectory. Many non-linear systems are, however, approximately \emph{locally} linear, such that they are well described by a linear model in a small neighborhood around the linearization/operating point.
LTV models can thus be efficiently used for model-based reinforcement learning with variants of trust-region based optimization~\cite{schulman2015trust}.
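As a compact illustration of the piecewise-constant identification problem~\labelcref{eq:pwconstant} discussed above, the optimization can also be prototyped directly with a generic convex solver. The sketch below uses cvxpy with made-up names; it is a minimal prototype for small problems only, not the efficient ADMM implementation provided in the accompanying repository.
\begin{verbatim}
import cvxpy as cp
import numpy as np

def fit_piecewise_constant(X, U, lam):
    # X: (T+1, n) state sequence, U: (T, m) input sequence,
    # lam: regularization weight trading fit against number of changes.
    T, m = U.shape
    n = X.shape[1]
    # One parameter matrix K_t = [A_t  B_t] per time step.
    K = [cp.Variable((n, n + m)) for _ in range(T)]
    # Fitness criterion: squared one-step prediction errors.
    fit = sum(cp.sum_squares(X[t + 1] - K[t] @ np.concatenate([X[t], U[t]]))
              for t in range(T))
    # Group-sparse penalty on parameter changes promotes piecewise-constant K_t.
    reg = sum(cp.norm(cp.vec(K[t + 1] - K[t]), 2) for t in range(T - 1))
    cp.Problem(cp.Minimize(fit + lam * reg)).solve()
    return [k.value for k in K]
\end{verbatim}
Sweeping lam traces out the trade-off between fit and the number of detected changes, in the spirit of the L-curve procedure used in the example above.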