```{julia; eval=false; echo=false}
if !isinteractive()
using Weave
weave("/local/home/fredrikb/papers/dmp_dpg/learning_ideas.jmd")
end
```
# How to learn discontinuous functions?
\keywords{acceleration}
## Weak learner / Track error statistics
Add a weak learner in parallel with the complex learner. Alternatively: if the error suddenly grows significantly, relearn the last layer's weights using least squares (LS). This makes a large jump in the last layer's weights that can account for the big change in the data distribution.
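A minimal sketch of the LS refit, assuming the last layer is linear in the features produced by the earlier layers; the names (`maybe_refit_last_layer!`, `running_error`) and the threshold value are hypothetical:

```{julia; eval=false}
# Refit the last (linear) layer with least squares when the batch error suddenly grows.
# Φ is the N×k feature matrix from the penultimate layer, y the targets, w_last the
# last-layer weights and running_error a running average of the loss.
function maybe_refit_last_layer!(w_last, Φ, y, running_error; threshold = 3.0)
    batch_error = sum(abs2, Φ*w_last .- y) / length(y)
    if batch_error > threshold * running_error
        w_last .= Φ \ y   # LS solution: a large jump that absorbs the distribution shift
    end
    return batch_error
end
```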
## Boosting
Boosting would work well if the discontinuity is axis-aligned, but probably not otherwise.
## L1 gates
This idea could be handled by a bit mask or a sigmoid gate with an L1 penalty on gate openings.
The input (or output?) is guarded by a sigmoid gate that has, e.g., an L1 penalty on its activation,
so it is beneficial to keep only a small number of gates open in the beginning.
When a large error suddenly appears, a gate can be opened. This strategy is completely differentiable.
ReLU units probably work well.
After the gates there is simply more network, e.g.,
```math
\text{gate} = \sigma(s) \\
y = \operatorname{activation}(W_1 s) + \operatorname{activation}(W_2\,\text{gate}) + b \\
\text{loss} = \|y - \hat{y}\|_2^2 + \|\text{gate}\|_1
```
**The gate must be multiplied by an additional network.**
```{julia;eval=false}
using TensorFlow

# Sigmoid gate on the input; the gate activations receive an L1 penalty below,
# so most gates stay closed until a large error makes opening one worthwhile.
function get_gate(input, gate_size)
    W = Variable(0.00001randn(Float32, size(input,2), gate_size), trainable=true, name="gate_weights")
    b = Variable(0.00001randn(Float32, gate_size), trainable=true, name="gate_bias")
    nn.sigmoid(input*W + b, name="gate")
end
input_size = 4
output_size = 1
gate_size = 20
neurons = [100,50]
s_ = placeholder(Float32, shape=[-1,input_size])
y_ = placeholder(Float32, shape=[-1,output_size])
gate = get_gate(s_, gate_size)
# Ungated units pass straight through (gate fixed to one); only the last `gate_size` units of layer 1 are gated
gate2 = concat(1, [Tensor(ones(Float32, neurons[1]-gate_size)), gate])
W1 = Variable(0.002*rand(Float32, input_size, neurons[1])-0.001, name="weights1")
W2 = Variable(0.002*rand(Float32, neurons[1], neurons[2])-0.001, name="weights2")
W3 = Variable(0.002*rand(Float32, neurons[2], output_size)-0.001, name="weights3")
B1 = Variable(0.002*rand(Float32,neurons[1])-0.005, name="bias1")
B2 = Variable(0.002*rand(Float32,neurons[2])-0.005, name="bias2")
B3 = Variable(0*ones(Float32,output_size), name="bias3")
# Ungated baseline network q1
l1 = s_*W1 + B1 |> nn.relu
l2 = l1*W2 + B2 |> nn.relu
q1 = l2*W3 + B3
# Gated network q2: the first-layer activations are multiplied elementwise by the gate
l1 = (s_*W1 + B1 |> nn.relu).*gate2
l2 = l1*W2 + B2 |> nn.relu
q2 = l2*W3 + B3
q = [q1,q2]
loss = [reduce_mean((y_ - qi).^2) for qi in q] # TODO: check dimensions of q
weight_decay = reduce_sum(W1.^2) + reduce_sum(W2.^2) + reduce_sum(W3.^2)
gate_penalty = reduce_sum(abs(gate))
cost = loss[2] + weight_decay + gate_penalty # train the gated network q2; q1 is the ungated baseline
train_step = train.minimize(train.AdamOptimizer(1e-3), cost)
session = Session(Graph())
run(session, initialize_all_variables())
# Train on the initial data distribution
for batch in initial_data
    l, _ = run(session, [loss[2], train_step], Dict(y_ => batch[1], s_ => batch[2]))
end
# Train on the modified (shifted) data distribution
for batch in modified_data
    l, _ = run(session, [loss[2], train_step], Dict(y_ => batch[1], s_ => batch[2]))
end
```
## Implications
This might make DMP learning feasible. Simplify the DMP problem significantly and just show that it is possible to learn and adjust DMP parameters using DPG.
Initially, learn the value function around the *single* demonstration and use a pessimistic prior or a bounded network. Then use the standard framework to update the DMP,
and also use an adaptive network to learn the discontinuities.
## Evaluation
Train for a long time on the initial distribution and evaluate the RMS error on it. Then shift the distribution and evaluate the RMS error on the old distribution as well as on the new one; in particular, evaluate the maximum error on the old distribution after the shift. Compare
1. $N$ regular neurons
2. $N/2$ regular and $N/2$ gate neurons
3. Only $N/2$ regular neurons
4. Try different optimizers!
5. Try dropout
Could the cost function be modified to have a gate penalty that depends on the batch error?
Maybe the statistical analysis is not a good idea if there are very strong interactions; the neighborhood of the points with the smallest errors should be investigated further.
Model the average loss as a random walk, keep track of the innovation covariance and use it to determine when the loss is significantly larger than usual. When this occurs, reduce the L1 penalty on the gate, increase the step size on the gate and reduce the step size on the non-gated neurons. The step sizes can be proportional/inversely proportional to the number of sigmas the batch loss is from the average loss.
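A minimal sketch of this, assuming an exponentially weighted estimate of the batch-loss mean and variance; `LossTracker`, `nsigmas!`, `adapt_rates` and all constants are hypothetical placeholders:

```{julia; eval=false}
mutable struct LossTracker
    μ::Float64   # running mean of the batch loss
    σ²::Float64  # running variance, a proxy for the innovation covariance
    β::Float64   # forgetting factor, e.g. 0.99
end

# Number of standard deviations the current batch loss lies above the running mean;
# the statistics are then updated, treating the mean loss as a slowly drifting random walk.
function nsigmas!(t::LossTracker, loss)
    δ = loss - t.μ
    n = δ / sqrt(t.σ² + eps())
    t.μ  += (1 - t.β) * δ
    t.σ² = t.β * t.σ² + (1 - t.β) * δ^2
    return n
end

# When the loss jumps n sigmas, relax the L1 gate penalty, increase the gate step size
# and decrease the step size of the non-gated neurons, all proportionally to n.
function adapt_rates(n; λ₁ = 1.0, η_gate = 1e-3, η_net = 1e-3)
    s = max(n, 1.0)
    return λ₁ / s, η_gate * s, η_net / s
end
```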
# Adaptive TD(λ) / TD(n)
\keywords{acceleration}
One should strive to estimate the Q-function under the optimal policy; thus, one should hesitate before lowering one's estimate for a certain state-action pair. However, if one encounters a rollout whose discounted return is significantly higher than what the Q-function estimates, one should probably do a Monte Carlo update using this return. The rollout may not come from the current policy, but it comes from a policy closer to the optimal one. To determine whether there is a significant difference between the obtained Monte Carlo estimate and the Q-function, one can look at the difference normalized by the current TD error
$$\delta_t^{TD} = r_t + \gamma Q(s_{t+1},\mu(s_{t+1})) - Q(s_t,a_t)$$
$$\delta_t^{MC} = R_t - Q(s_t,a_t)$$
$$\dfrac{\delta_t^{MC}}{\delta_t^{TD}} > \text{Threshold}$$
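A minimal sketch of this acceptance test, assuming the Q-function, the policy μ and the rollout return R are provided by the surrounding code; the function names and the threshold value are hypothetical:

```{julia; eval=false}
# One-step TD error and Monte Carlo error for a transition (s, a, r, s′) with return R.
δ_td(r, s, a, s′, Q, μ; γ = 0.99) = r + γ * Q(s′, μ(s′)) - Q(s, a)
δ_mc(R, s, a, Q)                  = R - Q(s, a)

# Accept a Monte Carlo update only when the observed return beats the current
# estimate by much more than the ordinary TD error.
accept_mc(δmc, δtd; threshold = 3.0) = δmc / (δtd + eps()) > threshold
```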
The reason this works so well is that TD-learning is sensitive to the initialization of the $Q$-function.
Poor initialization leads to very slow learning. MC-learning, on the other hand, can get the $Q$-function in the right ballpark very fast,
making it especially useful in the beginning of learning.
When doing this "MC-hack", it might be beneficial to reduce $\gamma$ such that the noise inherent to MC-updates is reduced.
This will also mitigate the problem of states being visited several times during the same rollout.
When a rollout arrives, calculate the λ-return and use it as target instead of the TD or MC targets ($λ = 0$ or $λ = 1$) [see also](file:///local/home/fredrikb/papers/dmp_dpg/RL_davidsilver.html). *Idea*: the λ to use can be set adaptively based on how much better we think that rollout is compared to what the value function currently estimates. Alternatively, we apply TD(0), TD(1), TD(2), ... until a condition is no longer met. Then, for a really good rollout, we will continue all the way to MC.
This approach avoids a hard threshold for performing MC-hack, and has the potential to be better motivated when off-policy.
$$\dfrac{\delta^{MC}-\delta^{TD}}{\delta^{TD}}$$
After watching [lec 7](file:///local/home/fredrikb/papers/dmp_dpg/RL_davidsilver.html), this hack seems more like a strange way of doing TD(λ) for actor critic, but only for the critic. Maybe there's still something to the adaptive λ idea.
## How to
This should work on, for instance, mountain car or a similar simple domain.
One can try with $n=T/2$, evaluate the acceptance criterion, and depending on the sign, choose $n=3T/4$ or $n=T/4$. This binary search can be truncated after $k$ steps to avoid an increase in computational complexity.
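A minimal sketch of this truncated search, assuming an `accept(n)` predicate that evaluates the acceptance criterion for the $n$-step return; all names are hypothetical:

```{julia; eval=false}
# Truncated binary search over the backup length n ∈ [1, T].
# `accept(n)` returns true when the n-step return still looks significantly better
# than the current estimate, in which case a longer backup is tried.
function choose_n(accept, T; k = 4)
    lo, hi = 1, T
    n = T ÷ 2
    for _ in 1:k
        accept(n) ? (lo = n) : (hi = n)
        n = (lo + hi) ÷ 2
    end
    return n
end
```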
## Potential issues
What happens when a large negative reward is encountered? This should also be reflected in the $Q$-function. Maybe all comparisons should be made in absolute values.
# Exploration noise
\keywords{exploration, stabilization, acceleration}
What if a few steps of maximization of $Q$ over $a$, starting from $a_0 = μ(s)$, were used as exploration noise? That would mean perturbing our control signal towards a possible future control signal, hence checking how that would work out and adapting $Q$ accordingly. If the new control is bad, we can adjust $Q$ to avoid going there; if it is good, we increase the gradient towards that direction.
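A sketch of this, assuming a differentiable critic; ForwardDiff is used for the gradient of $Q$ with respect to the action, and the step size and number of steps are placeholders:

```{julia; eval=false}
using ForwardDiff

# Instead of adding random noise, perturb the policy action μ(s) by a few gradient
# ascent steps on a ↦ Q(s, a): the exploratory action is a plausible future greedy
# action rather than an arbitrary one.
function explore_action(Q, μ, s; steps = 3, η = 0.1)
    a = μ(s)
    for _ in 1:steps
        a = a .+ η .* ForwardDiff.gradient(a -> Q(s, a), a)
    end
    return a
end
```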
# Shared layers
\keywords{acceleration, data efficiency}
Use the same initial layers for both the actor and the critic. If a feature is good for learning the value function, it should be good for learning the actor as well. This should increase data efficiency quite a lot for the initial layers and make sure that the actor and critic are "compatible".
## How to
* Use preexisting implementation of DQN from github on Atari?
* Implement my own thing.
* Take someone's github implementation and translate it to a proper language, unless a Julia implementation exists?
## Potential issues
When the shared layers are modified, it affects both the actor and the critic. As a consequence, the upper layers must be modified significantly faster to keep up with the changing bottom layers, and this is not always straightforward. Maybe the actor should not be allowed to modify the shared layers? In that case, the actor can be set to adapt quite fast, but only with its own layers. The tracking mechanism will hopefully make sure things remain stable.
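A minimal sketch of the shared-trunk idea, not tied to any particular framework; all names are hypothetical. Both heads read the same features, and only the critic's gradient would be allowed to reach the shared weights:

```{julia; eval=false}
# Shared bottom layer feeding both an actor head and a critic head.
struct SharedNet
    W_shared::Matrix{Float32}   # shared feature layer
    b_shared::Vector{Float32}
    W_actor::Matrix{Float32}    # actor head
    b_actor::Vector{Float32}
    W_critic::Matrix{Float32}   # critic head
    b_critic::Vector{Float32}
end

features(net, s)  = max.(net.W_shared * s .+ net.b_shared, 0f0)               # shared ReLU features
actor(net, s)     = net.W_actor * features(net, s) .+ net.b_actor             # μ(s)
critic(net, s, a) = net.W_critic * vcat(features(net, s), a) .+ net.b_critic  # Q(s, a)

# During training, only the critic loss would backpropagate into W_shared/b_shared;
# the actor head is updated with the shared features treated as fixed inputs.
```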
# DMP learning
* Maybe time has to be included in the state. The policy will take different actions depending on the time due to the phase variable. Hence, the value of a certain state also depends on time.
* Have to implement adaptive TD(λ) for this, or at least eligibility traces.
# Roadmap
0. Make sure I know how eligibility traces for general function approximators work; see the sketch after this list. This is essential in order to improve upon anything, especially with adaptive TD(λ). Also, to tune, start with policy evaluation without modifying the policy; this should work before one tries to update both at the same time.
1. Shared layer actor/critic
2. Adaptive TD(λ)
3. Stable learning using pessimistic prior / bounded output networks
4. DMP-learning in simple domain.
5. Model based acceleration, GPS
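Regarding roadmap item 0, a minimal sketch of semi-gradient TD(λ) with accumulating eligibility traces for a parametric value function $V(θ, s)$, policy evaluation only; ForwardDiff is assumed for the gradient and all names are hypothetical:

```{julia; eval=false}
using ForwardDiff

# Semi-gradient TD(λ) with accumulating eligibility traces for a parametric value
# function V(θ, s). The episode is a sequence of (s, r, s′, done) tuples.
function td_lambda_episode(θ, V, episode; γ = 0.99, λ = 0.9, α = 1e-2)
    z = zero(θ)                                        # eligibility trace, same shape as θ
    for (s, r, s′, done) in episode
        δ = r + γ * (done ? 0.0 : V(θ, s′)) - V(θ, s)  # TD error
        z = γ * λ * z .+ ForwardDiff.gradient(θ -> V(θ, s), θ)
        θ = θ .+ α * δ .* z                            # semi-gradient update
    end
    return θ
end
```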