(July 26th, 2020)

This became a published paper on arXiv.

## Can we use equivariance to disentangle components of a video?

### Background

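- As explored in ICML 2020, an equivariant renderer is one where transforming the scene representation and then rendering gives the same image as rendering first and then applying the corresponding transformation to the image.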

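- It learns by training to minimize the difference between the transformed input and the render of the transformed feature, where the input transformation is a known rotation and its counterpart is the same rotation but in feature space. Afterwards, the renderer can render unknown viewpoints just by transforming the feature.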

- Note that in a video we have several parts which can be placed into two sets: C and B. Here, C is the set of moving characters and B is the static background (everything else).

- Further, at an appropriate level of granularity, the change in a character's position between any two sufficiently close frames can be modeled as an affine transformation. This is a model of the movement and is of course prone to mistakes, but it's a reasonable assumption that should hold better as we increase the frames per second.
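
To make the equivariance condition and the affine motion model concrete, here is a minimal sketch on 2D points rather than images. Everything in it is an illustrative assumption, not the setup from the ICML paper: `affine`, `apply_affine`, and `toy_render` are hypothetical names, and the toy renderer just doubles coordinates. It checks that transforming the latent and then rendering can match rendering and then transforming, provided the image-space transformation is the right counterpart of the latent-space one.

```python
import math

def affine(theta, tx, ty):
    """2x3 affine matrix: rotate by theta (radians), then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty]]

def apply_affine(A, points):
    """Apply a 2x3 affine matrix to a list of (x, y) points."""
    return [(A[0][0] * x + A[0][1] * y + A[0][2],
             A[1][0] * x + A[1][1] * y + A[1][2]) for x, y in points]

def toy_render(latent):
    """Hypothetical renderer (an assumption for this sketch): doubles every coordinate."""
    return [(2.0 * x, 2.0 * y) for x, y in latent]

latent = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
T_latent = affine(0.1, 0.5, -0.2)  # small motion between two close frames
T_image = affine(0.1, 1.0, -0.4)   # its image-space counterpart for this toy renderer

transform_then_render = toy_render(apply_affine(T_latent, latent))
render_then_transform = apply_affine(T_image, toy_render(latent))

# Largest coordinate-wise disagreement between the two orders of operations.
max_gap = max(abs(a - b)
              for p, q in zip(transform_then_render, render_then_transform)
              for a, b in zip(p, q))
```

Here `max_gap` is zero up to floating-point error. The two matrices differ (the translation is doubled in image space), which is why the latent-space and image-space transformations are treated as separate objects.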

### Hypotheses

- From this model, we hypothesize that we can disentangle a scene into characters and background by training a renderer to be equivariant to affine transformations of a character and invariant to transformations of the background.

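- In other words, assuming a single moving character, we train the renderer so that an affine transformation of the character in the frame corresponds to a transformation of the character's feature, while the background's feature stays fixed.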

- To do this, we utilize one main scene loss along with two supplementary losses representing the character and background respectively:

- The domains and ranges are as follows:
  - : image of shape to feature of shape
  - : image of shape to feature of shape
  - : see Variations below
  - : image of shape to feature of shape

- Because the character feature varies across frames, it should account for the varying parts of the sequence (the character), and because the background feature is constant, it should account for the non-varying parts (the background). This has hints of slow feature analysis.

- Afterward, we can render the character in any affine way on any of the trained backgrounds by mixing the background features with affine transformations of the character feature. This lets us make new videos with this character.

- Note that the affine transformation acts on the character feature like a spatial transformer. Here, the feature is not a flat encoding but rather the output of a convolutional step (so it keeps its spatial layout), and so the transformation applies an affine warp to each part.
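
The spatial-transformer-style action on a convolutional feature map can be sketched as follows. This is a simplified illustration under stated assumptions: pure Python, nearest-neighbor sampling instead of the bilinear sampling usually used in spatial transformers, zeros outside the map, and hypothetical names (`warp_channel`, `warp_feature_map`). The same affine warp is applied to every channel.

```python
def warp_channel(channel, A_inv):
    """Warp one H x W channel.

    Each output pixel (x, y) samples the input at A_inv applied to (x, y, 1),
    rounded to the nearest pixel. Samples that fall outside the map become 0.0.
    """
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sx = round(A_inv[0][0] * x + A_inv[0][1] * y + A_inv[0][2])
            sy = round(A_inv[1][0] * x + A_inv[1][1] * y + A_inv[1][2])
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = channel[sy][sx]
    return out

def warp_feature_map(feature_map, A_inv):
    """Apply the same affine warp to every channel of a C x H x W feature map."""
    return [warp_channel(channel, A_inv) for channel in feature_map]

# A 2-channel, 3 x 3 feature map (made-up values for illustration).
feature_map = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]],
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
]

# Shifting content one pixel to the right means output x samples input x - 1.
shift_right_inv = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
warped = warp_feature_map(feature_map, shift_right_inv)
```

Using the inverse matrix for sampling is a common design choice because it guarantees every output pixel gets a value, rather than leaving holes where no input pixel happens to land.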

### Variations

- Assuming a 2d image sequence, the transformation in image space and the one in feature space are both affine matrices. What is the domain of the function that produces the feature-space transformation?
- One variation is that it is a function of just the image-space transformation. In that case, we need to know the affine transformation applied to the character during training.
- Another variation is that it is a function of a pair of frames. This variation allows us to learn from just sequences and thus create a catalog of templates with which we can render the character from scene to scene in a similar manner as it was rendered elsewhere.
- And a third variation is that it is a function of the encoded features of a pair of frames. This variation is arguably better founded than the one above because the function does not need to relearn how to separate character from background and can instead take advantage of the encoder already needing to learn that.
- We ultimately want to render a frame as if it were transformed by the same transformation that took one frame of a sequence to the next.
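
The last point, re-rendering a frame with the transformation observed between two other frames, can be sketched on 2D points. This is only a stand-in for a learned function that takes a pair of frames: `estimate_translation` is a hypothetical name, it recovers only a translation (the mean displacement) rather than a full affine transformation, and it works on point coordinates rather than images or features.

```python
def estimate_translation(points_i, points_j):
    """Hypothetical stand-in for a learned network.

    Returns a 2x3 affine matrix holding the mean displacement from frame i to frame j.
    """
    n = len(points_i)
    dx = sum(xj - xi for (xi, _), (xj, _) in zip(points_i, points_j)) / n
    dy = sum(yj - yi for (_, yi), (_, yj) in zip(points_i, points_j)) / n
    return [[1.0, 0.0, dx], [0.0, 1.0, dy]]

def apply_affine(A, points):
    """Apply a 2x3 affine matrix to a list of (x, y) points."""
    return [(A[0][0] * x + A[0][1] * y + A[0][2],
             A[1][0] * x + A[1][1] * y + A[1][2]) for x, y in points]

# The character's keypoints in two close frames of one sequence (made-up values).
frame_i = [(0.0, 0.0), (1.0, 0.0)]
frame_j = [(0.5, 0.25), (1.5, 0.25)]

# The same character somewhere else, e.g. in a different scene.
frame_k = [(3.0, 3.0)]

T = estimate_translation(frame_i, frame_j)
moved_k = apply_affine(T, frame_k)  # frame k moved the way frame i moved to frame j
```

Collecting the matrices returned for many frame pairs would give the kind of template catalog described above, without ever being told the transformation during training.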