Neural Adaptive Video Streaming with Pensieve, by Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh (MIT). This paper was accepted at SIGCOMM '17.
Topic: adaptive bitrate (ABR) algorithm
Challenges:
- Network conditions fluctuate over time and are hard to predict; ABR algorithms should therefore also rely on more stable signals such as buffer occupancy.
- Conflicting QoE factors (bitrate, rebuffering, smoothness) must be balanced.
- Bitrate decisions have cascading effects: choosing a high bitrate now drains the buffer and constrains future choices.
- There is an inherent tradeoff between high video quality and playback continuity.
In this paper, the authors propose a learning-based ABR algorithm, Pensieve, which applies Reinforcement Learning (RL). The goal of the learning is to maximize the expected QoE. Fig. 1 in the paper shows how RL is applied to ABR.
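The reward at each step is the QoE of the downloaded chunk. As a small illustration, the linear member of the paper's QoE family (quality minus rebuffering and smoothness penalties, with q(R) = R and rebuffer weight mu = 4.3) could be computed as follows; the function name is ours:

```python
def qoe_lin(bitrates, rebuffer_times, mu=4.3):
    """QoE = sum_n q(R_n) - mu * sum_n T_n - sum_n |q(R_{n+1}) - q(R_n)|,
    with q(R) = R (linear quality) and mu = 4.3 (rebuffer penalty)."""
    quality = sum(bitrates)                     # total perceived quality
    rebuffer = mu * sum(rebuffer_times)         # penalty for time spent stalled
    smoothness = sum(abs(b - a)                 # penalty for bitrate switches
                     for a, b in zip(bitrates, bitrates[1:]))
    return quality - rebuffer - smoothness
```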

Pensieve uses A3C, a state-of-the-art actor-critic method, as its RL training algorithm; it involves training two neural networks. More specifically, the network architecture is described below.

Inputs
The architecture contains two parts: an actor network and a critic network. Both take the same input and have two hidden layers. The inputs are (a sketch of assembling this state follows the list):
- x: network throughput for past k chunks (k = 8)
- \tau: download time of the past k chunks
- n: m available sizes of the next chunk (m = 6)
- b: current buffer level
- c: number of chunks remaining in the video
- l: last chunk’s bitrate
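As a rough sketch of how these signals could be packed into one fixed-size observation (the helper name is hypothetical; the released Pensieve code uses a 6 x 8 state matrix, though its normalization differs from this sketch):

```python
import numpy as np

K, M = 8, 6  # past chunks observed, available bitrate levels

def make_state(throughputs, download_times, next_chunk_sizes,
               buffer_level, chunks_left, last_bitrate):
    """Pack the six input signals into a 6 x K observation matrix.
    Vector histories fill a row from the right; scalars occupy one slot."""
    s = np.zeros((6, K))
    s[0, -len(throughputs):] = throughputs        # x: past chunk throughputs
    s[1, -len(download_times):] = download_times  # tau: past download times
    s[2, :M] = next_chunk_sizes                   # n: sizes of the next chunk
    s[3, -1] = buffer_level                       # b: current buffer level
    s[4, -1] = chunks_left                        # c: chunks remaining
    s[5, -1] = last_bitrate                       # l: last chunk's bitrate
    return s
```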
Actor-Critic Network
The goal of the actor network is to learn a policy \pi_\theta(s, a): the probability that action a is taken in state s.
The goal of the critic network is to learn an estimate of the value function v^{\pi_\theta}(s): the expected total reward obtained by starting at state s and following the policy \pi_\theta.
The network parameters \theta are updated during training. Generally, the policy parameters follow the policy gradient:

\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(s_t, a_t) A^{\pi_\theta}(s_t, a_t)

where \alpha is the learning rate and A^{\pi_\theta}(s_t, a_t) is the advantage: how much better taking action a_t in state s_t is than following the policy's average behavior.
In practice, the agent samples a trajectory of bitrate decisions and uses the empirically computed advantage A(s_t, a_t): the discounted return actually observed from step t minus the critic's value estimate.
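As a minimal sketch of that computation (the discount factor gamma = 0.99 is our assumption, and the function name is hypothetical):

```python
import numpy as np

def empirical_advantages(rewards, values, gamma=0.99):
    """A(s_t, a_t) = R_t - v(s_t), where R_t is the discounted return
    observed from step t onward in the sampled trajectory and v(s_t)
    is the critic's value estimate at that step."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):   # accumulate returns backward
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns - np.asarray(values, dtype=float)
```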

By adding an entropy regularization term that pushes the policy toward exploration, \theta is updated according to:

\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(s_t, a_t) A(s_t, a_t) + \beta \nabla_\theta H(\pi_\theta(\cdot \mid s_t))

where H(\cdot) is the entropy of the policy in each state and \beta sets the strength of the exploration incentive.
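In an autodiff framework, this update is usually implemented as a surrogate loss. A minimal PyTorch sketch (the function name and the fixed beta are ours; an implementation might instead anneal beta over the course of training):

```python
import torch

def actor_loss(probs, actions, advantages, beta=0.01):
    """Surrogate loss whose gradient matches the update rule above:
    minimizing it ascends sum_t log pi(a_t|s_t) A(s_t, a_t) + beta * H(pi(.|s_t)).
    probs: (T, m) actor outputs; actions: (T,) chosen bitrate indices;
    advantages: (T,) empirical advantages (treated as constants here)."""
    dist = torch.distributions.Categorical(probs=probs)
    policy_gradient = (dist.log_prob(actions) * advantages.detach()).sum()
    entropy_bonus = dist.entropy().sum()
    return -(policy_gradient + beta * entropy_bonus)
```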
Some Details of Network Layers
The 1D-CNN layers process the vector inputs. Each uses 128 filters of size 4 with stride 1. The outputs of the 1D-CNN layers are aggregated with the other (scalar) inputs in a hidden layer of 128 neurons, and the actor applies a softmax over the available bitrates to produce the policy.
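Putting the pieces together, here is a minimal PyTorch sketch of the actor's layer shapes (the class name is ours; one conv layer is shared across the three vector inputs for brevity, and the next-chunk-size vector is padded to length k, both simplifications of the paper's design):

```python
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    """Layer shapes from the description above: a 1D conv with 128 filters
    of size 4 (stride 1) over each vector input, a 128-neuron hidden layer
    merging all features, and a softmax over the m bitrate choices."""
    def __init__(self, k=8, m=6):
        super().__init__()
        self.conv = nn.Conv1d(1, 128, kernel_size=4, stride=1)
        conv_feats = 128 * (k - 4 + 1)    # flattened conv output per vector input
        self.hidden = nn.Linear(3 * conv_feats + 3, 128)  # 3 vectors + 3 scalars (b, c, l)
        self.out = nn.Linear(128, m)

    def forward(self, x, tau, n, scalars):
        # x, tau, n: (batch, 1, k); scalars: (batch, 3)
        feats = [torch.flatten(torch.relu(self.conv(v)), start_dim=1)
                 for v in (x, tau, n)]
        h = torch.relu(self.hidden(torch.cat(feats + [scalars], dim=1)))
        return torch.softmax(self.out(h), dim=-1)   # pi(s, a) over bitrates
```

The critic uses the same layers but replaces the softmax output with a single linear neuron that estimates v^{\pi_\theta}(s).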