Motivation:
- State-of-the-art ABR algorithms face many challenges, including highly variable network throughput and conflicting video QoE requirements.
Introduction:
- The authors propose the Pensieve system, which learns ABR algorithms automatically with reinforcement learning (RL) techniques and can learn to make better rate-adaptation decisions.
- Pensieve is deployed on the server side, so the computation burden of running the neural network stays off the client.
- The proposed method, based on reinforcement learning with a neural network, generates ABR algorithms from observations rather than from hand-tuned heuristics.
- The MPC algorithm performs better than approaches that use fixed heuristics.
- The authors compare traditional ABR algorithms with Pensieve over a broad set of network conditions (trace-driven and in the wild) and QoE metrics.
System:
- In RL, an agent interacts with an environment: at each step it observes a state, takes an action, and receives a reward, and its goal is to maximize the cumulative discounted reward (see the sketch after this list).
- Pensieve's RL formulation: input (state): 1> client playback buffer occupancy, 2> past bitrate decisions, 3> several raw network signals (e.g., throughput measurements); output (action): the bitrate for the next chunk; reward: the resulting QoE, which is passed back to improve the neural network.
- The authors motivate the RL approach by comparing it with robustMPC in two examples (EXP1: the network throughput is highly variable; EXP2: the algorithm must optimize a new QoE metric geared towards users who strongly prefer HD video). robustMPC performs poorly because it lacks an accurate model of the network dynamics.
- Using TCP's slow-start-restart behavior as an example, the authors conclude that no available training environment can accurately simulate the real-world system; nonetheless, as long as Pensieve experiences a large enough set of network conditions during training, it can still learn high-quality ABR algorithms.
- Pensieve's training algorithm uses A3C, a state-of-the-art actor-critic method that trains two neural networks: an actor network (trained to output the bitrate for the next video chunk) and a critic network (trained to estimate the expected reward, used only during training). To explore the action space, the authors add an entropy term weighted by a parameter β to the actor's objective; β is set to a large value at the start and decays over time (the update formula is sketched after this list).
- To speed up training, Pensieve spawns multiple learning agents in parallel (16 by default); each agent sends its experience to a central agent, which aggregates it to produce a single ABR algorithm model.
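The RL formulation above can be summarized in a small sketch. The following is a minimal, hypothetical Python illustration of one agent-environment step for ABR, not Pensieve's actual code; the names (observe_state, select_bitrate, qoe_reward, BITRATES_KBPS) and the numeric weights are illustrative only.

    import random

    BITRATES_KBPS = [300, 750, 1200, 1850, 2850, 4300]  # candidate bitrates for the next chunk
    GAMMA = 0.99                                         # discount factor for the cumulative reward

    def observe_state(buffer_s, past_bitrates, throughputs):
        # State = playback buffer occupancy, past bitrate decisions,
        # and recent raw throughput measurements (as listed above).
        return (buffer_s, tuple(past_bitrates[-8:]), tuple(throughputs[-8:]))

    def select_bitrate(state):
        # Action = bitrate for the next chunk. A trained actor network would
        # map the state to a probability distribution over BITRATES_KBPS;
        # a random choice stands in here as a placeholder policy.
        return random.choice(BITRATES_KBPS)

    def qoe_reward(bitrate, rebuffer_s, prev_bitrate):
        # Reward = QoE of the chunk: bitrate utility, minus a rebuffering
        # penalty, minus a smoothness penalty (weights are illustrative).
        return bitrate / 1000.0 - 4.3 * rebuffer_s - abs(bitrate - prev_bitrate) / 1000.0

    def discounted_return(rewards):
        # The agent is trained to maximize this cumulative discounted reward.
        g = 0.0
        for r in reversed(rewards):
            g = r + GAMMA * g
        return g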
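For the training step itself, the standard A3C policy-gradient update with an entropy exploration term has the form

    \theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(s_t, a_t)\, A(s_t, a_t) + \beta\, \nabla_\theta H\big(\pi_\theta(\cdot \mid s_t)\big)

where \pi_\theta is the actor's policy over bitrates, A(s_t, a_t) is the advantage (how much better the chosen bitrate is than the critic's value estimate), H is the policy entropy, and \beta is the exploration weight that starts large and is decayed over training. In the parallel setup, each of the 16 agents generates experience under different network conditions and forwards it to the central agent, which applies updates of this form to the single shared model.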
Experiment:
- Comparison baselines: 1> Buffer-based, 2> Rate-based, 3> BOLA, 4> MPC, 5> robustMPC
- QoE function:
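  The general per-chunk QoE metric from the paper (shared with the earlier MPC work) is

      QoE = \sum_{n=1}^{N} q(R_n) - \mu \sum_{n=1}^{N} T_n - \sum_{n=1}^{N-1} \big| q(R_{n+1}) - q(R_n) \big|

  where q(R_n) is the perceived quality of the bitrate chosen for chunk n, T_n is the rebuffering time incurred while downloading that chunk, \mu weights the rebuffering penalty, and the last term penalizes quality switches. The QoE_lin, QoE_log, and QoE_hd variants differ only in the choice of q(\cdot).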

3.

4. A large portion of Pensieve's performance gains comes from its ability to limit rebuffering; although it does not outperform all other schemes on every individual QoE factor, it balances the factors in a way that optimizes the overall QoE metric.

5. In deployment, Pensieve can encounter networks it never saw during training. To evaluate this, the authors run two experiments (real-world experiments, and training on a purely synthetic dataset), and the results are largely the same.
6. The authors also compare single-video Pensieve with multi-video Pensieve; the results are again largely the same.
7. Pensieve's average QoE is within 0.2% of the online optimal, and within 9.1% of the offline optimal (which knows all parameters and network conditions in advance and is therefore unachievable in practice). This implies that Pensieve performs well even compared with these 'perfect' schemes.
