
Machine Learning Studio

6,000 subscribers

3,671 views · 125 likes · 2023/09/17

Post-LN Transformers suffer from unbalanced gradients: gradients can vanish or explode across layers, leading to unstable training. A learning-rate warmup stage is a common practical remedy, but it introduces additional hyper-parameters to tune, making Transformer training more difficult.
In this video, we look at alternatives to the Post-LN Transformer, including the Pre-LN Transformer and ResiDual, a Transformer with dual residual connections.
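
Below is a minimal PyTorch sketch (my own illustration, not code from the video) of where LayerNorm sits in each variant. A small feed-forward net stands in for any sub-layer (attention or FFN); the class names, dimensions, and layer count are illustrative assumptions, and the ResiDual update is paraphrased from Xie et al. (2023).

# Sketch: Post-LN vs Pre-LN vs ResiDual normalization placement (illustrative only).
import torch
import torch.nn as nn


class SubLayer(nn.Module):
    # Stand-in for any Transformer sub-layer (attention or FFN).
    def __init__(self, d_model=64, d_ff=256):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.ff(x)


class PostLNBlock(nn.Module):
    # Original Transformer: add the residual first, then normalize the sum.
    def __init__(self, d_model=64):
        super().__init__()
        self.sub, self.norm = SubLayer(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sub(x))


class PreLNBlock(nn.Module):
    # Pre-LN: normalize the sub-layer input; the residual path is left
    # untouched, which keeps gradient magnitudes more uniform across depth.
    def __init__(self, d_model=64):
        super().__init__()
        self.sub, self.norm = SubLayer(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sub(self.norm(x))


class ResiDualStack(nn.Module):
    # ResiDual (sketched from Xie et al., 2023): each layer updates a
    # Post-LN-style stream x and an un-normalized Pre-LN-style stream d;
    # the final output combines both as x + LayerNorm(d).
    def __init__(self, n_layers=4, d_model=64):
        super().__init__()
        self.subs = nn.ModuleList(SubLayer(d_model) for _ in range(n_layers))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_layers))
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        d = x  # dual (Pre-LN-like) residual stream
        for sub, norm in zip(self.subs, self.norms):
            y = sub(x)       # sub-layer reads the normalized stream
            x = norm(x + y)  # Post-LN-style update
            d = d + y        # accumulate raw outputs, no normalization
        return x + self.final_norm(d)


if __name__ == "__main__":
    h = torch.randn(2, 10, 64)  # (batch, seq, d_model)
    print(PostLNBlock()(h).shape, PreLNBlock()(h).shape, ResiDualStack()(h).shape)

The essential difference is whether the residual path passes through LayerNorm: Post-LN normalizes the sum, Pre-LN leaves the residual untouched, and ResiDual keeps one stream of each kind and merges them at the output.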

References:
1. "On Layer Normalization in the Transformer Architecture", Xiong et al., (2020)
2. "Understanding the Difficulty of Training Transformers", Liu et al., (2020)
3. "ResiDual: Transformer with Dual Residual Connections", Xie et al., (2023)
4. "Learning Deep Transformer Models for Machine Translation", Wang et al., (2019)
