[pytorch/pytorch] [Bug] NaN gradients in varlen_attn backward pass when input length exceeds cu_seqlens[-1]
### ROOT CAUSE
The issue arises because the backward pass of `varlen_attn` does not properly handle inputs longer than the total length specified by `cu_seqlens[-1]`. Gradients for the extra padding tokens beyond that offset are never masked, so garbage values flow back through them and surface as NaNs during the backward pass.
### CODE FIX
Modify the backward pass of `varlen_attn` to include a gradient mask that sets gradients to zero for tokens beyond `cu_seqlens[-1]`. Here's the fix:
```python
import torch

def varlen_attn_forward(ctx, input, cu_seqlens, is_causal=False):
    # ... forward pass code ...
    total_length = cu_seqlens[-1]
    # Store total_length in ctx for the backward pass
    ctx.total_length = total_length
    return output

def varlen_attn_backward(ctx, grad_output):
    grad_input = grad_output.clone()
    # Mask is True for padding tokens beyond total_length
    mask = torch.zeros(grad_input.size(0), dtype=torch.bool, device=grad_input.device)
    mask[ctx.total_length:] = True
    # Zero the padding-token gradients so they cannot propagate NaNs
    grad_input[mask] = 0
    return grad_input
```
This ensures gradients for padding tokens are zeroed out, preventing NaNs.
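The masking behavior can be checked in isolation with a toy `torch.autograd.Function` (a hypothetical stand-in, not the real `varlen_attn` kernel): the forward is an identity, and the backward zeroes gradients for all rows at or beyond `cu_seqlens[-1]`.

```python
import torch

class MaskedIdentity(torch.autograd.Function):
    """Identity forward; backward zeroes gradients for tokens
    beyond cu_seqlens[-1]. Illustrative sketch only."""

    @staticmethod
    def forward(ctx, x, cu_seqlens):
        # Remember where the valid tokens end
        ctx.total_length = int(cu_seqlens[-1])
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        grad_input = grad_output.clone()
        # Zero gradients for padding tokens
        grad_input[ctx.total_length:] = 0
        return grad_input, None

# 6 tokens in the buffer, but cu_seqlens says only the first 4 are real
x = torch.randn(6, 3, requires_grad=True)
cu_seqlens = torch.tensor([0, 2, 4])
out = MaskedIdentity.apply(x, cu_seqlens)
out.sum().backward()
print(x.grad)  # rows 4 and 5 are all zeros
```

With the mask applied, only the first `cu_seqlens[-1]` rows receive gradient; the trailing padding rows stay exactly zero, which is the invariant the fix above restores.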