[pytorch/pytorch] [Bug] NaN gradients in varlen_attn backward pass when input length exceeds cu_seqlens[-1]
### ROOT CAUSE
The issue arises because the backward pass of `varlen_attn` does not properly handle inputs longer than the total length specified by `cu_seqlens[-1]`. Gradients for the extra padding tokens beyond that offset are never masked, so garbage values flow back through them and surface as NaNs during the backward pass.
### CODE FIX
Modify the backward pass of `varlen_attn` to include a gradient mask that sets gradients to zero for tokens beyond `cu_seqlens[-1]`. Here's the fix:
```python
import torch

def varlen_attn_forward(ctx, input, cu_seqlens, is_causal=False):
    # ... forward pass code ...
    total_length = cu_seqlens[-1]
    # Store total_length in ctx for the backward pass
    ctx.total_length = total_length
    return output

def varlen_attn_backward(ctx, grad_output):
    grad_input = grad_output.clone()
    # Mask is True for padding tokens beyond total_length
    mask = torch.zeros(grad_input.size(0), dtype=torch.bool, device=grad_input.device)
    mask[ctx.total_length:] = True
    # Zero the padding-token gradients so they cannot propagate NaNs
    grad_input[mask] = 0
    return grad_input
```
This ensures gradients for padding tokens are zeroed out, preventing NaNs.
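The masking behavior can be checked in isolation with a toy `torch.autograd.Function` (a hypothetical stand-in, not the real `varlen_attn` kernel): the forward is an identity, and the backward zeroes gradients for all rows at or beyond `cu_seqlens[-1]`.

```python
import torch

class MaskedIdentity(torch.autograd.Function):
    """Identity forward; backward zeroes gradients for tokens
    beyond cu_seqlens[-1]. Illustrative sketch only."""

    @staticmethod
    def forward(ctx, x, cu_seqlens):
        # Remember where the valid tokens end
        ctx.total_length = int(cu_seqlens[-1])
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        grad_input = grad_output.clone()
        # Zero gradients for padding tokens
        grad_input[ctx.total_length:] = 0
        return grad_input, None

# 6 tokens in the buffer, but cu_seqlens says only the first 4 are real
x = torch.randn(6, 3, requires_grad=True)
cu_seqlens = torch.tensor([0, 2, 4])
out = MaskedIdentity.apply(x, cu_seqlens)
out.sum().backward()
print(x.grad)  # rows 4 and 5 are all zeros
```

With the mask applied, only the first `cu_seqlens[-1]` rows receive gradient; the trailing padding rows stay exactly zero, which is the invariant the fix above restores.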