[pytorch/pytorch] torch.dot under vmap lowers to unfused extern_kernels.bmm — pointwise mul+sum is 25-30% faster for small vectors
### ROOT CAUSE
When `torch.dot` is batched with `vmap` and compiled, the batched dot product is lowered to `extern_kernels.bmm`, which is not optimized for small vectors. Because `bmm` runs as an external kernel, it cannot be fused with surrounding pointwise operations, carries high dispatch overhead, and results in many small kernel launches.
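To make the lowering concrete, here is a minimal repro sketch. The shapes and batch size are illustrative; inspecting the generated Inductor code (e.g. with `TORCH_LOGS="output_code"`) is what surfaces the `extern_kernels.bmm` call, and is not shown here.

```python
import torch

# Per-example function: a 1D dot product over small vectors.
def dot(a, b):
    return torch.dot(a, b)

# vmap batches torch.dot as a (B, 1, N) @ (B, N, 1) matmul; when the
# result is compiled, Inductor emits it as an external bmm kernel
# rather than fusing it into a single pointwise+reduction kernel.
batched_dot = torch.vmap(dot)

a = torch.randn(1024, 8)  # B=1024 small vectors of length N=8
b = torch.randn(1024, 8)
out = batched_dot(a, b)   # shape (1024,)
```

Wrapping `batched_dot` in `torch.compile` and dumping the generated code is how the unfused `bmm` lowering is observed.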
### CODE FIX
Replace `torch.dot(a, b)` with `(a * b).sum(-1)`. For small vectors the pointwise multiply and reduction can be fused and executed in a single kernel, avoiding the per-batch `bmm` dispatch. The rewrite is numerically equivalent for 1D inputs and broadcasts naturally over batched operands, which is where the reported 25-30% speedup comes from.
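A minimal sketch of the substitution; the function name `dot_fused` and the tensor shapes are illustrative, not part of the original fix:

```python
import torch

def dot_fused(a, b):
    # Pointwise multiply + last-dim reduction: fusable into one kernel,
    # unlike the extern bmm that vmap-of-torch.dot lowers to.
    return (a * b).sum(-1)

# 1D case: matches torch.dot exactly.
a = torch.randn(8)
b = torch.randn(8)
assert torch.allclose(torch.dot(a, b), dot_fused(a, b))

# Batched case: works directly on (B, N) inputs, no vmap required.
A = torch.randn(1024, 8)
B = torch.randn(1024, 8)
ref = torch.vmap(torch.dot)(A, B)
assert torch.allclose(ref, dot_fused(A, B), atol=1e-5)
```

Because the rewrite is plain broadcasting arithmetic, it also composes with `torch.compile` without falling back to an external kernel.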