[pytorch/pytorch] torch.dot under vmap lowers to unfused extern_kernels.bmm — pointwise mul+sum is 25-30% faster for small vectors
### ROOT CAUSE
When `torch.dot` is batched with `vmap` and compiled, the batched dot product is lowered to `extern_kernels.bmm`, which is not optimized for small vectors. Because `bmm` runs as an external kernel, it cannot be fused with surrounding pointwise operations, carries high dispatch overhead, and results in many small kernel launches.
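To make the lowering concrete, here is a minimal repro sketch. The shapes and batch size are illustrative; inspecting the generated Inductor code (e.g. with `TORCH_LOGS="output_code"`) is what surfaces the `extern_kernels.bmm` call, and is not shown here.

```python
import torch

# Per-example function: a 1D dot product over small vectors.
def dot(a, b):
    return torch.dot(a, b)

# vmap batches torch.dot as a (B, 1, N) @ (B, N, 1) matmul; when the
# result is compiled, Inductor emits it as an external bmm kernel
# rather than fusing it into a single pointwise+reduction kernel.
batched_dot = torch.vmap(dot)

a = torch.randn(1024, 8)  # B=1024 small vectors of length N=8
b = torch.randn(1024, 8)
out = batched_dot(a, b)   # shape (1024,)
```

Wrapping `batched_dot` in `torch.compile` and dumping the generated code is how the unfused `bmm` lowering is observed.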
### CODE FIX
Replace `torch.dot(a, b)` with `(a * b).sum(-1)`. For small vectors the pointwise multiply and reduction can be fused and executed in a single kernel, avoiding the per-batch `bmm` dispatch. The rewrite is numerically equivalent for 1D inputs and broadcasts naturally over batched operands, which is where the reported 25-30% speedup comes from.
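A minimal sketch of the substitution; the function name `dot_fused` and the tensor shapes are illustrative, not part of the original fix:

```python
import torch

def dot_fused(a, b):
    # Pointwise multiply + last-dim reduction: fusable into one kernel,
    # unlike the extern bmm that vmap-of-torch.dot lowers to.
    return (a * b).sum(-1)

# 1D case: matches torch.dot exactly.
a = torch.randn(8)
b = torch.randn(8)
assert torch.allclose(torch.dot(a, b), dot_fused(a, b))

# Batched case: works directly on (B, N) inputs, no vmap required.
A = torch.randn(1024, 8)
B = torch.randn(1024, 8)
ref = torch.vmap(torch.dot)(A, B)
assert torch.allclose(ref, dot_fused(A, B), atol=1e-5)
```

Because the rewrite is plain broadcasting arithmetic, it also composes with `torch.compile` without falling back to an external kernel.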