In some situations, this meant also changing unrelated files to directly include tsl/platform/statusor.h to get the definitions for TF_ASSIGN_OR_RETURN, etc., where they were getting transitively included for free.
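For example (with `Config` and `LoadConfig` as hypothetical stand-ins for affected code), a file now spells out the include instead of relying on a transitive one:
```
#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "tsl/platform/statusor.h"  // included directly: defines TF_ASSIGN_OR_RETURN

struct Config {
  int version = 1;
};

absl::StatusOr<Config> LoadConfig() { return Config{}; }

absl::Status ProcessConfig() {
  // Previously this macro often compiled only because some other header
  // pulled in tsl/platform/statusor.h transitively.
  TF_ASSIGN_OR_RETURN(Config config, LoadConfig());
  (void)config.version;
  return absl::OkStatus();
}
```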
PiperOrigin-RevId: 644129488
A device's slice_index depends on its LocalTopologyProto's boot_id. Currently, all devices in a mocked GPU client will have the same slice_index due to the boot_id being identical, which breaks hybrid mesh construction in AOT compilation.
This change sets a distinct boot_id for each LocalTopology.
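A rough sketch (not the actual patch) of the idea: give each mocked LocalTopology a distinct boot_id so its devices resolve to distinct slice_index values. The loop and naming scheme below are illustrative.
```
#include <vector>

#include "absl/strings/str_cat.h"
#include "xla/pjrt/distributed/protocol.pb.h"  // assumed LocalTopologyProto location

void SetDistinctBootIds(std::vector<xla::LocalTopologyProto>& topologies) {
  for (int i = 0; i < static_cast<int>(topologies.size()); ++i) {
    // Identical boot_ids would collapse all devices into a single slice.
    topologies[i].set_boot_id(absl::StrCat("mock_boot_id:", i));
  }
}
```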
PiperOrigin-RevId: 644118436
This CL adds more convolution benchmarks. The benchmarks are based on shapes from XLA convolution tests, TF convolution benchmarks, and Eigen spatial convolution benchmarks.
PiperOrigin-RevId: 644104994
In some situations, this meant also changing unrelated files to directly include tsl/platform/statusor.h to get the definitions for TF_ASSIGN_OR_RETURN, etc., where they were getting transitively included for free.
PiperOrigin-RevId: 644097216
The IFRT API now distinguishes reshard vs. copy, so this CL reflects that semantic change in the JAX Python binding. Since pjit input sharding is relatively easy to batch, the relevant code path was also rewritten to leverage the batched `CopyArrays` API, as sketched below.
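A hedged sketch of the batching idea on the binding's C++ side; the exact `ifrt::Client::CopyArrays` signature and enum values are assumptions here and may differ from the real API.
```
#include <optional>
#include <utility>
#include <vector>

#include "absl/types/span.h"
#include "xla/python/ifrt/array.h"   // assumed header locations
#include "xla/python/ifrt/client.h"

// Copy all pjit inputs to the destination devices in one batched call,
// instead of issuing one copy per array.
absl::StatusOr<std::vector<tsl::RCReference<xla::ifrt::Array>>>
CopyPjitInputs(xla::ifrt::Client* client,
               std::vector<tsl::RCReference<xla::ifrt::Array>> inputs,
               std::optional<xla::ifrt::DeviceList> dst_devices) {
  return client->CopyArrays(absl::MakeSpan(inputs), std::move(dst_devices),
                            /*memory_kind=*/std::nullopt,
                            xla::ifrt::ArrayCopySemantics::kReuseInput);
}
```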
PiperOrigin-RevId: 644048754
Imported from GitHub PR https://github.com/openxla/xla/pull/13190
This patch recognises pipelined while loops using the rotate-right pattern: it matches a rotate-right pattern on sharded inputs and labels the surrounding while loop as a pipelined while loop. Because this is an unsafe optimization, it is hidden behind a debug flag so that it won't be triggered unexpectedly in TPU pipelines or other GPU pipelines. The data movement being matched is illustrated below.
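For intuition, rotating a buffer right by k, i.e. concat(slice(buf, n-k, n), slice(buf, 0, n-k)), can be written out in plain C++; this is only an illustration of the arithmetic, not the actual HLO matcher:
```
#include <vector>

// Rotate `buf` right by k positions (assumes 0 <= k).
std::vector<int> RotateRight(const std::vector<int>& buf, int k) {
  const int n = static_cast<int>(buf.size());
  std::vector<int> out(n);
  for (int i = 0; i < n; ++i) {
    out[(i + k) % n] = buf[i];  // element i moves right by k, wrapping around
  }
  return out;
}
```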
Copybara import of the project:
--
b643b17a5d25d46e838a37fe87a4134b2cc512aa by Shraiysh Vaishay <svaishay@nvidia.com>:
Add pipelined while loop annotator
This patch recognises pipelined while loops using the rotate-right pattern: it matches a rotate-right pattern on sharded inputs and labels the surrounding while loop as a pipelined while loop. Because this is an unsafe optimization, it is hidden behind a debug flag so that it won't be triggered unexpectedly in TPU pipelines or other GPU pipelines.
Merging this change closes #13190
PiperOrigin-RevId: 644012174
Imported from GitHub PR https://github.com/openxla/xla/pull/13462
This PR https://github.com/openxla/xla/pull/11514 added workspace allocation to cublas-lt. In doing so, it essentially duplicated the implementation of a number of functions in gpu/cu/hipblas-lt, so we now have:
```
DoMatmul(..., std::optional<DeviceMemoryBase> workspace)
DoMatmul(..., std::optional<ScratchAllocator*> scratch_allocator)
DoMatmul(..., std::optional<DeviceMemoryBase> workspace, std::optional<ScratchAllocator*> scratch_allocator)
```
and the same holds for `ExecuteOnStream`. This makes the gpublas_lt interface barely readable. The first two functions outlined above simply forward to the third, most generic one, so there is no need to implement them inside the derived classes (hip_blas_lt.h and cuda_blas_lt.h); the forwarding can instead be handled once in the gpu_blas_lt.h interface, as sketched below.
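A minimal sketch of that layout, with simplified stand-in types rather than the real XLA signatures: the two convenience overloads are implemented once in the base interface and forward to the single generic virtual that the backends override.
```
#include <optional>
#include <utility>

#include "absl/status/status.h"

struct DeviceMemoryBase {};  // stand-in for se::DeviceMemoryBase
class ScratchAllocator {};   // stand-in for se::ScratchAllocator

class BlasLtInterface {
 public:
  virtual ~BlasLtInterface() = default;

  // Convenience overloads, implemented once here: they only forward.
  absl::Status DoMatmul(std::optional<DeviceMemoryBase> workspace) {
    return DoMatmul(std::move(workspace), /*scratch_allocator=*/std::nullopt);
  }
  absl::Status DoMatmul(std::optional<ScratchAllocator*> scratch_allocator) {
    return DoMatmul(/*workspace=*/std::nullopt, scratch_allocator);
  }

  // The single generic entry point each derived backend implements.
  virtual absl::Status DoMatmul(
      std::optional<DeviceMemoryBase> workspace,
      std::optional<ScratchAllocator*> scratch_allocator) = 0;
};
```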
@xla-rotation: could you please have a look?
Copybara import of the project:
--
6d3700a7b4141dee82a3b3f4d6be492a0a67d92b by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
refactoring
--
495b2cc7b5a4e944804acddc9abc9442d9cce32a by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
cuda side adaptions
--
4078221daebb8cb88faebe9423e87a1a781a765b by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
parameter fix
Merging this change closes #13462
PiperOrigin-RevId: 644003051
We have substantial changes coming soon to TSL logging; add tests
to make sure we will not introduce regressions.
Also remove vmodule_test from tensorflow/core/platform since
it is now redundant. Besides, TSL is a more appropriate place for
these tests.
PiperOrigin-RevId: 643994337
Now, in order to query support information about the legacy Triton emitters, it is necessary to call functions in the `xla::gpu::legacy_triton` namespace. This helps clarify our tests and, most notably, lets us evolve the new and the old Triton emitters independently, without having to port logic from one to the other. A sketch of the new call-site shape follows below.
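A hedged sketch of the new call-site shape; the function name, header path, and bool conversion below are assumptions for illustration, not verified API.
```
#include "xla/service/gpu/triton_support.h"  // assumed header location

bool SupportedByLegacyEmitters(const xla::HloInstruction& instr,
                               const se::GpuComputeCapability& gpu_version) {
  // Support queries for the legacy emitters are explicitly namespaced so the
  // legacy logic can diverge from the new emitters.
  auto decision = xla::gpu::legacy_triton::IsTritonSupportedInstruction(
      instr, gpu_version);
  return static_cast<bool>(decision);  // assumed convertible to bool
}
```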
PiperOrigin-RevId: 643988589
Imported from GitHub PR https://github.com/openxla/xla/pull/13771
Some algsimp patterns used the `IsPositive()` check on their operands instead of `IsNonNegative()`. As a result, the patterns did not fire for operands that are only known to be non-negative (>= 0), such as `abs(x)`, even though they should.
## Fixes:
### Fix for pattern `rsqrt(B) * rsqrt(B) => 1/B`
Issue: the pattern did not fire for B >= 0, for example when B = abs(x).
Solution: fixed by checking that B `IsNonNegative()`.
Validation:
- If `B == 0`, the result is `inf`
- If `B > 0`, the result is `> 0`
- If `B` is `inf`, the result is `0`
- If `B` is `nan`, the result is `nan`
### Fix for pattern `rsqrt(pow(A, -2)) => A`
Issue: the pattern did not fire for A >= 0, for example when A = abs(x).
Solution: fixed by checking that A `IsNonNegative()`.
Validation (before and after the simplification):
- If `A == 0`, the result is `0`
- If `A > 0`, the result is `> 0`
- If `A` is `inf`, the result is `inf`
- If `A` is `nan`, the result is `nan`
Additional fix: since we know that A is non-negative, we can use A directly without wrapping it in `abs()`.
### Fix for pattern `rsqrt(1/A) => sqrt(A)`
Issue: the pattern did not fire for A >= 0, for example when A = abs(x).
Solution: fixed by checking that A `IsNonNegative()`.
Validation (before and after the simplification; see the standalone numerical check below):
- If `A == 0`, the result is `0`
- If `A > 0`, the result is `> 0`
- If `A` is `inf`, the result is `inf`
- If `A` is `nan`, the result is `nan`
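These edge cases are easy to spot-check numerically outside of XLA. A standalone sketch using rsqrt(x) = 1/sqrt(x) that reproduces the tables above (plain C++, not XLA code):
```
#include <cmath>
#include <cstdio>
#include <limits>

double Rsqrt(double x) { return 1.0 / std::sqrt(x); }

int main() {
  const double inf = std::numeric_limits<double>::infinity();
  const double nan = std::numeric_limits<double>::quiet_NaN();
  // rsqrt(B) * rsqrt(B) => 1/B: inf at 0, positive for B > 0,
  // 0 at inf, nan at nan.
  for (double b : {0.0, 2.0, inf, nan}) {
    std::printf("B=%g  rsqrt(B)^2=%g  1/B=%g\n", b, Rsqrt(b) * Rsqrt(b),
                1.0 / b);
  }
  // rsqrt(pow(A, -2)) => A and rsqrt(1/A) => sqrt(A), for A >= 0.
  for (double a : {0.0, 2.0, inf, nan}) {
    std::printf("A=%g  rsqrt(A^-2)=%g  rsqrt(1/A)=%g  sqrt(A)=%g\n", a,
                Rsqrt(std::pow(a, -2.0)), Rsqrt(1.0 / a), std::sqrt(a));
  }
  return 0;
}
```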
Copybara import of the project:
--
e0251aa875c38442834be6f93ae534e481d06e2a by Alexander Pivovarov <pivovaa@amazon.com>:
Fix algebraic_simplifier for rsqrt
Merging this change closes #13771
PiperOrigin-RevId: 643982420
Imported from GitHub PR https://github.com/openxla/xla/pull/13779
Here we enable the XLA-managed workspace buffer for rocblas.
@xla-rotation: would you please have a look?
Copybara import of the project:
--
548701656df3a9e12bc6bae201113c5a7410f9b4 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
added memory management functions
--
b01f6c03b0ef1b9f9588fa438fa278aac68d727b by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
addressing reviewer comments
--
3b96edbee38cbfcc30779f8820956c219a512da7 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
beautified rocblas wrapper
--
d65b695a1444f59d89095031ac8e00e3e1556d61 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
added explicit nullptr
Merging this change closes #13779
PiperOrigin-RevId: 643967546
Imported from GitHub PR https://github.com/openxla/xla/pull/13722
Some number types (like complex64) are generally supported by rocBLAS, but the library does not provide any solutions for autotuning them. In that case, we fall back to the default solution (`kDefaultAlgorithm`); a sketch of this logic follows below.
I have also added the workspace buffer provided by the XLA runtime, which was previously ignored by rocBLAS.
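A sketch of the fallback with simplified stand-in types: `AlgorithmId` and the local `kDefaultAlgorithm` sentinel below are illustrative, not the exact stream_executor definitions.
```
#include <cstdint>
#include <vector>

using AlgorithmId = int64_t;                   // stand-in for the real ID type
constexpr AlgorithmId kDefaultAlgorithm = -1;  // assumed sentinel value

// Pick an algorithm from the library's autotuning candidates, falling back
// to the default algorithm when the library offers none (e.g. complex64).
AlgorithmId PickGemmAlgorithm(const std::vector<AlgorithmId>& solutions) {
  if (solutions.empty()) {
    return kDefaultAlgorithm;
  }
  return solutions.front();  // placeholder for the real autotuned choice
}
```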
@xla-rotation: would you please have a look?
Copybara import of the project:
--
57645a2bfb357b9d00f6b85ae3b0e77a4b00fb61 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
use fallback default algorithm if no solutions are provided by the library
--
be8327f8e2e11f0eaf2383171340f69a23759260 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
adding space
Merging this change closes #13722
PiperOrigin-RevId: 643925837