Commit Graph

163386 Commits

A. Unique TensorFlower
36304aadbd Add RBE toolchains for Clang on Windows.
PiperOrigin-RevId: 627830607
2024-04-24 13:50:35 -07:00
A. Unique TensorFlower
a9eaf69885 Fix float16 quantization for LSTM ops
This CL fixes 2 issues:
1. In the Quantize pass, for float16 quantization, the pass replaces the original op with an identical one, which causes an infinite loop for LSTM ops. This CL adds checks to avoid such situations.
2. In the SplitMergedOperands pass, const stateful operands are duplicated. But for float16 quantization, stateful operands have a const->dequantize pattern. This CL adds duplication for such patterns as well.

PiperOrigin-RevId: 627826212
2024-04-24 13:30:06 -07:00
A. Unique TensorFlower
c2059d7396 Uses intervals instead of a liveness matrix to construct all memory constraints.
PiperOrigin-RevId: 627825584
2024-04-24 13:23:12 -07:00
A. Unique TensorFlower
80d76b8af0 Add IsPredeterminedError to annotate an error buffer which doesn't have any events on the device.
PiperOrigin-RevId: 627822828
2024-04-24 13:09:14 -07:00
Eugene Zhulenev
ab771c1914 [tsl:concurrency] Disable ambiguous AndThen callback for AsyncValueRef<absl::Status>
PiperOrigin-RevId: 627814960
2024-04-24 12:49:21 -07:00
Alexander Belyaev
71d9dd0f06 [XLA:GPU][MLIR-based emitters] Add verifier for apply_indexing op.
PiperOrigin-RevId: 627807819
2024-04-24 12:42:18 -07:00
Roman Dzhabarov
f9e60c7763 Reverts 21f6e08a3f
PiperOrigin-RevId: 627800352
2024-04-24 12:07:44 -07:00
Anastasia Petrushkina
e30abe681f Reverts d9417b25e8
PiperOrigin-RevId: 627795237
2024-04-24 11:52:57 -07:00
A. Unique TensorFlower
f497887d7d Defines an alternate interface for the Memory Term Reducer that uses primitive intervals instead of liveness matrices.
PiperOrigin-RevId: 627790657
2024-04-24 11:46:30 -07:00
David Dunleavy
5371a1b2f0 Move tsl/concurrency to xla/tsl/concurrency
PiperOrigin-RevId: 627785092
2024-04-24 11:32:05 -07:00
Christian Sigg
8aef6ef46c Fix execution engine after 7da63426ac.
PiperOrigin-RevId: 627782515
2024-04-24 11:10:10 -07:00
Changhui Lin
5ea9fc1642 Update GpuTopologyProto to include multi-host information.
PiperOrigin-RevId: 627775502
2024-04-24 10:56:09 -07:00
David Dunleavy
a8b703351d Remove unused requires-gpu tag from tf_gpu_test_tags
PiperOrigin-RevId: 627774282
2024-04-24 10:51:41 -07:00
mmakevic-amd
7b201a9d70 PR #11784: [ROCm] Add Dropout support based on MIOpen API
Imported from GitHub PR https://github.com/openxla/xla/pull/11784

Copybara import of the project:

--
fc334e480802d2686907ff720a97697e51177690 by mmakevic <Milica.Makevic@amd.com>:

Add dropout descriptor for rocm

--
52cf3ad49e70b3ecf632d8bf83283a068fbb58d0 by mmakevic <Milica.Makevic@amd.com>:

Use SetRNNDescriptor version 2

Merging this change closes #11784

PiperOrigin-RevId: 627770243
2024-04-24 10:39:54 -07:00
Johannes Reifferscheid
b46d184d9d Remove unnecessary epilogue restrictions and corner cases.
- Allow injected values to be roots of epilogues.
- Allow n-ary values to be injected.
- Always generate an epilogue function if it's requested.
- Simplify the EmitEpilogue interface to hide indexes and calling convention
  details from the caller.

This simplifies the logic in the reduction emitter somewhat.

PiperOrigin-RevId: 627770230
2024-04-24 10:26:17 -07:00
Deqiang Chen
dadbf6d47e Add XlaHostComputeOp to the allowed list in mark_ops_for_outside_compilation.cc
PiperOrigin-RevId: 627746297
2024-04-24 09:22:32 -07:00
A. Unique TensorFlower
51a4843aff Allow for short-circuiting optimization in ShapeUtil::EqualStructure.
Note that operator& doesn't do short-circuiting, unlike operator&&.
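
As a standalone illustration of that difference (the `Expensive` helper below is hypothetical, not part of ShapeUtil):

```c++
#include <iostream>

// Hypothetical stand-in for a costly per-subshape comparison.
bool Expensive(const char* tag) {
  std::cout << "evaluated " << tag << "\n";
  return true;
}

int main() {
  // operator&& short-circuits: the right-hand side is never evaluated once
  // the left-hand side is false.
  bool a = false && Expensive("rhs of &&");  // prints nothing

  // operator& evaluates both operands unconditionally, so the costly call
  // still runs even though the result is already determined.
  bool b = false & Expensive("rhs of &");    // prints "evaluated rhs of &"

  std::cout << a << " " << b << "\n";
  return 0;
}
```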

PiperOrigin-RevId: 627745915
2024-04-24 09:07:42 -07:00
A. Unique TensorFlower
da8b29341c Don't try to patch pywrap_calibrator.so, as it was removed.
Follow-up to
c8ee776269

PiperOrigin-RevId: 627732799
2024-04-24 08:18:03 -07:00
A. Unique TensorFlower
f484927771 BUILD visibility change only
PiperOrigin-RevId: 627726838
2024-04-24 08:04:28 -07:00
Alan Kelly
75c816d7fd FC bias is per-channel quantized when per-channel quantization is used
PiperOrigin-RevId: 627725365
2024-04-24 07:52:24 -07:00
Quentin Khan
ff20b3c8c0 #shlo_ref Fix OSS.
- Use the correct absl::BitGen.
- Fix the definition of `std::common_type` for `shlo_ref::F16`.
- Fix building `shlo_ref::F16` from values that can be converted to `float` (both `F16` fixes are sketched generically below).
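
A generic C++ sketch of those two techniques, using a hypothetical `FakeF16` stand-in rather than the real `shlo_ref::F16`:

```c++
#include <type_traits>

// Hypothetical stand-in for shlo_ref::F16; it only illustrates the two
// generic techniques named above, not the real type's layout or semantics.
struct FakeF16 {
  // Allow construction from any value that is convertible to float.
  template <typename T,
            typename = std::enable_if_t<std::is_convertible_v<T, float>>>
  explicit FakeF16(T value) : value_(static_cast<float>(value)) {}
  float value_;  // a real half type would store 16 bits instead
};

// Program-defined std::common_type specializations so that mixed
// FakeF16/float expressions agree on float as the promotion type.
namespace std {
template <>
struct common_type<FakeF16, float> { using type = float; };
template <>
struct common_type<float, FakeF16> { using type = float; };
}  // namespace std

static_assert(std::is_same_v<std::common_type_t<FakeF16, float>, float>);

int main() {
  FakeF16 h(3);  // int -> float is a valid conversion, so this compiles
  (void)h;
  return 0;
}
```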

PiperOrigin-RevId: 627723091
2024-04-24 07:38:16 -07:00
A. Unique TensorFlower
cc32f277ca Integrate LLVM at llvm/llvm-project@b3ca9c30de
Updates LLVM usage to match
[b3ca9c30dedf](https://github.com/llvm/llvm-project/commit/b3ca9c30dedf)

PiperOrigin-RevId: 627709586
2024-04-24 06:38:54 -07:00
Adrian Kuegel
4cfb194769 Small code simplifications in hlo_instructions.cc (NFC)
PiperOrigin-RevId: 627700209
2024-04-24 06:03:01 -07:00
Sandeep Dasgupta
d070b7c9ab Tool to convert TF SavedModel to StableHLO
Here is the signature of the provided API:

```c++
// Converts a TensorFlow model (either from a SavedModel or an MLIR module) to a
// StableHLO MLIR module.
//
// Args:
//  input_path: The path to the input TensorFlow SavedModel or MLIR module.
//  context: The MLIR context to use for parsing or creating the MLIR module.
//  exported_model_signatures: A comma-separated list of exported model
//    signatures (functions) to convert.
//  tag_names: A comma-separated list of tag names used for loading SavedModel.
//  input_arg_shapes_str: A string representation of input argument shapes.
//    Shapes for different tensors are separated by ':', and dimension sizes for
//    the same tensor are separated by ','. For example,
//    'input-arg-shapes=1,2::1,?' expresses input arguments with shapes [1,2],
//    [] and [1,?].
//  is_input_mlir_module: If true, `input_path` is treated as an MLIR
//    module instead of a SavedModel.
//
// Returns:
//   An absl::StatusOr containing the converted StableHLO MLIR module on
//   success, or an absl::Status with an error message on failure.
absl::StatusOr<OwningOpRef<ModuleOp>> TfToStablehlo(
    absl::string_view input_path, MLIRContext* context,
    absl::string_view exported_model_signatures, absl::string_view tag_names,
    absl::string_view input_arg_shapes_str, bool is_input_mlir_module = false);
```
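
A hedged usage sketch for this API (the SavedModel path, signature, tag, and shape string below are placeholders; it assumes the declaration above plus the usual MLIR and Abseil headers):

```c++
#include "absl/status/status.h"
#include "mlir/IR/MLIRContext.h"

absl::Status ConvertSavedModelToStablehlo() {
  mlir::MLIRContext context;
  auto module_or = TfToStablehlo(
      /*input_path=*/"/tmp/my_saved_model", &context,
      /*exported_model_signatures=*/"serving_default",
      /*tag_names=*/"serve",
      /*input_arg_shapes_str=*/"1,224,224,3",
      /*is_input_mlir_module=*/false);
  if (!module_or.ok()) return module_or.status();
  (*module_or)->dump();  // Print the resulting StableHLO module to stderr.
  return absl::OkStatus();
}
```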
PiperOrigin-RevId: 627698652
2024-04-24 05:49:45 -07:00
A. Unique TensorFlower
1f1cba4df5 Automated Code Change
PiperOrigin-RevId: 627677804
2024-04-24 04:08:14 -07:00
A. Unique TensorFlower
908dadf58c Automated Code Change
PiperOrigin-RevId: 627674518
2024-04-24 03:51:56 -07:00
Lu Teng
1eb72c151a PR #11306: [XLA:GPU] Fix not unique name issue in sanitize constant pass
Imported from GitHub PR https://github.com/openxla/xla/pull/11306

Fix `Instruction name is not unique` error reported by JAX UT [`ShardMapTest.test_matmul_reduce_scatter`](https://github.com/google/jax/blob/jaxlib-v0.4.24/tests/shard_map_test.py#L149-L162) in XLA:GPU.

### Background
Error message:
```
!ContainsKey(instruction_names, instruction->name()) Instruction name is not unique: param_1
```
This error is reported after `PrepareHloModuleForIrEmitting()`. The non-unique name `param_1` is generated by 2 different passes, `GpuSanitizeConstantNames` and `FusionWrapper`. The related HLO changes are as follows:

1. Original HLO: there is no `param_1`, only `param_0`, in `async_computation`:
```ll
%main.19_spmd (param: s32[4,4], param.1: s32[4,8]) -> s32[2,8] {
  %param = s32[4,4]{1,0} parameter(0), sharding={devices=[2,2]<=[4]}, metadata={op_name="jit(fwd)/jit(main)/shard_map[mesh=Mesh(\'x\': 2, \'y\': 2) in_names=({0: (\'x\',), 1: (\'y\',)}, {0: (\'x\',)}) out_names=({0: (\'x\', \'y\')},) check_rep=True rewrite=True auto=frozenset()]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=161}
  %param.1 = s32[4,8]{1,0} parameter(1), sharding={devices=[2,1,2]<=[4] last_tile_dim_replicate}, metadata={op_name="jit(fwd)/jit(main)/shard_map[mesh=Mesh(\'x\': 2, \'y\': 2) in_names=({0: (\'x\',), 1: (\'y\',)}, {0: (\'x\',)}) out_names=({0: (\'x\', \'y\')},) check_rep=True rewrite=True auto=frozenset()]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=161}
  %dot.1 = s32[4,8]{1,0} dot(s32[4,4]{1,0} %param, s32[4,8]{1,0} %param.1), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=int32]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=158}
  %reduce-scatter-start = ((s32[4,8]{1,0}), s32[2,8]{1,0}) reduce-scatter-start(s32[4,8]{1,0} %dot.1), channel_id=1, replica_groups={{0,1},{2,3}}, use_global_device_ids=true, dimensions={0}, to_apply=%region_0.7, metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/reduce_scatter[axis_name=y scatter_dimension=0 axis_index_groups=None axis_size=2 tiled=True]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=159}, backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"collective_backend_config":{"is_sync":false,"no_parallel_custom_call":false}}
  ROOT %reduce-scatter-done = s32[2,8]{1,0} reduce-scatter-done(((s32[4,8]{1,0}), s32[2,8]{1,0}) %reduce-scatter-start), metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/reduce_scatter[axis_name=y scatter_dimension=0 axis_index_groups=None axis_size=2 tiled=True]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=159}
}
%async_computation (param_0: s32[4,8]) -> s32[2,8] {
  %param_0 = s32[4,8]{1,0} parameter(0)
  ROOT %reduce-scatter.2 = s32[2,8]{1,0} reduce-scatter(s32[4,8]{1,0} %param_0), channel_id=1, replica_groups={{0,1},{2,3}}, use_global_device_ids=true, dimensions={0}, to_apply=%region_0.7, metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/reduce_scatter[axis_name=y scatter_dimension=0 axis_index_groups=None axis_size=2 tiled=True]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=159}
}
```

2. `param_0` was changed to `param_1` after the `GpuSanitizeConstantNames` pass:
```ll
%main.19_spmd (param: s32[4,4], param.1: s32[4,8]) -> s32[2,8] {
  %param = s32[4,4]{1,0} parameter(0), sharding={devices=[2,2]<=[4]}, metadata={op_name="jit(fwd)/jit(main)/shard_map[mesh=Mesh(\'x\': 2, \'y\': 2) in_names=({0: (\'x\',), 1: (\'y\',)}, {0: (\'x\',)}) out_names=({0: (\'x\', \'y\')},) check_rep=True rewrite=True auto=frozenset()]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=161}
  %param.1 = s32[4,8]{1,0} parameter(1), sharding={devices=[2,1,2]<=[4] last_tile_dim_replicate}, metadata={op_name="jit(fwd)/jit(main)/shard_map[mesh=Mesh(\'x\': 2, \'y\': 2) in_names=({0: (\'x\',), 1: (\'y\',)}, {0: (\'x\',)}) out_names=({0: (\'x\', \'y\')},) check_rep=True rewrite=True auto=frozenset()]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=161}
  %dot.1 = s32[4,8]{1,0} dot(s32[4,4]{1,0} %param, s32[4,8]{1,0} %param.1), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=int32]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=158}
  %reduce-scatter-start = ((s32[4,8]{1,0}), s32[2,8]{1,0}) reduce-scatter-start(s32[4,8]{1,0} %dot.1), channel_id=1, replica_groups={{0,1},{2,3}}, use_global_device_ids=true, dimensions={0}, to_apply=%region_0.7, metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/reduce_scatter[axis_name=y scatter_dimension=0 axis_index_groups=None axis_size=2 tiled=True]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=159}, backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"collective_backend_config":{"is_sync":false,"no_parallel_custom_call":false}}
  ROOT %reduce-scatter-done = s32[2,8]{1,0} reduce-scatter-done(((s32[4,8]{1,0}), s32[2,8]{1,0}) %reduce-scatter-start), metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/reduce_scatter[axis_name=y scatter_dimension=0 axis_index_groups=None axis_size=2 tiled=True]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=159}
}
%async_computation (param_1: s32[4,8]) -> s32[2,8] {
  %param_1 = s32[4,8]{1,0} parameter(0)
  ROOT %reduce-scatter.2 = s32[2,8]{1,0} reduce-scatter(s32[4,8]{1,0} %param_1), channel_id=1, replica_groups={{0,1},{2,3}}, use_global_device_ids=true, dimensions={0}, to_apply=%region_0.7, metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/reduce_scatter[axis_name=y scatter_dimension=0 axis_index_groups=None axis_size=2 tiled=True]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=159}
}
```

3. Another `param_1` was generated after the `FusionWrapper` pass:
```ll
%async_computation (param_1: s32[4,8]) -> s32[2,8] {
  %param_1 = s32[4,8]{1,0} parameter(0)
  ROOT %reduce-scatter.2 = s32[2,8]{1,0} reduce-scatter(s32[4,8]{1,0} %param_1), channel_id=1, replica_groups={{0,2},{1,3},{4,6},{5,7}}, use_global_device_ids=true, dimensions={0}, to_apply=%region_0.7, metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/reduce_scatter[axis_name=y scatter_dimension=0 axis_index_groups=None axis_size=2 tiled=True]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=159}
}
%wrapped_dot_computation (param_0.1: s32[4,4], param_1: s32[4,8]) -> s32[4,8] {
  %param_0.1 = s32[4,4]{1,0} parameter(0)
  %param_1 = s32[4,8]{1,0} parameter(1)
  ROOT %dot.2 = s32[4,8]{1,0} dot(s32[4,4]{1,0} %param_0.1, s32[4,8]{1,0} %param_1), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=int32]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=158}
}

%main.19_spmd (param: s32[4,4], param.1: s32[4,8]) -> s32[2,8] {
  %param = s32[4,4]{1,0} parameter(0), sharding={devices=[2,2]<=[4]}, metadata={op_name="jit(fwd)/jit(main)/shard_map[mesh=Mesh(\'x\': 2, \'y\': 2) in_names=({0: (\'x\',), 1: (\'y\',)}, {0: (\'x\',)}) out_names=({0: (\'x\', \'y\')},) check_rep=True rewrite=True auto=frozenset()]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=161}
  %param.1 = s32[4,8]{1,0} parameter(1), sharding={devices=[2,1,2]<=[4] last_tile_dim_replicate}, metadata={op_name="jit(fwd)/jit(main)/shard_map[mesh=Mesh(\'x\': 2, \'y\': 2) in_names=({0: (\'x\',), 1: (\'y\',)}, {0: (\'x\',)}) out_names=({0: (\'x\', \'y\')},) check_rep=True rewrite=True auto=frozenset()]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=161}
  %dot.1 = s32[4,8]{1,0} dot(s32[4,4]{1,0} %param, s32[4,8]{1,0} %param.1), lhs_contracting_dims={1}, rhs_contracting_dims={0}, metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/dot_general[dimension_numbers=(((1,), (0,)), ((), ())) precision=None preferred_element_type=int32]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=158}
  %reduce-scatter-start = ((s32[4,8]{1,0}), s32[2,8]{1,0}) reduce-scatter-start(s32[4,8]{1,0} %dot.1), channel_id=1, replica_groups={{0,1},{2,3}}, use_global_device_ids=true, dimensions={0}, to_apply=%region_0.7, metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/reduce_scatter[axis_name=y scatter_dimension=0 axis_index_groups=None axis_size=2 tiled=True]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=159}, backend_config={"operation_queue_id":"0","wait_on_operation_queues":[],"collective_backend_config":{"is_sync":false,"no_parallel_custom_call":false}}
  ROOT %reduce-scatter-done = s32[2,8]{1,0} reduce-scatter-done(((s32[4,8]{1,0}), s32[2,8]{1,0}) %reduce-scatter-start), metadata={op_name="jit(fwd)/jit(main)/jit(shmap_body)/reduce_scatter[axis_name=y scatter_dimension=0 axis_index_groups=None axis_size=2 tiled=True]" source_file="/home/sdp/tenglu/intel-jax/tests/shard_map_test.py" source_line=159}
}
```

### Root cause
`GpuSanitizeConstantNames` runs before `FusionWrapper` and may change HLO instruction names via a [local name uniquer](https://github.com/openxla/xla/blob/main/xla/service/gpu/gpu_sanitize_constant_names.cc#L37). There are two issues here:

* The original HLO instruction names may be changed unexpectedly, even though [the pass does not intend to do so](https://github.com/openxla/xla/blob/main/xla/service/gpu/gpu_sanitize_constant_names.cc#L50-L51).
* The global name uniquer of the HLO module isn't aware of this change and may wrongly assign the same name to another HLO instruction in a later pass, e.g. `FusionWrapper`.

In this case, `param_0` is unexpectedly renamed to `param_1` by `GpuSanitizeConstantNames`.

### Solution
Only record HLO instruction names in the local name uniquer of `GpuSanitizeConstantNames` and do not change the original names. This exactly follows the pass's intended design.
Copybara import of the project:

--
c39bc256957db681cfe1706fb716986da89edf3a by Lu Teng <teng.lu@intel.com>:

Fix not unique name issue in sanitize constant pass.

Merging this change closes #11306

PiperOrigin-RevId: 627661920
2024-04-24 02:52:42 -07:00
A. Unique TensorFlower
5e65cdf3d0 Update GraphDef version to 1842.
PiperOrigin-RevId: 627652913
2024-04-24 02:24:51 -07:00
A. Unique TensorFlower
b2d1ffadd6 compat: Update forward compatibility horizon to 2024-04-24
PiperOrigin-RevId: 627652860
2024-04-24 02:11:11 -07:00
A. Unique TensorFlower
9fab96b5ab Automated Code Change
PiperOrigin-RevId: 627644782
2024-04-24 01:51:36 -07:00
Harsha H S
427a14ea52 PR #11313: Use CHECK-DAG to validate low and high padding mlir
Imported from GitHub PR https://github.com/openxla/xla/pull/11313

Copybara import of the project:

--
98ee3b456f62a9404bfaf976ab9a2d25415293cc by Harsha HS <harsha.havanurshamsundara@amd.com>:

Distinguish order of padding operation between CUDA and ROCm

--
6f318827a4f2f31938349e24d292966fc6b95b97 by Harsha HS <harsha.havanurshamsundara@amd.com>:

Use CHECK-DAG to check for low and high padding mlir

--
8a3da357c20a865dc91f090d6e600ade492c3e7f by Harsha HS <harsha.havanurshamsundara@amd.com>:

Add -no_gpu tag to the test, so that it is picked by CI

--
f13a4003aedb5a005c3b0668c474da30b46fe5e2 by Harsha HS <harsha.havanurshamsundara@amd.com>:

Revert "Add -no_gpu tag to the test, so that it is picked by CI"

This reverts commit 8a3da357c20a865dc91f090d6e600ade492c3e7f.

--
9fca062f688dca198e232b1cfcd04eba3a8d99d2 by Harsha HS <harsha.havanurshamsundara@amd.com>:

Remove tags as it is covered by CPU CI

Merging this change closes #11313

PiperOrigin-RevId: 627638965
2024-04-24 01:44:31 -07:00
A. Unique TensorFlower
09e58b5c60 Automated Code Change
PiperOrigin-RevId: 627637071
2024-04-24 01:30:24 -07:00
A. Unique TensorFlower
193db149b6 Automated Code Change
PiperOrigin-RevId: 627636880
2024-04-24 01:04:05 -07:00
A. Unique TensorFlower
c121a459d8 Automated Code Change
PiperOrigin-RevId: 627627484
2024-04-24 00:50:19 -07:00
Trevor Morris
46e1b15508 PR #11761: [GPU] Support nccl comm splitting in multiprocess mode
Imported from GitHub PR https://github.com/openxla/xla/pull/11761

Currently `--xla_gpu_enable_nccl_comm_splitting` is a no-op unless running in single-process mode. This PR allows it to work in multiprocess mode by removing the IsLocal check and fixing the key in the rank -> comm map, which was causing the following error:
```
7:  Communicator for rank 1 not found in a NCCL clique devices=[3,7]; stream=0
5: E0403 15:37:24.104311 3871038 pjrt_stream_executor_client.cc:2809] Execution of replica 0 failed: INTERNAL: Communicator for rank 1 not found in a NCCL clique devices=[1,5]; stream=0
6: E0403 15:37:24.104477 3870182 pjrt_stream_executor_client.cc:2809] Execution of replica 0 failed: INTERNAL: Communicator for rank 1 not found in a NCCL clique devices=[2,6]; stream=0
4: E0403 15:37:24.105872 3871021 pjrt_stream_executor_client.cc:2809] Execution of replica 0 failed: INTERNAL: Communicator for rank 1 not found in a NCCL clique devices=[0,4]; stream=0
```
Copybara import of the project:

--
c77ed08f77e252a5068a23d02555ffa765913d42 by Trevor Morris <tmorris@nvidia.com>:

Enable nccl comm split for multiprocess mode

Merging this change closes #11761

PiperOrigin-RevId: 627626547
2024-04-24 00:35:54 -07:00
A. Unique TensorFlower
7a93c411ee Automated Code Change
PiperOrigin-RevId: 627625947
2024-04-24 00:26:44 -07:00
Emilio Cota
e0c55cb113 multihost_hlo_runner: support host parameter buffers
PiperOrigin-RevId: 627625806
2024-04-24 00:20:18 -07:00
mmakevic-amd
a795106888 PR #11603: [ROCm] Configure pjrt gpu plugin for rocm
Imported from GitHub PR https://github.com/openxla/xla/pull/11603

Configure PJRT GPU plugin so it can be built for ROCm as well.
Copybara import of the project:

--
a1c8bcb4be41dc56899118d44bf604a2723a3c56 by mmakevic <Milica.Makevic@amd.com>:

Configure pjrt gpu plugin for rocm

--
9ca24357f52c53febb474c798c67d0b8dda586ee by mmakevic <Milica.Makevic@amd.com>:

Change platform name defining

Merging this change closes #11603

PiperOrigin-RevId: 627625764
2024-04-24 00:13:47 -07:00
A. Unique TensorFlower
521d646206 Automated Code Change
PiperOrigin-RevId: 627625513
2024-04-24 00:07:13 -07:00
A. Unique TensorFlower
5f858cf009 Automated Code Change
PiperOrigin-RevId: 627608517
2024-04-23 23:00:46 -07:00
A. Unique TensorFlower
0c135c2530 Automated Code Change
PiperOrigin-RevId: 627608157
2024-04-23 22:54:28 -07:00
TensorFlower Gardener
4a80864d46 Merge pull request #64520 from Intel-tensorflow:amin/xla-disable-remapper
PiperOrigin-RevId: 627598905
2024-04-23 21:52:53 -07:00
A. Unique TensorFlower
bdd3d08755 Update ops-related pbtxt files.
PiperOrigin-RevId: 627597893
2024-04-23 21:44:26 -07:00
Ce Zheng
75ef9c3f2b [XLA] Avoid crash in LatencyHidingScheduler when it fails to schedule a node and improve error message.
PiperOrigin-RevId: 627596827
2024-04-23 21:28:59 -07:00
Ziyin Huang
ff9a2058aa Add new sets of XlaSparseDenseMatmulWithStaticBufferSizeOps that take a static buffer size as an attribute.
PiperOrigin-RevId: 627594700
2024-04-23 21:08:35 -07:00
Zixuan Jiang
d9417b25e8 Support shape transpose in hlo_sharding_util::ReshapeSharding.
Before this CL, `hlo_sharding_util::ReshapeSharding` could handle the cases where source and target shapes can be transformed into each other by merging and splitting dimension sizes. It returned `std::nullopt` if a transpose was needed between source and target shapes.

This CL extracts the `gcd(source_sharding_tile_size, target_shape)` when `source_shape % source_sharding_tile_size == 0` in the major dimensions. We also skip a source dimension if its `source_sharding_tile_size` is 1. An example is shown below.
```
input_shape: [6, 2, 5]
output_shape: [4, 3, 5]
input_sharding: {devices=[2, 1, 5]<=[10]}
output_sharding: {devices=[2, 1, 5]<=[10]}
```
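
A toy sketch of the gcd rule applied to the major dimension of this example (illustrative only, not the actual `ReshapeSharding` implementation):

```c++
#include <iostream>
#include <numeric>  // std::gcd

int main() {
  // From the example above: the major source dim of size 6 is tiled 2 ways
  // and maps onto the major target dim of size 4.
  int source_dim = 6, source_tile = 2, target_dim = 4;
  if (source_tile > 1 && source_dim % source_tile == 0) {
    // gcd(2, 4) == 2, so the output keeps a 2-way tile on its major dim,
    // i.e. {devices=[2, 1, 5]<=[10]} for the output sharding.
    std::cout << "propagated tile: " << std::gcd(source_tile, target_dim)
              << "\n";
  }
  return 0;
}
```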
PiperOrigin-RevId: 627592738
2024-04-23 20:54:54 -07:00
Steven Toribio
7c5bc63acd Add support for Composite ops in the TFLite flatbuffer schema
PiperOrigin-RevId: 627588497
2024-04-23 20:40:50 -07:00
Thai Nguyen
c8ee776269 Remove the CalibrationSingleton
The Singleton is a legacy design that only works on CPU.

PiperOrigin-RevId: 627586362
2024-04-23 20:28:15 -07:00
A. Unique TensorFlower
5de67474fb Integrate LLVM at llvm/llvm-project@688c10d236
Updates LLVM usage to match
[688c10d23630](https://github.com/llvm/llvm-project/commit/688c10d23630)

PiperOrigin-RevId: 627579261
2024-04-23 20:21:34 -07:00
A. Unique TensorFlower
d7825e61e7 Update ops-related pbtxt files.
PiperOrigin-RevId: 627575911
2024-04-23 19:45:42 -07:00