Imported from GitHub PR https://github.com/openxla/xla/pull/13879
This prevents us from accidentally loading a second copy of the HIP runtime in local_config_rocm. Do the same for rocblas to guard against the ABI break in ROCm 6.0.
Merging this change closes #13879
PiperOrigin-RevId: 651388560
`DetectUnusedVariables` can be expensive, but often we don't have symbols in the indexing map at all, so there is nothing to remove.
PiperOrigin-RevId: 651385393
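The early exit motivated by the `DetectUnusedVariables` change above can be sketched as follows. This is a minimal illustration, not XLA's implementation: `ToyIndexingMap`, its fields, and `unused_symbols` are hypothetical names, and each result expression is modeled only by the set of variable names it mentions.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndexingMap:
    """Hypothetical stand-in for an indexing map (not the XLA class)."""

    num_dims: int
    num_symbols: int
    # Each result is modeled by the set of variable names it uses, e.g. {"d0", "s1"}.
    results: list = field(default_factory=list)
    detect_calls: int = 0  # counts how often the expensive scan actually ran

    def _detect_unused_symbols(self):
        """The expensive part: scan every result for the symbols it uses."""
        self.detect_calls += 1
        used = set().union(*self.results)
        return [i for i in range(self.num_symbols) if f"s{i}" not in used]

    def unused_symbols(self):
        """Return indices of unused symbols, skipping the scan if there are none."""
        if self.num_symbols == 0:
            # No symbols at all, so there is nothing to remove.
            return []
        return self._detect_unused_symbols()


no_symbols = ToyIndexingMap(num_dims=2, num_symbols=0, results=[{"d0"}, {"d1"}])
with_symbols = ToyIndexingMap(num_dims=1, num_symbols=2, results=[{"d0", "s1"}])
```

With these toy maps, `no_symbols.unused_symbols()` returns immediately without scanning any result, while `with_symbols.unused_symbols()` runs the scan once and reports that `s0` is unused.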
Imported from GitHub PR https://github.com/openxla/xla/pull/13479
In this PR we enable some new rocm-6.2 features, mainly **hipGetFuncBySymbol**, which was previously missing in rocm_runtime, so we had to use a workaround. This affects only ROCm-specific files.
@xla-rotation: could you have a look, please?
Copybara import of the project:
--
bcd2b2341887d305583161a592c23750f5ee584c by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
adding new ROCM-6.2 features
--
3eb5aa9c69e8905d9f408ab2a141130084670d29 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
solving conflicts after rebase
--
09938d6c6b9a358e6571fb13acdc1623fe205a4e by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
added blas get_version test
--
215b92dddd440f5bacee8d7f678e1f5138761e00 by Pavel Emeliyanenko <pavel.emeliyanenko@amd.com>:
added runtime_version from DeviceDescription
Merging this change closes #13479
PiperOrigin-RevId: 651378287
We already convert the Triton GPU dialect to NVVM in TritonGPUTOLLVMPass, but since we need to lower SparseDot afterwards and we generate a gpu.thread_id in that lowering, add a pattern that also converts it to NVVM.
PiperOrigin-RevId: 651369703
Imported from GitHub PR https://github.com/openxla/xla/pull/14796
Updated result type and error thresholds for the SelectsSplitK test.
Previously this failed on Hopper.
Copybara import of the project:
--
5005f288b67a2a34ec643cfcc3fbae815b5f0ef6 by Sergey Kozub <skozub@nvidia.com>:
Fix gemm_fusion_autotuner_test on Hopper
Merging this change closes #14796
PiperOrigin-RevId: 651359673
Currently we compute an indexing map from 1-d block_id to N-d tile offset for each TiledHloInstruction. We use that indexing map to deduplicate identical tiles. To get the map we compute delinearization of block_id in SymbolicTileAnalysis.
Composition and simplification of `block_id_to_tile_offsets_indexing` is actually very computationally intensive, because the expression has a lot of mods and floordivs from delinearization. This is not necessary for our purposes.
After this change, `TiledHloComputation` will have an N-d to M-d map from N-d tile indexing into the M-d tile offsets of the instruction. This way, expressions in the map are much smaller and easier to simplify (see changes in symbolic_tile_analysis_test).
This change has the additional benefit that we don't enforce a 1-d launch grid at an early stage.
PiperOrigin-RevId: 651344451
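To illustrate why the delinearized map in the change above is harder to simplify, here is a minimal sketch in plain Python. It is not XLA code: `delinearize`, `tile_offsets`, the row-major layout, and the choice of one tile dimension per offset dimension (N equal to M) are assumptions made for illustration.

```python
def delinearize(block_id, grid_shape):
    """Convert a 1-d block_id into an N-d tile index (row-major order).

    Every dimension contributes one mod and one floordiv, which is where the
    large expressions come from when this is composed symbolically.
    """
    index = []
    for size in reversed(grid_shape):
        index.append(block_id % size)
        block_id //= size
    return tuple(reversed(index))


def tile_offsets(tile_index, tile_sizes):
    """Map an N-d tile index to tile offsets: one multiplication per dimension."""
    return tuple(i * t for i, t in zip(tile_index, tile_sizes))


grid_shape = (2, 3)    # hypothetical number of tiles per dimension
tile_sizes = (16, 32)  # hypothetical tile size per dimension

# Old style: start from a 1-d block_id and delinearize first.
offsets_from_block_id = tile_offsets(delinearize(4, grid_shape), tile_sizes)

# New style: start directly from the N-d tile index, no mods or floordivs.
offsets_from_nd_index = tile_offsets((1, 1), tile_sizes)
```

Both paths yield the same offsets for the same tile, but only the first one needs the floordiv and mod chain.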
Imported from GitHub PR https://github.com/openxla/xla/pull/14792
The related ROCm part of the change is missing, and the internal CL was merged without a check; this fixes the build broken by c40dbf2b3c.
@xla-rotation @gflegar @beckerhe
Thanks in advance!
Copybara import of the project:
--
0f4236ca8a3767666ce03713fd7ae9e4d1254e5c by Chao Chen <cchen104@amd.com>:
fixed build due to c40dbf2b3c
Merging this change closes #14792
PiperOrigin-RevId: 651333429
- minimize storage uniquer invocations
- don't allocate std::functions
- don't put symbol and dims ranges in dense map in RangeEvaluator,
also don't put them in a vector first.
After this, the biggest thing left to do is to remove the MLIR simplifier,
which is now responsible for 2/3 or so of the runtime of simplify.
PiperOrigin-RevId: 651330275
This is preparing the CUDA backend for linking and compiling with libnvjitlink. The plan is to replace ptxas and nvlink command line tools eventually.
So far, this change only adds the function `CompileAndLinkUsingLibNvJitLink`; it is not yet used outside of the corresponding unit tests.
PiperOrigin-RevId: 651319016
Currently, we rerun the simplifier for all results, even when only one changes.
Also, we rerun our simplifier in the last round (when the upstream simplifier
does not find any more changes), but it's not necessary, since Simplify is
idempotent.
PiperOrigin-RevId: 651317828
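The two savings described in the simplifier change above (rerunning only for results that changed, and skipping the final rerun) can be sketched with a toy fixpoint loop. This is an illustration under stated assumptions, not the XLA code: `simplify_results`, `ours`, and `upstream` are hypothetical names, expressions are plain strings, and `ours` is assumed to be idempotent.

```python
def simplify_results(results, ours, upstream):
    """Alternate two simplifiers until `upstream` finds nothing left to change.

    `ours` is rerun only on results that `upstream` changed in the previous
    round, and not at all once `upstream` reports no change, which is safe
    because `ours` is idempotent. Returns the results and the number of
    calls made to `ours`.
    """
    results = list(results)
    calls_to_ours = 0
    dirty = set(range(len(results)))  # in the first round every result is new
    while dirty:
        changed = set()
        for i in dirty:
            results[i] = ours(results[i])
            calls_to_ours += 1
            new_expr = upstream(results[i])
            if new_expr != results[i]:
                results[i] = new_expr
                changed.add(i)
        dirty = changed
    return results, calls_to_ours


def collapse_spaces(expr):
    """Toy idempotent simplifier: normalize whitespace."""
    return " ".join(expr.split())


def drop_one_zero_term(expr):
    """Toy upstream simplifier: remove a single trailing ' + 0' term per call."""
    return expr[:-4] if expr.endswith(" + 0") else expr


simplified, calls = simplify_results(
    ["a  +  0 + 0", "b"], collapse_spaces, drop_one_zero_term
)
```

In this toy run the second result never changes, so `collapse_spaces` is called on it only once, and no extra pass happens after `drop_one_zero_term` stops finding changes.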
Imported from GitHub PR https://github.com/openxla/xla/pull/14725
This PR lowers FusedMHABackwardThunk into a command buffer; the command buffer lowering knob is DebugOptions::CUDNN.
Copybara import of the project:
--
ff9156f57569cb5e88a4671a110365e79c9f857f by Shawn Wang <shawnw@nvidia.com>:
support lowering fusedMHABackward to command buffer
--
83ddf0cbadf5f0f9513c67e7bbdd7ecea4f3404c by Shawn Wang <shawnw@nvidia.com>:
fix rebase conflicts
--
9dd82651d0434beab21bed16ab2edea06611f8a0 by Shawn Wang <shawnw@nvidia.com>:
remove duplicated inclusion
Merging this change closes #14725
PiperOrigin-RevId: 651306567
Currently, triton_test_util depends on ir_emitter_triton unconditionally, but
ir_emitter_triton only gives access to the ir_emitter_triton.h header in builds
with a GPU configured.
We can make the ir_emitter_triton.h header available in all builds if we add a
stub implementation that returns errors.
PiperOrigin-RevId: 651303559