From 035c5e2143f3edceeede1e99ff9cf8979c548dd5 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Thu, 31 Oct 2024 10:49:46 +0200 Subject: [PATCH 1/2] x86/mm/doc: Add missing details in virtual memory layout Improve memory layout documentation: - Document 4kB guard hole at the end of userspace. See TASK_SIZE_MAX definition. - Divide the description of the non-canonical hole into two parts: userspace and kernel sides. - Mention the effect of LAM on the non-canonical range. Signed-off-by: Kirill A. Shutemov Signed-off-by: Dave Hansen Link: https://lore.kernel.org/all/20241031084946.2243440-1-kirill.shutemov%40linux.intel.com --- Documentation/arch/x86/x86_64/mm.rst | 35 +++++++++++++++++++++++----- 1 file changed, 29 insertions(+), 6 deletions(-) diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst index 35e5e18c83d0..f2db178b353f 100644 --- a/Documentation/arch/x86/x86_64/mm.rst +++ b/Documentation/arch/x86/x86_64/mm.rst @@ -29,15 +29,27 @@ Complete virtual memory map with 4-level page tables Start addr | Offset | End addr | Size | VM area description ======================================================================================================================== | | | | - 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm + 0000000000000000 | 0 | 00007fffffffefff | ~128 TB | user-space virtual memory, different per mm + 00007ffffffff000 | ~128 TB | 00007fffffffffff | 4 kB | ... guard hole __________________|____________|__________________|_________|___________________________________________________________ | | | | - 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical - | | | | virtual memory addresses up to the -128 TB + 0000800000000000 | +128 TB | 7fffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -8 EB | | | | starting offset of kernel mappings. + | | | | + | | | | LAM relaxes canonicallity check allowing to create aliases + | | | | for userspace memory here. __________________|____________|__________________|_________|___________________________________________________________ | | Kernel-space virtual memory, shared between all processes: + __________________|____________|__________________|_________|___________________________________________________________ + | | | | + 8000000000000000 | -8 EB | ffff7fffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -128 TB + | | | | starting offset of kernel mappings. + | | | | + | | | | LAM_SUP relaxes canonicallity check allowing to create + | | | | aliases for kernel memory here. ____________________________________________________________|___________________________________________________________ | | | | ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor @@ -88,15 +100,26 @@ Complete virtual memory map with 5-level page tables Start addr | Offset | End addr | Size | VM area description ======================================================================================================================== | | | | - 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm + 0000000000000000 | 0 | 00fffffffffff000 | ~64 PB | user-space virtual memory, different per mm + 00fffffffffff000 | ~64 PB | 00ffffffffffffff | 4 kB | ... guard hole __________________|____________|__________________|_________|___________________________________________________________ | | | | - 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical - | | | | virtual memory addresses up to the -64 PB + 0100000000000000 | +64 PB | 7fffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -8EB TB | | | | starting offset of kernel mappings. + | | | | + | | | | LAM relaxes canonicallity check allowing to create aliases + | | | | for userspace memory here. __________________|____________|__________________|_________|___________________________________________________________ | | Kernel-space virtual memory, shared between all processes: + ____________________________________________________________|___________________________________________________________ + 8000000000000000 | -8 EB | feffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -64 PB + | | | | starting offset of kernel mappings. + | | | | + | | | | LAM_SUP relaxes canonicallity check allowing to create + | | | | aliases for kernel memory here. ____________________________________________________________|___________________________________________________________ | | | | ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor From 7e33001b8b9a78062679e0fdf5b0842a49063135 Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Fri, 8 Nov 2024 19:27:50 -0500 Subject: [PATCH 2/2] x86/mm/tlb: Put cpumask_test_cpu() check in switch_mm_irqs_off() under CONFIG_DEBUG_VM On a web server workload, the cpumask_test_cpu() inside the WARN_ON_ONCE() in the 'prev == next branch' takes about 17% of all the CPU time of switch_mm_irqs_off(). On a large fleet, this WARN_ON_ONCE() has not fired in at least a month, possibly never. Move this test under CONFIG_DEBUG_VM so it does not get compiled in production kernels. Signed-off-by: Rik van Riel Signed-off-by: Ingo Molnar Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Linus Torvalds Link: https://lore.kernel.org/r/20241109003727.3958374-4-riel@surriel.com --- arch/x86/mm/tlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 86593d1b787d..b0d5a644fc84 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -568,7 +568,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next, * mm_cpumask. The TLB shootdown code can figure out from * cpu_tlbstate_shared.is_lazy whether or not to send an IPI. */ - if (WARN_ON_ONCE(prev != &init_mm && + if (IS_ENABLED(CONFIG_DEBUG_VM) && WARN_ON_ONCE(prev != &init_mm && !cpumask_test_cpu(cpu, mm_cpumask(next)))) cpumask_set_cpu(cpu, mm_cpumask(next));