Baolin Wang d2136d749d mm: support multi-size THP numa balancing
Anonymous page allocation already supports multi-size THP (mTHP), but NUMA
balancing still prohibits mTHP migration even when the mapping is exclusive,
which is unreasonable.

Allow scanning mTHP:
Commit 859d4adc34 ("mm: numa: do not trap faults on shared data section
pages") skips NUMA migration of shared CoW pages to avoid migrating shared
data segments. In addition, commit 80d47f5de5 ("mm: don't try to
NUMA-migrate COW pages that have other uses") switched to page_count()
to avoid migrating GUP pages, which also skips mTHP NUMA scanning.
Theoretically we can use folio_maybe_dma_pinned() to detect the GUP case;
although a GUP race remains, that issue appears to have been resolved by
commit 80d47f5de5. Meanwhile, use folio_likely_mapped_shared() to skip
shared CoW pages, even though it does not give a precise sharer count. To
check whether a folio is shared, ideally we would make sure every page is
mapped by the same process, but doing that seems expensive; using the
estimated mapcount works well when running the autonuma benchmark.

Allow migrating mTHP:
As mentioned in the previous thread [1], large folios (including THP) are
more susceptible to false sharing among threads than 4K base pages,
leading to pages ping-ponging back and forth during NUMA balancing, which
is currently not easy to resolve. Therefore, as a start for mTHP NUMA
balancing, we follow the PMD-mapped THP strategy: reuse the 2-stage filter
in should_numa_migrate_memory() to check whether the mTHP is heavily
contended among threads (by checking the CPU id and pid of the last
access), which avoids false sharing to some degree. Likewise, we restore
all PTE maps upon the first hint page fault of a large folio, again
following the PMD-mapped THP strategy. In the future, we can continue to
optimize the NUMA balancing algorithm to avoid the false sharing issue
with large folios as much as possible.

Performance data:
Machine environment: 2 nodes, 128-core Intel(R) Xeon(R) Platinum
Base: mm-unstable branch as of 2024-03-25
mTHP enabled while running autonuma-benchmark

mTHP:16K
			Base		Patched
numa01			224.70		143.48
numa01_THREAD_ALLOC	118.05		47.43
numa02			13.45		9.29
numa02_SMT		14.80		7.50

mTHP:64K
			Base		Patched
numa01			216.15		114.40
numa01_THREAD_ALLOC	115.35		47.41
numa02			13.24		9.25
numa02_SMT		14.67		7.34

mTHP:128K
			Base		Patched
numa01			205.13		144.45
numa01_THREAD_ALLOC	112.93		41.88
numa02			13.16		9.18
numa02_SMT		14.81		7.49

[1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/

[baolin.wang@linux.alibaba.com: v3]
  Link: https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:30 -07:00