Linux kernel source tree
Go to file
Yu Zhao b1f202060a mm: remap unused subpages to shared zeropage when splitting isolated thp
Patch series "mm: split underused THPs", v5.

The current upstream default policy for THP is always.  However, Meta uses
madvise in production as the current THP=always policy vastly
overprovisions THPs in sparsely accessed memory areas, resulting in
excessive memory pressure and premature OOM killing.  Using madvise +
relying on khugepaged has certain drawbacks over THP=always.  Using
madvise hints mean THPs aren't "transparent" and require userspace
changes.  Waiting for khugepaged to scan memory and collapse pages into
THP can be slow and unpredictable in terms of performance (i.e.  you dont
know when the collapse will happen), while production environments require
predictable performance.  If there is enough memory available, its better
for both performance and predictability to have a THP from fault time,
i.e.  THP=always rather than wait for khugepaged to collapse it, and deal
with sparsely populated THPs when the system is running out of memory.

This patch series is an attempt to mitigate the issue of running out of
memory when THP is always enabled.  During runtime whenever a THP is being
faulted in or collapsed by khugepaged, the THP is added to a list. 
Whenever memory reclaim happens, the kernel runs the deferred_split
shrinker which goes through the list and checks if the THP was underused,
i.e.  how many of the base 4K pages of the entire THP were zero-filled. 
If this number goes above a certain threshold, the shrinker will attempt
to split that THP.  Then at remap time, the pages that were zero-filled
are mapped to the shared zeropage, hence saving memory.  This method
avoids the downside of wasting memory in areas where THP is sparsely
filled when THP is always enabled, while still providing the upside THPs
like reduced TLB misses without having to use madvise.

Meta production workloads that were CPU bound (>99% CPU utilzation) were
tested with THP shrinker.  The results after 2 hours are as follows:

                            | THP=madvise |  THP=always   | THP=always
                            |             |               | + shrinker series
                            |             |               | + max_ptes_none=409
-----------------------------------------------------------------------------
Performance improvement     |      -      |    +1.8%      |     +1.7%
(over THP=madvise)          |             |               |
-----------------------------------------------------------------------------
Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
-----------------------------------------------------------------------------
max_ptes_none=409 means that any THP that has more than 409 out of 512
(80%) zero filled filled pages will be split.

To test out the patches, the below commands without the shrinker will
invoke OOM killer immediately and kill stress, but will not fail with the
shrinker:

echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke OOM
# killer.
# Without the shrinker, OOM killer is invoked immediately irrespective
# of max_ptes_none value and kills stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K


This patch (of 5):

Here being unused means containing only zeros and inaccessible to
userspace.  When splitting an isolated thp under reclaim or migration, the
unused subpages can be mapped to the shared zeropage, hence saving memory.
This is particularly helpful when the internal fragmentation of a thp is
high, i.e.  it has many untouched subpages.

This is also a prerequisite for THP low utilization shrinker which will be
introduced in later patches, where underutilized THPs are split, and the
zero-filled pages are freed saving memory.

Link: https://lkml.kernel.org/r/20240830100438.3623486-1-usamaarif642@gmail.com
Link: https://lkml.kernel.org/r/20240830100438.3623486-3-usamaarif642@gmail.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Tested-by: Shuang Zhai <zhais@google.com>
Cc: Alexander Zhu <alexlzhu@fb.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Shuang Zhai <szhai2@cs.rochester.edu>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:03 -07:00
arch x86: remove PG_uncached 2024-09-03 21:15:46 -07:00
block block: fix detection of unsupported WRITE SAME in blkdev_issue_write_zeroes 2024-08-28 08:49:25 -06:00
certs
crypto
Documentation Docs/damon/maintainer-profile: document Google calendar for bi-weekly meetups 2024-09-09 16:39:01 -07:00
drivers vduse: avoid using __GFP_NOFAIL 2024-09-09 16:39:02 -07:00
fs x86: remove PG_uncached 2024-09-03 21:15:46 -07:00
include mm: remap unused subpages to shared zeropage when splitting isolated thp 2024-09-09 16:39:03 -07:00
init mm: fix typo in Kconfig 2024-09-01 20:25:45 -07:00
io_uring io_uring/kbuf: return correct iovec count from classic buffer peek 2024-08-30 10:45:54 -06:00
ipc ipc/shm, mm: drop do_vma_munmap() 2024-09-03 21:15:52 -07:00
kernel sched/numa: Fix the vma scan starving issue 2024-09-03 21:15:56 -07:00
lib maple_tree: dump error message based on format 2024-09-09 16:39:01 -07:00
LICENSES
mm mm: remap unused subpages to shared zeropage when splitting isolated thp 2024-09-09 16:39:03 -07:00
net Including fixes from bluetooth, wireless and netfilter. 2024-08-30 06:14:39 +12:00
rust Rust fixes for v6.11 2024-08-16 11:24:06 -07:00
samples kmemleak-test: add percpu leak 2024-09-01 20:25:50 -07:00
scripts kasan: simplify and clarify Makefile 2024-09-03 21:15:35 -07:00
security Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging 2024-09-01 09:18:48 +12:00
sound sound fixes for 6.11-rc6 2024-08-28 06:24:22 +12:00
tools selftests/damon: add execute permissions to test scripts 2024-09-03 21:15:57 -07:00
usr
virt KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX) 2024-08-14 12:28:24 -04:00
.clang-format
.cocciconfig
.editorconfig
.get_maintainer.ignore
.gitattributes
.gitignore kbuild: add script and target to generate pacman package 2024-07-22 01:24:22 +09:00
.mailmap ARM: SoC fixes for 6.11, part 2 2024-09-01 06:42:13 +12:00
.rustfmt.toml
COPYING
CREDITS tracing: Update of MAINTAINERS and CREDITS file 2024-07-18 14:08:42 -07:00
Kbuild
Kconfig
MAINTAINERS tools: add skeleton code for userland testing of VMA logic 2024-09-01 20:25:55 -07:00
Makefile Linux 6.11-rc6 2024-09-01 19:46:02 +12:00
README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the reStructuredText markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.