linux/arch/x86/include/asm/tdx.h
Kai Huang 70060463cb x86/mce: Differentiate real hardware #MCs from TDX erratum ones
The first few generations of TDX hardware have an erratum.  Triggering
it in Linux requires some kind of kernel bug involving relatively exotic
memory writes to TDX private memory, and it manifests via
spurious-looking machine checks when reading the affected memory.

Make an effort to detect these TDX-induced machine checks and spit out
a new blurb to dmesg so folks do not think their hardware is failing.

== Background ==

Virtually all kernel memory access operations happen in full
cachelines.  In practice, writing a "byte" of memory usually reads a
64-byte cacheline of memory, modifies it, then writes the whole line back.
Those operations do not trigger this problem.

This problem is triggered by "partial" writes, where a write transaction
of less than a cacheline lands at the memory controller.  The CPU does
these via non-temporal write instructions (like MOVNTI), or through
UC/WC memory mappings.  The issue can also be triggered away from the
CPU by devices doing partial writes via DMA.
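
For illustration, a minimal sketch of the kind of non-temporal store
involved is shown below.  The helper name is made up for this example
and is not part of the kernel or of this patch:

  /*
   * One 8-byte MOVNTI store.  It bypasses the cache, so it reaches the
   * memory controller as a sub-cacheline ("partial") transaction.  If
   * 'dst' pointed into TDX private memory, this is the kind of access
   * that poisons the line on affected hardware.
   */
  static inline void movnti_write64(unsigned long *dst, unsigned long val)
  {
          asm volatile("movnti %1, %0" : "=m" (*dst) : "r" (val));
  }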

== Problem ==

A partial write to a TDX private memory cacheline will silently "poison"
the line.  Subsequent reads will consume the poison and generate a
machine check.  According to the TDX hardware spec, neither of these
things should have happened.

To add insult to injury, the Linux machine check code will present these
as a literal "Hardware Error" when they were, in fact, a software-triggered
issue.

== Solution ==

In the end, this issue is hard to trigger.  Rather than do something
rash (and incomplete) like unmap TDX private memory from the direct map,
improve the machine check handler.

Currently, the #MC handler doesn't distinguish whether the memory is
TDX private memory or not; it just dumps, for instance, the message
below:

 [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134
 [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0}
 	...
 [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] Kernel panic - not syncing: Fatal local machine check

Which says "Hardware Error" and "Data load in unrecoverable area of
kernel".

Ideally, the log would say "software bug around TDX private memory"
instead of "Hardware Error".  But in reality, real hardware memory
errors can also happen, and sadly such a software-triggered #MC cannot
be distinguished from a real hardware error.  Also, the error message
is parsed by the userspace tool 'mcelog', so changing the output may
break userspace.

So keep the "Hardware Error".  The "Data load in unrecoverable area of
kernel" is also helpful, so keep it too.

Instead of modifying the above error log, improve it by printing an
additional TDX-related message, making the log look like:

  ...
 [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
 [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug.

Adding this additional message requires determining whether the memory
page is TDX private memory.  There is no existing infrastructure to do
that.  Add an interface to query the TDX module to fill this gap.
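
As a rough sketch (not the literal hunk added by this patch), the #MC
panic path could consume that interface roughly as below, assuming the
existing HW_ERR prefix macro from the mce code; the helper name is
hypothetical:

  static void mce_print_tdx_hint(struct mce *m)
  {
          /* Non-NULL only when the #MC address is in TDX private memory. */
          const char *msg = tdx_dump_mce_info(m);

          if (msg)
                  pr_emerg(HW_ERR "Machine Check: %s\n", msg);
  }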

== Impact ==

This issue requires some kind of kernel bug to trigger.

TDX private memory should never be mapped UC/WC.  A partial write
originating from these mappings would require *two* bugs, first mapping
the wrong page, then writing the wrong memory.  It would also be
detectable using traditional memory corruption techniques like
DEBUG_PAGEALLOC.

MOVNTI (and friends) could cause this issue with something like a simple
buffer overrun or use-after-free on the direct map.  It should also be
detectable with normal debug techniques.

The one place where this might get nasty would be if the CPU read data
then wrote back the same data.  That would trigger this problem but
would not, for instance, set off mechanisms like slab redzoning because
it doesn't actually corrupt data.

With an IOMMU at least, the DMA exposure is similar to the UC/WC issue.
TDX private memory would first need to be incorrectly mapped into the
I/O space and then a later DMA to that mapping would actually cause the
poisoning event.

[ dhansen: changelog tweaks ]

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yuan Yao <yuan.yao@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com
2023-12-12 08:46:46 -08:00

/* SPDX-License-Identifier: GPL-2.0 */
/* Copyright (C) 2021-2022 Intel Corporation */
#ifndef _ASM_X86_TDX_H
#define _ASM_X86_TDX_H
#include <linux/init.h>
#include <linux/bits.h>
#include <asm/errno.h>
#include <asm/ptrace.h>
#include <asm/trapnr.h>
#include <asm/shared/tdx.h>
/*
 * SW-defined error codes.
 *
 * Bits 47:40 == 0xFF indicate Reserved status code class that is never
 * used by the TDX module.
 */
#define TDX_ERROR _BITUL(63)
#define TDX_SW_ERROR (TDX_ERROR | GENMASK_ULL(47, 40))
#define TDX_SEAMCALL_VMFAILINVALID (TDX_SW_ERROR | _UL(0xFFFF0000))
#define TDX_SEAMCALL_GP (TDX_SW_ERROR | X86_TRAP_GP)
#define TDX_SEAMCALL_UD (TDX_SW_ERROR | X86_TRAP_UD)
/*
 * TDX module SEAMCALL leaf function error codes
 */
#define TDX_SUCCESS 0ULL
#define TDX_RND_NO_ENTROPY 0x8000020300000000ULL
#ifndef __ASSEMBLY__
#include <uapi/asm/mce.h>
/*
 * Used by the #VE exception handler to gather the #VE exception
 * info from the TDX module. This is a software only structure
 * and not part of the TDX module/VMM ABI.
 */
struct ve_info {
	u64 exit_reason;
	u64 exit_qual;
	/* Guest Linear (virtual) Address */
	u64 gla;
	/* Guest Physical Address */
	u64 gpa;
	u32 instr_len;
	u32 instr_info;
};
#ifdef CONFIG_INTEL_TDX_GUEST
void __init tdx_early_init(void);
void tdx_get_ve_info(struct ve_info *ve);
bool tdx_handle_virt_exception(struct pt_regs *regs, struct ve_info *ve);
void tdx_safe_halt(void);
bool tdx_early_handle_ve(struct pt_regs *regs);
int tdx_mcall_get_report0(u8 *reportdata, u8 *tdreport);
u64 tdx_hcall_get_quote(u8 *buf, size_t size);
#else
static inline void tdx_early_init(void) { };
static inline void tdx_safe_halt(void) { };
static inline bool tdx_early_handle_ve(struct pt_regs *regs) { return false; }
#endif /* CONFIG_INTEL_TDX_GUEST */
#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_INTEL_TDX_GUEST)
long tdx_kvm_hypercall(unsigned int nr, unsigned long p1, unsigned long p2,
		       unsigned long p3, unsigned long p4);
#else
static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
				     unsigned long p2, unsigned long p3,
				     unsigned long p4)
{
	return -ENODEV;
}
#endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
#ifdef CONFIG_INTEL_TDX_HOST
u64 __seamcall(u64 fn, struct tdx_module_args *args);
u64 __seamcall_ret(u64 fn, struct tdx_module_args *args);
u64 __seamcall_saved_ret(u64 fn, struct tdx_module_args *args);
void tdx_init(void);
#include <asm/archrandom.h>
typedef u64 (*sc_func_t)(u64 fn, struct tdx_module_args *args);
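/*
 * SEAMCALLs can fail with TDX_RND_NO_ENTROPY when the TDX module runs
 * out of entropy.  Retry such calls a bounded number of times
 * (RDRAND_RETRY_LOOPS) before giving up.
 */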
static inline u64 sc_retry(sc_func_t func, u64 fn,
			   struct tdx_module_args *args)
{
	int retry = RDRAND_RETRY_LOOPS;
	u64 ret;

	do {
		ret = func(fn, args);
	} while (ret == TDX_RND_NO_ENTROPY && --retry);

	return ret;
}
#define seamcall(_fn, _args) sc_retry(__seamcall, (_fn), (_args))
#define seamcall_ret(_fn, _args) sc_retry(__seamcall_ret, (_fn), (_args))
#define seamcall_saved_ret(_fn, _args) sc_retry(__seamcall_saved_ret, (_fn), (_args))
int tdx_cpu_enable(void);
int tdx_enable(void);
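/*
 * Used by the #MC handler: returns a message describing a machine check
 * that hit TDX private memory (a possible kernel bug rather than a real
 * hardware error), or NULL if the error is unrelated to TDX private memory.
 */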
const char *tdx_dump_mce_info(struct mce *m);
#else
static inline void tdx_init(void) { }
static inline int tdx_cpu_enable(void) { return -ENODEV; }
static inline int tdx_enable(void) { return -ENODEV; }
static inline const char *tdx_dump_mce_info(struct mce *m) { return NULL; }
#endif /* CONFIG_INTEL_TDX_HOST */
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_X86_TDX_H */