Using TensorFlow Securely
This document discusses the TensorFlow security model. It describes the security risks to consider when using models, checkpoints or input data for training or serving. It also provides guidelines on what constitutes a vulnerability in TensorFlow and how to report one.
This document applies to other repositories in the TensorFlow organization, covering security practices for the entirety of the TensorFlow ecosystem.
TensorFlow models are programs
TensorFlow models (to use a term commonly used by machine learning practitioners) are expressed as programs that TensorFlow executes. TensorFlow programs are encoded as computation graphs. Since models are practically programs that TensorFlow executes, using untrusted models or graphs is equivalent to running untrusted code.
If you need to run untrusted models, execute them inside a sandbox. Memory corruptions in TensorFlow ops can be recognized as security issues only if they are reachable and exploitable through production-grade, benign models.
Compilation
Compiling models via the recommended entry points described in the XLA and JAX documentation should be safe. However, some of the testing and debugging tools that come with the compiler are not designed for untrusted data and should be used with caution when working with untrusted models.
Saved graphs and checkpoints
When loading untrusted serialized computation graphs (in the form of a GraphDef, SavedModel, or equivalent on-disk format), the set of computation primitives available to TensorFlow is powerful enough that you should assume that the TensorFlow process effectively executes arbitrary code.
The risk of loading untrusted checkpoints depends on the code or graph that you are working with. When loading untrusted checkpoints, the values of the traced variables from your model are also going to be untrusted. That means that if your code interacts with the filesystem, network, etc. and uses checkpointed variables as part of those interactions (for example, using a string variable to build a filesystem path), a maliciously created checkpoint might be able to change the targets of those operations, which could result in arbitrary reads, writes, or code execution.
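As a minimal sketch of one way to narrow the filesystem-path risk described above, the helper below resolves an untrusted string against a fixed base directory and rejects anything that escapes it. The name `resolve_inside` and the example directory are hypothetical and not part of TensorFlow; the check uses only the Python standard library.

```python
from pathlib import Path


def resolve_inside(base_dir, untrusted_name):
    """Return base_dir/untrusted_name, or raise if it escapes base_dir.

    Hypothetical helper: untrusted_name stands in for a string value
    restored from a checkpoint you do not trust.
    """
    base = Path(base_dir).resolve()
    # Joining with an absolute path discards base, and ".." segments can
    # climb out of it, so resolve first and compare afterwards.
    candidate = (base / untrusted_name).resolve()
    try:
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes {base}: {untrusted_name!r}") from None
    return candidate
```

A check like this only addresses one class of misuse. It does not make an untrusted checkpoint safe in general.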
Running a TensorFlow server
TensorFlow is a platform for distributed computing, and as such there is a TensorFlow server (tf.train.Server). The TensorFlow server is intended for internal communication only. It is not built for use in untrusted environments or networks.
For performance reasons, the default TensorFlow server does not include any authorization protocol and sends messages unencrypted. It accepts connections from anywhere, and executes the graphs it is sent without performing any checks. Therefore, if you run a tf.train.Server in your network, anybody with access to the network can execute arbitrary code with the privileges of the user running the tf.train.Server.
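One possible pre-flight check, sketched below, is to refuse to start when a cluster definition names an address that is not a loopback or private IP. The helper names are hypothetical, and the `{job: [addresses]}` dictionary shape is assumed to mirror what you would pass to `tf.train.ClusterSpec`. This check does not add authentication or encryption, and it does not change which connections the server accepts, so network-level controls are still required.

```python
import ipaddress


def is_internal_address(address):
    """True if a "host:port" string names a loopback or private IP.

    Hostnames are rejected instead of resolved, so the check never
    performs DNS lookups.
    """
    host, _, _port = address.rpartition(":")
    host = host.strip("[]")  # allow bracketed IPv6 such as "[::1]:2222"
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False
    return ip.is_loopback or ip.is_private


def check_cluster(cluster):
    """Raise if any address in a {job: [addresses]} dict is not internal."""
    bad = [
        address
        for addresses in cluster.values()
        for address in addresses
        if not is_internal_address(address)
    ]
    if bad:
        raise ValueError(f"non-internal addresses: {bad}")
```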
Untrusted inputs during training and prediction
TensorFlow supports a wide range of input data formats. For example, it can process images, audio, videos, and text. There are several modules specialized in taking those formats, modifying them, and/or converting them to intermediate formats that can be processed by TensorFlow.
These modifications and conversions are handled by a variety of libraries that have different security properties and provide different levels of confidence when dealing with untrusted data. Based on the security history of these libraries, we consider it safe to work with untrusted inputs in the PNG, BMP, GIF, WAV, RAW, RAW_PADDED, CSV and PROTO formats. All other input formats, including those handled by tensorflow-io, should be sandboxed if used to process untrusted data.
For example, if an attacker were to upload a malicious video file, they could potentially exploit a vulnerability in the TensorFlow code that handles videos, which could allow them to execute arbitrary code on the system running TensorFlow.
It is important to keep TensorFlow up to date with the latest security patches and follow the sandboxing guideline above to protect against these types of vulnerabilities.
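The format guidance in this section can be expressed as a small routing check. The sketch below is illustrative only: `SAFE_INPUT_FORMATS` and `needs_sandbox` are hypothetical names, and how a format label is obtained for a given input (and whether that label can be trusted) is left as an assumption.

```python
# Formats this section lists as safe for untrusted input.
SAFE_INPUT_FORMATS = frozenset(
    {"PNG", "BMP", "GIF", "WAV", "RAW", "RAW_PADDED", "CSV", "PROTO"}
)


def needs_sandbox(input_format, trusted_source):
    """Return True if decoding this input should happen inside a sandbox."""
    if trusted_source:
        return False
    # Untrusted data in any format outside the allowlist gets sandboxed.
    return input_format.strip().upper() not in SAFE_INPUT_FORMATS
```

For instance, `needs_sandbox("mp4", trusted_source=False)` is true, while the same call with `"png"` is false.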
Security properties of execution modes
TensorFlow has several execution modes, with Eager mode being the default in v2. Eager mode lets users write imperative-style statements that can be easily inspected and debugged, and it is intended to be used during the development phase.
As part of the differences that make Eager mode easier to debug, the shape inference functions are skipped, and any checks implemented inside the shape inference code are not executed.
The security impact of skipping those checks should be low, since the attack scenario would require a malicious user to be able to control the model, which, as stated above, is already equivalent to code execution. In any case, the recommendation is not to serve models using Eager mode, since it also has performance limitations.
Multi-Tenant environments
It is possible to run multiple TensorFlow models in parallel. For example, ModelServer collates all computation graphs exposed to it (from multiple SavedModel) and executes them in parallel on available executors. Running TensorFlow in a multitenant design mixes the risks described above with the inherent ones from multitenant configurations. The primary areas of concern are tenant isolation, resource allocation, model sharing and hardware attacks.
Tenant isolation
Since any tenants or users providing models, graphs or checkpoints can execute code in the context of the TensorFlow service, it is important to design isolation mechanisms that prevent unwanted access to the data from other tenants.
Network isolation between different models is also important, not only to prevent unauthorized access to data or models, but also to prevent malicious users or tenants from sending graphs to execute under another tenant’s identity.
The isolation mechanisms are the responsibility of the users to design and implement, and therefore security issues deriving from their absence are not considered a vulnerability in TensorFlow.
Resource allocation
A denial of service caused by one model could bring down the entire server, but we don't consider this as a vulnerability, given that models can exhaust resources in many different ways and solutions exist to prevent this from happening (e.g., rate limits, ACLs, monitors to restart broken servers).
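Alongside the mitigations listed above, one option is to run each model workload in a child process with operating-system resource limits. This is a sketch under stated assumptions: it is Unix-only (it relies on the standard `resource` module), `run_limited` is a hypothetical helper, and the default limits are arbitrary. It caps CPU time and address space for the child. It is not a sandbox and does not isolate tenants from each other.

```python
import resource
import subprocess
import sys


def run_limited(argv, cpu_seconds=5, memory_bytes=1 << 30, wall_seconds=30):
    """Run argv in a child process with CPU-time and address-space caps."""

    def apply_limits():
        # Runs in the child just before exec. Never ask for more than the
        # current hard limit, which an unprivileged process cannot raise.
        for kind, wanted in (
            (resource.RLIMIT_CPU, cpu_seconds),
            (resource.RLIMIT_AS, memory_bytes),
        ):
            _, hard = resource.getrlimit(kind)
            if hard != resource.RLIM_INFINITY:
                wanted = min(wanted, hard)
            resource.setrlimit(kind, (wanted, wanted))

    # preexec_fn is not safe in multithreaded parents; acceptable for a sketch.
    return subprocess.run(
        argv,
        preexec_fn=apply_limits,
        capture_output=True,
        text=True,
        timeout=wall_seconds,
    )


if __name__ == "__main__":
    result = run_limited([sys.executable, "-c", "print('ok')"])
    print(result.returncode, result.stdout.strip())
```

A child that exceeds its CPU limit is terminated by a signal, which shows up as a nonzero `returncode`. A supervising process can use that to restart or reject the workload.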
Model sharing
If the multitenant design allows sharing models, make sure that tenants and users are aware of the security risks detailed here and that they are going to be practically running code provided by other users. Currently there are no good ways to detect malicious models/graphs/checkpoints, so the recommended way to mitigate the risk in this scenario is to sandbox the model execution.
Hardware attacks
Physical GPUs or TPUs can also be the target of attacks. Published research shows that it might be possible to use side channel attacks on the GPU to leak data from other running models or processes in the same system. GPUs can also have implementation bugs that might allow attackers to leave malicious code running and leak or tamper with applications from other users. Please report vulnerabilities to the vendor of the affected hardware accelerator.
Reporting vulnerabilities
Vulnerabilities in TensorFlow
This document covers different use cases for TensorFlow and notes, for each one, whether it is recommended or considered safe, or whether we recommend some form of isolation when dealing with untrusted data. As a result, this document also outlines what issues we consider as TensorFlow security vulnerabilities.
We recognize issues as vulnerabilities only when they occur in scenarios that we outline as safe; issues that have a security impact only when TensorFlow is used in a discouraged way (e.g. running untrusted models or checkpoints, data parsing outside of the safe formats, etc.) are not treated as vulnerabilities.
Reporting process
Please use the Google Bug Hunters reporting form to report security vulnerabilities. Please include the following information along with your report:
- A descriptive title.
- Your name and affiliation (if any).
- A description of the technical details of the vulnerabilities.
- A minimal example of the vulnerability. It is very important to let us know how we can reproduce your findings. For memory corruption triggerable in TensorFlow models, please demonstrate an exploit against one of Alphabet's models in https://tfhub.dev/
- An explanation of who can exploit this vulnerability, and what they gain when doing so. Write an attack scenario that demonstrates how your issue violates the use cases and security assumptions defined in the threat model. This will help us evaluate your report quickly, especially if the issue is complex.
- Whether this vulnerability is public or known to third parties. If it is, please provide details.
We will try to fix the problems as soon as possible. Vulnerabilities will, in general, be batched to be fixed at the same time as a quarterly release. We credit reporters for identifying security issues, although we keep your name confidential if you request it. Please see the Google Bug Hunters program website for more info.