# Using TensorFlow Securely

This document discusses the TensorFlow security model. It describes the security
risks to consider when using models, checkpoints or input data for training or
serving. We also provide guidelines on what constitutes a vulnerability in
TensorFlow and how to report them.

This document applies to other repositories in the TensorFlow organization,
covering security practices for the entirety of the TensorFlow ecosystem.

## TensorFlow models are programs

TensorFlow
[**models**](https://developers.google.com/machine-learning/glossary/#model) (to
use a term commonly used by machine learning practitioners) are expressed as
programs that TensorFlow executes. TensorFlow programs are encoded as
computation
[**graphs**](https://developers.google.com/machine-learning/glossary/#graph).
Since models are practically programs that TensorFlow executes, using untrusted
models or graphs is equivalent to running untrusted code.

If you need to run untrusted models, execute them inside a
[**sandbox**](https://developers.google.com/code-sandboxing). Memory corruptions
in TensorFlow ops can be recognized as security issues only if they are
reachable and exploitable through production-grade, benign models.

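As a hedged illustration of that recommendation, the sketch below keeps
untrusted model execution out of the main process by loading and running the
model in a separate worker process, which can then be confined with an OS-level
sandbox (container, seccomp profile, nsjail, etc.). The worker script name,
flags, and paths are hypothetical; process separation alone is not a sandbox.

```python
# Minimal sketch: run an untrusted SavedModel in a separate, confined process.
# "run_untrusted_model.py", its flag, and the paths below are hypothetical.
import subprocess

result = subprocess.run(
    ["python", "run_untrusted_model.py", "--model_dir=/tmp/untrusted_model"],
    capture_output=True,
    timeout=60,   # bound the runtime of the untrusted graph
    check=False,
)
print(result.stdout.decode())
```
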
### Compilation

Compiling models via the recommended entry points described in the
[XLA](https://www.tensorflow.org/xla) and
[JAX](https://jax.readthedocs.io/en/latest/jax-101/02-jitting.html)
documentation should be safe. However, some of the testing and debugging tools
that come with the compiler are not designed to be used with untrusted data and
should be used with caution when working with untrusted models.

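For illustration, compiling through the public JIT entry points looks like the
sketch below, which uses `tf.function(jit_compile=True)`, the documented XLA
entry point in TensorFlow; the shapes and values are arbitrary placeholders.

```python
import tensorflow as tf

# Compile through the documented XLA entry point rather than feeding untrusted
# artifacts to the compiler's testing or debugging tools.
@tf.function(jit_compile=True)
def dense_layer(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 16])
w = tf.random.normal([16, 4])
b = tf.zeros([4])
print(dense_layer(x, w, b).shape)  # (8, 4)
```
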
### Saved graphs and checkpoints

When loading untrusted serialized computation graphs (in the form of a
`GraphDef`, `SavedModel`, or equivalent on-disk format), the set of computation
primitives available to TensorFlow is powerful enough that you should assume
that the TensorFlow process effectively executes arbitrary code.

The risk of loading untrusted checkpoints depends on the code or graph that you
are working with. When loading untrusted checkpoints, the values of the traced
variables from your model will also be untrusted. That means that if your code
interacts with the filesystem, network, etc., and uses checkpointed variables as
part of those interactions (e.g., using a string variable to build a filesystem
path), a maliciously crafted checkpoint might be able to change the targets of
those operations, which could result in arbitrary reads, writes, or code
execution.

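A hedged sketch of that failure mode is shown below. The variable name and file
paths are made up, but the pattern of restoring a string-typed variable from a
checkpoint and then using it to build a filesystem path is exactly the kind of
interaction a malicious checkpoint can redirect.

```python
import tensorflow as tf

# Hypothetical model state: an output path stored as a checkpointed variable.
output_path = tf.Variable("/srv/model/outputs/result.txt")
ckpt = tf.train.Checkpoint(output_path=output_path)

# An attacker-controlled checkpoint simply stores a different value for the
# same variable...
attacker_value = tf.Variable("/home/user/.bashrc")
tf.train.Checkpoint(output_path=attacker_value).write("/tmp/evil_ckpt")

# ...so restoring it silently retargets any filesystem interaction built from
# the restored value.
ckpt.restore("/tmp/evil_ckpt")
print(output_path.numpy())  # b'/home/user/.bashrc'
# A subsequent tf.io.write_file(output_path, ...) would now hit the attacker's
# chosen target.
```
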
### Running a TensorFlow server

TensorFlow is a platform for distributed computing, and as such there is a
TensorFlow server (`tf.train.Server`). The TensorFlow server is intended for
internal communication only. It is not built for use in untrusted environments
or networks.

For performance reasons, the default TensorFlow server does not include any
authorization protocol and sends messages unencrypted. It accepts connections
from anywhere, and executes the graphs it is sent without performing any checks.
Therefore, if you run a `tf.train.Server` in your network, anybody with access
to the network can execute arbitrary code with the privileges of the user
running the `tf.train.Server`.

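To make the exposure concrete, here is a hedged sketch; the address, port, and
hostname are placeholders, and the two halves would normally live in separate
scripts. Once such a server is reachable, any client on the network can attach a
session to its gRPC target and run arbitrary ops with the privileges of the
server process.

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# --- Server side (victim): a bare server listening on all interfaces, with no
# authentication and no encryption. Port 2222 is a placeholder.
cluster = tf.train.ClusterSpec({"worker": ["0.0.0.0:2222"]})
server = tf.distribute.Server(cluster, job_name="worker", task_index=0)

# --- Client side (anyone who can reach port 2222): build an arbitrary graph
# and have the server execute it. "victim-host" is a placeholder.
with tf.compat.v1.Session("grpc://victim-host:2222") as sess:
    leak = tf.io.read_file("/etc/hostname")  # any op the attacker likes
    print(sess.run(leak))
```
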
## Untrusted inputs during training and prediction

TensorFlow supports a wide range of input data formats. For example, it can
process images, audio, videos, and text. There are several modules specialized
in taking those formats, modifying them, and/or converting them to intermediate
formats that can be processed by TensorFlow.

These modifications and conversions are handled by a variety of libraries that
have different security properties and provide different levels of confidence
when dealing with untrusted data. Based on the security history of these
libraries, we consider it safe to work with untrusted inputs for the PNG, BMP,
GIF, WAV, RAW, RAW\_PADDED, CSV, and PROTO formats. All other input formats,
including tensorflow-io, should be sandboxed if used to process untrusted data.

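As a hedged illustration of staying within those formats, the sketch below
decodes a PNG image and a WAV clip with TensorFlow's built-in ops; anything more
exotic (video containers, tensorflow-io plugins, and so on) would instead go
through a sandboxed preprocessing step. The file names are placeholders.

```python
import tensorflow as tf

# PNG and WAV are among the formats considered safe for untrusted input.
png_bytes = tf.io.read_file("untrusted_image.png")  # placeholder path
image = tf.io.decode_png(png_bytes, channels=3)

wav_bytes = tf.io.read_file("untrusted_audio.wav")  # placeholder path
audio, sample_rate = tf.audio.decode_wav(wav_bytes)

print(image.shape, audio.shape, int(sample_rate))
```
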
For example, if an attacker were to upload a malicious video file, they could
potentially exploit a vulnerability in the TensorFlow code that handles videos,
which could allow them to execute arbitrary code on the system running
TensorFlow.

It is important to keep TensorFlow up to date with the latest security patches
and follow the sandboxing guideline above to protect against these types of
vulnerabilities.

## Security properties of execution modes

TensorFlow has several execution modes, with Eager mode being the default in
v2. Eager mode lets users write imperative-style statements that can be easily
inspected and debugged; it is intended to be used during the development phase.

As part of the differences that make Eager mode easier to debug, the [shape
inference
functions](https://www.tensorflow.org/guide/create_op#define_the_op_interface)
are skipped, and any checks implemented inside the shape inference code are not
executed.

The security impact of skipping those checks should be low, since the attack
scenario would require a malicious user to be able to control the model, which,
as stated above, is already equivalent to code execution. In any case, the
recommendation is not to serve models using Eager mode, since it also has
performance limitations.

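As a hedged sketch of that serving recommendation, the snippet below wraps
inference in a `tf.function` and exports a SavedModel, so the served path runs
as a traced graph (with shape inference) rather than as eager Python code. The
class name, shapes, and export path are illustrative.

```python
import tensorflow as tf

class Inference(tf.Module):
    def __init__(self):
        super().__init__()
        self.w = tf.Variable(tf.random.normal([16, 4]))

    # Tracing through tf.function runs shape inference and produces a graph,
    # which is what gets exported and served instead of eager Python code.
    @tf.function(input_signature=[tf.TensorSpec([None, 16], tf.float32)])
    def __call__(self, x):
        return tf.matmul(x, self.w)

module = Inference()
tf.saved_model.save(module, "/tmp/served_model")  # illustrative export path
```
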
## Multi-Tenant environments

It is possible to run multiple TensorFlow models in parallel. For example,
`ModelServer` collates all computation graphs exposed to it (from multiple
`SavedModel`s) and executes them in parallel on available executors. Running
TensorFlow in a multitenant design mixes the risks described above with the
inherent ones from multitenant configurations. The primary areas of concern are
tenant isolation, resource allocation, model sharing, and hardware attacks.

### Tenant isolation

Since any tenants or users providing models, graphs or checkpoints can execute
code in the context of the TensorFlow service, it is important to design
isolation mechanisms that prevent unwanted access to data belonging to other
tenants.

Network isolation between different models is also important, not only to
prevent unauthorized access to data or models, but also to prevent malicious
users or tenants from sending graphs to execute under another tenant’s identity.

Designing and implementing these isolation mechanisms is the responsibility of
the user; therefore, security issues arising from their absence are not
considered vulnerabilities in TensorFlow.

### Resource allocation

A denial of service caused by one model could bring down the entire server, but
we don't consider this a vulnerability, given that models can exhaust resources
in many different ways and solutions exist to prevent this from happening
(e.g., rate limits, ACLs, monitors to restart broken servers).

### Model sharing

If the multitenant design allows sharing models, make sure that tenants and
users are aware of the security risks detailed here and that they will
effectively be running code provided by other users. Currently there are no good
ways to detect malicious models, graphs, or checkpoints, so the recommended way
to mitigate the risk in this scenario is to sandbox the model execution.

### Hardware attacks

Physical GPUs or TPUs can also be the target of attacks. [Published
research](https://scholar.google.com/scholar?q=gpu+side+channel) shows that it
might be possible to use side channel attacks on the GPU to leak data from other
running models or processes in the same system. GPUs can also have
implementation bugs that might allow attackers to leave malicious code running
and leak or tamper with applications from other users. Please report
vulnerabilities to the vendor of the affected hardware accelerator.

## Reporting vulnerabilities

### Vulnerabilities in TensorFlow

This document covers different use cases for TensorFlow, together with comments
on whether these uses are recommended or considered safe, and where we recommend
some form of isolation when dealing with untrusted data. As a result, it also
outlines which issues we consider to be TensorFlow security vulnerabilities.

We recognize issues as vulnerabilities only when they occur in scenarios that we
outline as safe; issues that have a security impact only when TensorFlow is used
in a discouraged way (e.g., running untrusted models or checkpoints, parsing
data outside of the safe formats, etc.) are not treated as vulnerabilities.

### Reporting process

Please use the [Google Bug Hunters reporting form](https://g.co/vulnz) to report
security vulnerabilities. Please include the following information along with
your report:

- A descriptive title.
- Your name and affiliation (if any).
- A description of the technical details of the vulnerabilities.
- A minimal example of the vulnerability. It is very important to let us know
  how we can reproduce your findings. For memory corruption triggerable in
  TensorFlow models, please demonstrate an exploit against one of Alphabet's
  models in <https://tfhub.dev/>.
- An explanation of who can exploit this vulnerability, and what they gain when
  doing so. Write an attack scenario that demonstrates how your issue violates
  the use cases and security assumptions defined in the threat model. This will
  help us evaluate your report quickly, especially if the issue is complex.
- Whether this vulnerability is public or known to third parties. If it is,
  please provide details.

We will try to fix the problems as soon as possible. Vulnerabilities will, in
general, be batched and fixed as part of a quarterly release. We credit
reporters for identifying security issues, although we keep your name
confidential if you request it. Please see the Google Bug Hunters program
website for more information.