TOSA Specification License ("License")
This Licence is a legal agreement between you and Arm Limited (“Arm”) for the use of Arm’s intellectual property (including, without limitation, any copyright) embodied in the relevant TOSA Specification accompanying this Licence (“Specification”). Arm licenses its intellectual property in the Specification to you on condition that you agree to the terms of this Licence. By using or copying the Specification you indicate that you agree to be bound by the terms of this Licence.
“Subsidiary” means any company the majority of whose voting shares is now or hereafter owned or controlled, directly or indirectly, by you. A company shall be a Subsidiary only for the period during which such control exists.
This Specification is NON-CONFIDENTIAL and any use by you and your Subsidiaries (“Licensee”) is subject to the terms of this Licence between you and Arm.
Subject to the terms and conditions of this Licence, Arm hereby grants to Licensee under the intellectual property in the Specification owned or controlled by Arm, a perpetual, non-exclusive, non-transferable, non-sub-licensable, royalty-free, worldwide licence to:
- (i) use and copy the Specification solely for the purpose of designing and having designed products that fully comply with the Specification;
- (ii) manufacture and have manufactured products which have been created under the licence granted in (i) above; and
- (iii) sell, supply and distribute products which have been created under the licence granted in (i) above.
Licensee hereby agrees that the licenses granted above are conditional on implementing the Specification in products in its entirety and shall not extend to any portion or function of a product that is not itself fully compliant with the Specification.
Except as expressly licensed above, Licensee acquires no right, title or interest in any Arm technology or any intellectual property embodied therein.
Your access to the information in the Specification is conditional upon your acceptance that you will not use or permit others to use the information for the purposes of determining whether implementations infringe any third party patents.
THE SPECIFICATION IS PROVIDED “AS IS”. ARM PROVIDES NO REPRESENTATIONS AND NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY, SATISFACTORY QUALITY, NON-INFRINGEMENT OR FITNESS FOR A PARTICULAR PURPOSE WITH RESPECT TO THE SPECIFICATION. Arm may make changes to the Specification at any time and without notice. For the avoidance of doubt, Arm makes no representation with respect to, and has undertaken no analysis to identify or understand the scope and content of, third party patents, copyrights, trade secrets, or other rights.
NOTWITHSTANDING ANYTHING TO THE CONTRARY CONTAINED IN THIS LICENCE, TO THE FULLEST EXTENT PERMITTED BY LAW, IN NO EVENT WILL ARM BE LIABLE FOR ANY DAMAGES, IN CONTRACT, TORT OR OTHERWISE, IN CONNECTION WITH THE SUBJECT MATTER OF THIS LICENCE (INCLUDING WITHOUT LIMITATION: (I) LICENSEE’S USE OF THE SPECIFICATION; AND (II) THE IMPLEMENTATION OF THE SPECIFICATION IN ANY PRODUCT CREATED BY LICENSEE UNDER THIS LICENCE). THE EXISTENCE OF MORE THAN ONE CLAIM OR SUIT WILL NOT ENLARGE OR EXTEND THE LIMIT. LICENSEE RELEASES ARM FROM ALL OBLIGATIONS, LIABILITY, CLAIMS OR DEMANDS IN EXCESS OF THIS LIMITATION.
This Licence shall remain in force until terminated by Licensee or by Arm. Without prejudice to any of its other rights, if Licensee is in breach of any of the terms and conditions of this Licence then Arm may terminate this Licence immediately upon giving written notice to Licensee. Licensee may terminate this Licence at any time. Upon termination of this Licence by Licensee or by Arm, Licensee shall stop using the Specification and destroy all copies of the Specification in its possession. Upon termination of this Licence, all terms shall survive except for the licence grants.
Any breach of this Licence by a Subsidiary shall entitle Arm to terminate this Licence as if you were the party in breach. Any termination of this Licence shall be effective in respect of all Subsidiaries. Any rights granted to any Subsidiary hereunder shall automatically terminate upon such Subsidiary ceasing to be a Subsidiary.
The Specification consists solely of commercial items. Licensee shall be responsible for ensuring that any use, duplication or disclosure of the Specification complies fully with any relevant export laws and regulations to assure that the Specification or any portion thereof is not exported, directly or indirectly, in violation of such export laws.
This Licence may be translated into other languages for convenience, and Licensee agrees that if there is any conflict between the English version of this Licence and any translation, the terms of the English version of this Licence shall prevail.
The Arm corporate logo and words marked with ® or ™ are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. Other brands and names mentioned in this Specification may be the trademarks of their respective owners. No licence, express, implied or otherwise, is granted to Licensee under this Licence, to use the Arm trade marks in connection with the Specification or any products based thereon. Visit Arm’s website at https://www.arm.com/company/policies/trademarks for more information about Arm’s trademarks.
The validity, construction and performance of this Licence shall be governed by English Law.
Copyright © 2020-2023 Arm Limited (or its affiliates). All rights reserved.
Arm Limited. Company 02557590 registered in England. 110 Fulbourn Road, Cambridge, England CB1 9NJ.
1. Introduction
1.1. Overview
Tensor Operator Set Architecture (TOSA) provides a set of whole-tensor operations commonly employed by Deep Neural Networks. The intent is to enable a variety of implementations running on a diverse range of processors, with the results at the TOSA level consistent across those implementations. Applications or frameworks which target TOSA can therefore be deployed on a wide range of different processors, such as SIMD CPUs, GPUs and custom hardware such as NPUs/TPUs, with defined accuracy and compatibility constraints. Most operators from the common ML frameworks (TensorFlow, PyTorch, etc.) should be expressible in TOSA. It is expected that there will be tools to lower from ML frameworks into TOSA.
1.2. Goals
The goals of TOSA include the following:
- A minimal and stable set of tensor-level operators to which machine learning framework operators can be reduced.
- Full support for both quantized integer and floating-point content.
- Precise functional description of the behavior of every operator, including the treatment of their numerical behavior in the case of precision, saturation, scaling, and range as required by quantized datatypes.
- Agnostic to any single high-level framework, compiler backend stack or particular target.
- The detailed functional and numerical description enables precise code construction for a diverse range of targets – SIMD CPUs, GPUs and custom hardware such as NPUs/TPUs.
1.3. Specification
The TOSA Specification is written as AsciiDoc mark-up and developed in its raw mark-up form, managed through a git repository here: https://git.mlplatform.org/tosa/specification.git/. The specification is developed and versioned much like software. While the mark-up is legible and can be read fairly easily in its raw form, it is recommended to build or “render” the mark-up into PDF or HTML. To do this, please follow the instructions in the README.md in the root of the specification repository.
1.4. Operator Selection Principles
TOSA defines a set of primitive operators to which higher level operators can be lowered in a consistent way. To remain effective and efficient to implement, the set of operators must be constrained to a reasonably small set of primitive operations out of which others can be constructed. The following principles govern the selection of operators within TOSA.
ID | Principle | Reason for this |
---|---|---|
P0 | An operator shall be a primitive operation or building block that cannot be decomposed into simpler whole tensor operations. | If the operator can be broken down, then we should look at the component operators. |
P1 | An operator shall be usable as a component out of which more complex operations can be constructed. | Single use operators have a high architectural cost and a more reusable version should be considered instead. |
P2 | Precision should be appropriate for the input and output data types. | Precision higher than that needed to calculate the result leads to extra implementation cost. |
P3 | Numerical definition of common sub-operations should be consistent between operators (for example: value scaling). | Consistent sub-operation definition reduces the operator implementation cost. |
P4 | The valid input and output ranges for all arguments shall be specified. | Ranges are required to make consistent (numerically agreeing) implementations possible. |
P5 | Integer operators shall be implementable in a bit-exact form with good efficiency on CPU, GPU and hardware targets. | Reduces implementation cost and gives consistent inference results. |
1.5. Profiles
TOSA supports three profiles that enable efficient implementation on different classes of device. The Base Inference profile is intended for embedded integer/fixed-point designs performing inference only. The Main Inference profile is intended for general inference functionality including integer and floating-point data types. The Main Training profile adds training operators in addition to inference operators. This version of the specification covers the Base Inference and Main Inference profiles. Main Training profile is expected in a later version of the specification. The following table summarizes the three profiles:
Profile | Name | Integer Inference | Floating-point Inference | Training |
---|---|---|---|---|
Base Inference | TOSA-BI | Yes | No | No |
Main Inference | TOSA-MI | Yes | Yes | No |
Main Training | TOSA-MT | Yes | Yes | Yes |
1.6. Levels
A TOSA level defines operator argument ranges that an implementation shall support. This is distinct from a profile that defines the operations and data-types supported. This version of the specification defines two TOSA levels:
- No level: allows the full range of arguments specified by the operations according to the operation data types.
- Level 8K: ranges are expected to be sufficient for applications with frame sizes up to 8K.
Later versions of the specification may define additional levels. The following table defines the value ranges for Level 1.0. These ranges are checked using the LEVEL_CHECK() function with the operator descriptions.
tosa_level_t | tosa_level_none | tosa_level_8K |
---|---|---|
Description | No level | Level 8K |
MAX_RANK | 32 | 6 |
MAX_KERNEL | 2147483647 | 8192 |
MAX_STRIDE | 2147483647 | 8192 |
MAX_SCALE | 2048 | 256 |
MAX_LOG2_SIZE | 63 | 31 |
MAX_NESTING | 256 | 6 |
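For illustration only, the level parameters above can be modeled as a record and checked the way the LEVEL_CHECK() calls in the operator descriptions do. The Python names below (TosaLevel, level_check) are assumptions for this sketch, not part of the specification:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TosaLevel:  # hypothetical container for the Levels table above
    MAX_RANK: int
    MAX_KERNEL: int
    MAX_STRIDE: int
    MAX_SCALE: int
    MAX_LOG2_SIZE: int
    MAX_NESTING: int

# Values taken from the Levels table.
tosa_level_none = TosaLevel(32, 2147483647, 2147483647, 2048, 63, 256)
tosa_level_8k = TosaLevel(6, 8192, 8192, 256, 31, 6)

def level_check(condition: bool) -> None:
    # Plays the role of LEVEL_CHECK(): a failing check means the graph
    # exceeds the declared level and must be rejected.
    if not condition:
        raise ValueError("level check failed")

# A rank-4 tensor with an 8x8 kernel is acceptable at Level 8K:
level_check(4 <= tosa_level_8k.MAX_RANK)
level_check(8 <= tosa_level_8k.MAX_KERNEL)
```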
1.7. Status
The TOSA specification is a work in progress.
- The Base Inference profile should be considered to be near release quality, with conformance tests available.
- The Main Inference profile has most of the expected operators in place, but is still subject to change.
- The reference model and conformance tests do not yet support all of the floating-point types that have been defined.
- There is not currently a conformance test suite available for Main Inference.
- Main Training profile is pre-alpha; significant work still needs to be done for the profile, and no conformance tests are available.
1.8. Compliance
This section defines when a TOSA implementation is compliant to a given TOSA specification profile and level. To be compliant an implementation must achieve the results and accuracy defined by this specification. TOSA also defines a set of conformance tests. A compliant implementation must pass the conformance tests. The conformance tests are not exhaustive, so an implementation that passes the conformance tests may not be compliant if there is a non-compliance that is undetected by the tests.
1.8.1. Base Inference Profile Compliance
The Operator Graphs section of this specification defines a TOSA graph and the behavior defined for a TOSA graph. This behavior is captured in the pseudo-code function tosa_execute_graph(). For a given input graph (with attributes) and input tensors there are three possible tosa_graph_result values after executing the graph:
- tosa_unpredictable: The result of the graph on the given inputs cannot be relied upon.
- tosa_error: The graph does not meet the specification and is recognised as an illegal graph.
- tosa_valid: The result is defined and predictable and the list of output tensors defines the result.
An implementation is compliant to the TOSA Base Inference Profile if it matches the above results as follows:
- For tosa_unpredictable, the implementation can return whatever result it chooses (including error)
- For tosa_error, the implementation must return an error result (and there is no requirement on how much of the graph is executed, if any)
- For tosa_valid, the implementation must execute the entire graph without error and return the result defined by this specification.
In terms of pseudo-code, if graph is a TOSA graph consisting of Base Inference Profile operators and input_list is a list of input tensors, then the following test must pass.
bool tosa_test_compliance(tosa_graph_t graph, tosa_list_t input_list, tosa_level_t level) {
    shape_list_t output_list_spec = tosa_allocate_list(tosa_output_shape(graph));
    shape_list_t output_list_test = tosa_allocate_list(tosa_output_shape(graph));
    tosa_graph_result = tosa_valid;  // result starts as valid
    tosa_nesting_depth = 0;          // if/while nesting level
    tosa_execute_graph(graph, input_list, output_list_spec, level);
    if (tosa_graph_result == tosa_unpredictable) {
        return true;  // No requirement to match an unpredictable result
    }
    result_test = execute_implementation_under_test(graph, input_list, output_list_test);
    if (tosa_graph_result == tosa_error) {
        return result_test == tosa_error;  // result must be an error
    }
    if (exact_tensor_match(output_list_spec, output_list_test)) {
        // Predictable bit-exact value match required
        return true;
    }
    return false;
}
1.8.2. Main Inference Profile Compliance
A Main Inference compliant implementation must satisfy the following:
- The implementation must meet Base Inference Profile Compliance for all Base Inference compliant graphs
- The implementation must support all Main Inference operations using the datatype fp32_t
  - The operations must meet the precision requirements of Main Inference precision requirements
- The implementation must support all Main Inference operations using the datatype fp16_t
  - The operations must meet the precision requirements of Main Inference precision requirements
  - Note: These requirements allow fp16_t operations to be implemented using the fp32_t datatype
- The implementation must support all Main Inference operations using the datatype bf16_t
  - The operations must meet the precision requirements of Main Inference precision requirements
  - Note: These requirements allow bf16_t operations to be implemented using the fp32_t datatype
As with Base Inference Profile Compliance the pseudo-code function tosa_execute_graph() can return one of three possible results. A compliant implementation must satisfy the following:
- For a graph returning tosa_error the implementation must also return an error
- For a graph returning tosa_valid the implementation must execute the entire graph without error
- For a graph returning tosa_valid and consisting only of integer operators the results must match exactly
Main Inference precision requirements
In a compliant implementation, individual floating-point operations within the graph must meet the accuracy bounds listed in the table below. In the table, ulp means unit in the last place.
Note: The error criteria in this section are at an early draft stage and are likely to change during conformance test development.
The following criteria apply to all operations:
- If any input is a NaN and the result is floating-point then the result must be a NaN
- If any input is a NaN and the operation is a comparison (greater, greater-equal, equal) then the result must be false
- If any input is a NaN and the operation is conversion to an integer or boolean then the result is unpredictable
Operation | Accuracy bound |
---|---|
ARGMAX, MAX_POOL2D, CLAMP, MAXIMUM, MINIMUM, ABS, NEGATE, CONST, IDENTITY | Non NaN results must be exact. |
The result must be exact with: | |
CONV2D, CONV3D, DEPTHWISE_CONV2D, FULLY_CONNECTED, MATMUL, TRANSPOSE_CONV2D | Each output can be expressed as a dot product of two input vectors. |
Each output can be expressed as a dot product of an input vector with a constant coefficient vector. | |
Floating-point result overflows must be set to infinity of the correct sign. | |
Floating-point result overflows must be set to infinity of the correct sign. | |
If the input is a zero or the result overflows, the output must be an infinity of the same sign. | |
If the input is less than zero the result must be a NaN. | |
If the input to LOG is less than zero then the result must be a NaN. | |
Each output can be expressed as a dot product of an input vector with a vector of ones. | |
Each output can be expressed as a dot product of an input vector with a vector with elements 1/KS where KS is the kernel size. | |
Result overflows must be set to an infinity of the correct sign. |
Dot product accuracy requirements
This section assumes an operation acting on tensors named 'input', 'weight' and optionally 'bias'. Each output tensor element can be expressed as a dot product of elements between the 'input' and 'weight' tensors with optional bias addition. The dot product has length KS, the kernel size. If the operation does not specify a bias then 'bias' is taken to be zero in this section. Note: KS is defined for each relevant operator in the appendix section Main Inference operator test data.
In other words, each output element out can be expressed as a dot product between input elements in[k], weight elements w[k], and bias b:

out = in[0] * w[0] + in[1] * w[1] + … + in[KS-1] * w[KS-1] + b

The positions of in[k], w[k], and b in the input, weight and bias tensors depend on the operation being performed. This may be, for example, a convolution.
This section defines the accuracy required for these operations. In this section:
- "fp64 arithmetic" refers to double-precision floating-point arithmetic defined by IEEE 754 (Other publications[1])
- operation_fp64() is an fp64 reference implementation of the operation
- operation_imp() is the implementation under test
- local_bound is defined as follows:
  - For operations with a local_bound attribute it is the value of the optional attribute, with a default value of false
  - For operations that do not have a local_bound attribute the value is true
- The checks described in the following code must pass for the following data sets:
  - Data sets defined for the operation in Appendix A Main Inference operator test data.
  - Data sets that have at least MIN_DOT_PRODUCT different output values. For these data sets we take S=-1.
output_ref = operation_fp64(input, weight, bias);
output_imp = operation_imp (input, weight, bias);
input_abs  = abs(input);   // Element-wise absolute
weight_abs = abs(weight);  // Element-wise absolute
bias_abs   = abs(bias);    // Element-wise absolute
if (!local_bound) {
    input_abs_max = max_value(input_abs);  // maximum over all elements
    for_each(index in shape(input_abs)) {
        input_abs[index] = input_abs_max;  // set all entries to the global maximum
    }
}
output_bnd = operation_fp64(input_abs, weight_abs, bias_abs);

size_t T = tensor_size(output_shape);  // number of dot product results
size_t ksb = (max_value(bias_abs) > 0) ? (KS + 1) : KS;  // kernel size and bias
fp64_t out_err_sum = 0.0;
fp64_t out_err_sumsq = 0.0;
fp64_t acc_prec;        // 1<<(M+1) where M is the number of mantissa bits
fp64_t acc_min_normal;  // accumulator minimum normal greater than zero
fp64_t two_m63 = -1.0 / static_cast<fp64_t>((int64_t)-1 << 63);  // pow(2, -63)
switch (acc_t) {
    case fp32_t:
        acc_prec = static_cast<fp64_t>(1 << 24);  // pow(2, 24)
        acc_min_normal = two_m63 * two_m63;       // pow(2, -126)
        break;
    case fp16_t:
        acc_prec = static_cast<fp64_t>(1 << 11);               // pow(2, 11)
        acc_min_normal = 1.0 / static_cast<fp64_t>(1 << 14);   // pow(2, -14)
        break;
    default:
        ERROR_IF(true);
}
for_each(index in output_shape) {
    fp64_t out_bnd = tensor_read<fp64_t>(output_bnd, output_shape, index);
    fp64_t out_ref = tensor_read<fp64_t>(output_ref, output_shape, index);
    acc_t  out_imp = tensor_read<acc_t> (output_imp, output_shape, index);
    fp64_t out_err;
    if ((acc_t)out_bnd == infinity) {
        // dot product can overflow and there is no accuracy limit
        out_err = 0.0;
    } else if (out_bnd == 0.0) {
        REQUIRE(out_ref == 0.0 && out_imp == 0.0);
        out_err = 0.0;
    } else {  // 0.0 < out_bnd < infinity
        out_bnd = max(out_bnd, acc_min_normal);
        out_err = (static_cast<fp64_t>(out_imp) - out_ref) * acc_prec / out_bnd;
        REQUIRE(abs(out_err) <= ksb);
    }
    out_err_sum   += out_err;
    out_err_sumsq += out_err * out_err;
}
if (input and weights are data set S with 3 <= S <= 5) {
    // check output error bias magnitude for data sets S which are not positive biased
    REQUIRE(abs(out_err_sum) <= 2 * sqrt(ksb * T));
}
// check output error variance magnitude
REQUIRE(out_err_sumsq <= 0.4 * ksb * T);
1.9. Tensor Definitions
1.9.1. Tensors
Tensors are multidimensional arrays of data. Tensors have metadata associated with them that describe characteristics of the tensor, including:
- Data Type
- Shape
The number of dimensions in a shape is called the rank. A tensor with rank equal to zero is permitted. In that case, the tensor has a single entry and is also known as a scalar. A tensor shape is an array of integers of size equal to the rank of the tensor. Each element in the tensor shape describes the number of elements in the dimension. The tensor shape in each dimension must be greater than or equal to 1. For tensor access information, see Tensor Access Helpers.
The shape of a tensor of non-zero rank is itself a tensor of rank 1 with elements of type shape_t. The single dimension has a size equal to the rank of the original tensor. In this specification a shape-tensor means a rank 1 tensor with elements of type shape_t. The components of a shape tensor are rank 0 tensors of type shape_t.
Some operations can process rank 0 or rank 1 tensors of type shape_t. For these operations, shape_t is permitted as an input or output tensor data type. In this version of the specification, shape_t values must be resolvable to constants at backend compile time.
1.9.2. Tensor size limit
The tensor overall size is limited by the data type size_t. This type must be able to hold integers in the range 0 to (1<<(MAX_LOG2_SIZE+1)) - 1 where MAX_LOG2_SIZE is defined in Levels. For each tensor, the number of tensor elements multiplied by the element size in bytes (which is taken to be 1 for elements smaller than 8 bits) must be less than or equal to (1<<(MAX_LOG2_SIZE+1)) - 1.
The size of tensors along each of their dimensions is limited by the data type index_t. This type must be able to hold integers in the range 0 to (1<<MAX_LOG2_SIZE) - 1 where MAX_LOG2_SIZE is defined in Levels. This means that the maximum size of a tensor along each dimension is (1<<MAX_LOG2_SIZE) - 1 and therefore the maximum coordinate value is (1<<MAX_LOG2_SIZE) - 2. Indices used to access tensors must be non-negative.
The type shape_t, used in shape tensors, must be able to hold integers in the range -(1<<MAX_LOG2_SIZE) to (1<<MAX_LOG2_SIZE) - 1.
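The size_t, index_t and shape_t ranges above are all derived from MAX_LOG2_SIZE. A small illustrative sketch (the function name is an assumption, not normative):

```python
# Derive the limits stated above from MAX_LOG2_SIZE
# (31 for Level 8K, 63 for "No level").
def tensor_limits(max_log2_size: int) -> dict:
    return {
        "max_tensor_bytes": (1 << (max_log2_size + 1)) - 1,  # size_t maximum
        "max_dim_size": (1 << max_log2_size) - 1,            # index_t maximum
        "max_coordinate": (1 << max_log2_size) - 2,
        "shape_t_min": -(1 << max_log2_size),
        "shape_t_max": (1 << max_log2_size) - 1,
    }

limits_8k = tensor_limits(31)
# At Level 8K no dimension may exceed 2**31 - 1 elements, and the whole
# tensor is limited to 2**32 - 1 bytes.
```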
1.9.3. Data Layouts
The following data layouts are supported in TOSA. TOSA operations are defined in terms of a linear packed tensor layout. In a linear packed layout a rank r tensor has elements of dimension (r-1) consecutive. The next to increment is dimension (r-2) and so on. For a specification of this layout see the tensor read and write functions in section Tensor Access Helpers.
An implementation of TOSA can choose a different tensor memory layout provided that the operation behavior is maintained.
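The linear packed layout amounts to the usual row-major offset computation. A minimal sketch, not part of the specification:

```python
def linear_offset(shape, index):
    # Linear packed (row-major) layout: dimension r-1 is consecutive,
    # then dimension r-2, and so on, matching the description above.
    offset = 0
    for dim_size, i in zip(shape, index):
        assert 0 <= i < dim_size
        offset = offset * dim_size + i
    return offset

# Rank-3 example with shape [2, 3, 4]: element [1, 2, 3] is the very
# last of the 24 elements, at offset 23.
assert linear_offset([2, 3, 4], [1, 2, 3]) == 23
```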
Name | Description of dimensions | Usage |
---|---|---|
NHWC | Batch, Height, Width, Channels | Feature maps |
NDHWC | Batch, Depth, Height, Width, Channels | Feature maps for 3D convolution |
OHWI | Output channels, Filter Height, Filter Width, Input channels | Weights |
HWIM | Filter Height, Filter Width, Input channels, Channel Multiplier | Weights for depthwise convolutions |
DOHWI | Depth, Output Channels, Filter Height, Filter Width, Input Channels | Weights for 3D convolution |
1.9.4. Broadcasting
In operations where broadcasting is supported, an input shape dimension can be broadcast to an output shape dimension if the input shape dimension is 1. TOSA broadcast requires the rank of both tensors to be the same. A RESHAPE can be done to create a compatible tensor with appropriate dimensions of size 1. To map indices in an output tensor to those of an input tensor, see [Broadcast Helper].
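A sketch of the index mapping this rule implies (names are illustrative; the Broadcast Helper is the normative definition):

```python
def broadcast_index(out_index, in_shape):
    # Map an output index to an input index: a dimension of size 1 in
    # the input is broadcast, so its index collapses to 0. Ranks must
    # match, as required by the TOSA broadcasting rule.
    assert len(out_index) == len(in_shape)
    return [0 if dim == 1 else i for i, dim in zip(out_index, in_shape)]

# An input of shape [1, 3] used with an output of shape [2, 3]:
assert broadcast_index([1, 2], [1, 3]) == [0, 2]
```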
1.9.5. Supported Number Formats
The following number formats are defined in TOSA. The number formats supported by a given operator are listed in its table of supported types.
Format | Minimum | Maximum | Description |
---|---|---|---|
bool_t | - | - | Boolean value. Size implementation defined. The TOSA reference model implements this as int8_t with 0 for false and 1 for true. All non-zero values are accepted on input as true. |
i4_t | - | - | Signless 4-bit integer type. Will be interpreted as int4_t by all operators |
int4_t | -7 | +7 | Signed 4-bit two’s-complement value. Excludes -8 to maintain a range that is symmetric about zero for weights. |
i8_t | - | - | Signless 8-bit integer value. Will be interpreted as int8_t unless otherwise specified by an operator. |
int8_t | -128 | +127 | Signed 8-bit two’s-complement value. |
uint8_t | 0 | 255 | Unsigned 8-bit integer value. |
i16_t | - | - | Signless 16-bit integer type. Will be interpreted as int16_t unless otherwise specified by an operator. |
int16_t | -32768 | +32767 | Signed 16-bit two’s-complement value. |
uint16_t | 0 | 65535 | Unsigned 16-bit value. |
i32_t | - | - | Signless 32-bit integer value. Will be interpreted as int32_t by all operators. |
int32_t | -(1<<31) | (1<<31)-1 | Signed 32-bit two’s-complement value. |
i48_t | - | - | Signless 48-bit integer value. Will be interpreted as int48_t by all operators. |
int48_t | -(1<<47) | (1<<47)-1 | Signed 48-bit two’s-complement value. |
fp16_t | -infinity | +infinity | 16-bit half-precision floating-point defined by Other publications[1]. |
bf16_t | -infinity | +infinity | 16-bit brain floating-point defined as bits [31:16] of the fp32_t format. |
fp32_t | -infinity | +infinity | 32-bit single-precision floating-point defined by Other publications[1]. |
fp64_t | -infinity | + infinity | 64-bit double-precision floating-point defined by Other publications[1]. |
Note: In this specification minimum<type> and maximum<type> will denote the minimum and maximum values of the data as stored in memory (ignoring the zero point). The minimum and maximum values for each type are given in the preceding table.
Note: Integer number formats smaller than 8 bits may be used provided that the numerical result is the same as using a sequence of 8-bit TOSA operations. For example, a convolution with low precision data must equal that of running the convolution at 8 bits and then clipping the result to the permitted output range. This ensures that a Base Inference profile TOSA implementation can calculate the same result.
1.10. Integer Behavior
TOSA integer inputs and outputs are specified by signless values with the given number of bits. Unless otherwise specified, these values will be interpreted as signed two’s-complement. The pseudocode will use int*_t to indicate use as a signed value and uint*_t to indicate use as an unsigned value. If overflow occurs during an integer calculation, the result is unpredictable, as indicated by the REQUIRE checks in the pseudocode for the operators.
Unsigned 8 and 16-bit values are only allowed in the RESCALE operation, to allow for compatibility with networks which expect unsigned 8-bit or 16-bit tensors for input and output.
1.10.1. Quantization
Machine Learning frameworks may represent tensors with a quantized implementation, using integer values to represent the original floating-point numbers. TOSA integer operations do not perform any implicit scaling to represent quantized values. Required zero point values are passed to the operator as necessary, and will be processed according to the pseudocode for each operator.
To convert a network containing quantized tensors to TOSA, generate explicit RESCALE operators for any change of quantization scaling. This reduces quantized operations to purely integer operations.
As an example, an ADD between two quantized tensors requires the integer values represent the same range. The scale arguments for RESCALE can be calculated to ensure that the resulting tensors represent the same range. Then the ADD is performed, and a RESCALE can be used to ensure that the result is scaled properly.
RESCALE provides support for per-tensor and per-channel scaling values to ensure compatibility with a range of possible quantization implementations.
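As a hypothetical illustration of the ADD example above, with invented scale values, and plain floating-point rounding standing in for the integer-only RESCALE machinery:

```python
# Two int8 tensors at different quantization scales are rescaled into a
# common range, added, then rescaled to the output scale. All values and
# scales here are invented for illustration.
def rescale(values, in_scale, out_scale):
    return [round(v * in_scale / out_scale) for v in values]

a, a_scale = [40, -20], 0.05   # represents [2.0, -1.0]
b, b_scale = [30, 60], 0.10    # represents [3.0, 6.0]

common = 0.01                  # common range chosen by the compiler
acc = [x + y for x, y in zip(rescale(a, a_scale, common),
                             rescale(b, b_scale, common))]
out = rescale(acc, common, 0.05)   # result back at scale 0.05
# out represents [5.0, 5.0], i.e. [100, 100] at scale 0.05
```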
1.10.2. Precision scaling
TOSA uses the RESCALE operation to scale between values with differing precision. The RESCALE operator is defined using an integer multiply, add, and shift. This guarantees that all TOSA implementations will return the same result for a RESCALE, including those with no support for floating-point numbers.
This TOSA specification supports two precisions of multiplier: 16-bit and 32-bit. The 32-bit multiplier version supports two rounding modes to enable simpler lowering of existing frameworks that use two stage rounding. All arithmetic is designed so that it does not overflow a 64-bit accumulator and that the final result fits in 32 bits. In particular a 48-bit value can only be scaled with the 16-bit multiplier.
The apply_scale functions provide a scaling of approximately (multiplier * 2^-shift). The shift and value range is limited to allow a variety of implementations. The limit of 62 on shift allows the shift to be decomposed as two right shifts of 31. The limit on value allows implementations that left shift the value before the multiply in the case of shifts of 32 or less. For example, in the case shift=30 an implementation of the form ((value<<2) * multiplier + round)>>32 can be used. A scaling range of 2^+12 down to 2^-32 is supported for both functions with a normalized multiplier.
For example, in typical usage a scaling of m * 2^-n where m is a fraction in the range 1.0 <= m < 2.0 can be represented using multiplier=(1<<30)*m, shift=(30+n) for apply_scale_32() and multiplier=(1<<14)*m, shift=(14+n) for apply_scale_16(). The values to achieve a scaling of 1.0 are shift=30, multiplier=1<<30 for apply_scale_32 and shift=14, multiplier=1<<14 for apply_scale_16.
int32_t apply_scale_32(int32_t value, int32_t multiplier, int8_t shift, bool_t double_round=false) {
REQUIRE(multiplier >= 0);
REQUIRE(2 <= shift && shift <= 62);
REQUIRE(value >= (-1 << (shift - 1)) && value < (1 << (shift - 1)));
int64_t round = 1 << (shift - 1);
if (double_round) {
if (shift > 31 && value >= 0) round += 1<<30;
if (shift > 31 && value < 0) round -= 1<<30;
}
int64_t result = static_cast<int64_t>(value) * multiplier + round;
result = result >> shift;
// result will fit a 32-bit range due to the REQUIRE on value
return static_cast<int32_t>(result);
}
int32_t apply_scale_16(int48_t value, int16_t multiplier, int8_t shift) {
    REQUIRE(multiplier >= 0);
    REQUIRE(2 <= shift && shift <= 62);
    int64_t round = (int64_t)1 << (shift - 1);
    int64_t result = static_cast<int64_t>(value) * multiplier + round;
    result = result >> shift;
    REQUIRE(result >= minimum<int32_t> && result <= maximum<int32_t>);
    return static_cast<int32_t>(result);
}
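As a concrete illustration, the 32-bit reference scaling above can be written as ordinary C++. This is a non-normative sketch: REQUIRE becomes assert, and every shifted constant is forced to 64 bits so that shifts up to 62 cannot overflow.

```cpp
#include <cassert>
#include <cstdint>

// Non-normative C++ sketch of apply_scale_32. double_round widens the
// rounding constant for shifts above 31, as in the reference pseudocode.
int32_t apply_scale_32(int32_t value, int32_t multiplier, int8_t shift,
                       bool double_round = false) {
    assert(multiplier >= 0);
    assert(2 <= shift && shift <= 62);
    assert(value >= -((int64_t)1 << (shift - 1)) &&
           value < ((int64_t)1 << (shift - 1)));
    int64_t round = (int64_t)1 << (shift - 1);
    if (double_round) {
        if (shift > 31 && value >= 0) round += (int64_t)1 << 30;
        if (shift > 31 && value < 0) round -= (int64_t)1 << 30;
    }
    int64_t result = ((int64_t)value * multiplier + round) >> shift;
    return (int32_t)result; // fits in 32 bits due to the bound on value
}
```

With multiplier=1<<30 and shift=30 this computes a scaling of 1.0 with round-to-nearest, matching the worked example above; shift=31 with the same multiplier halves the value.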
In some functions, the multiplier and shift are combined into a scale_t structure:
typedef struct {
    int32_t multiplier;
    int8_t shift;
} scale_t;
In places where a divide is required, we also use the function below to calculate an appropriate scaling value.
scale_t reciprocal_scale(uint32_t value) {
    REQUIRE(value > 0);
    scale_t scale;
    int32_t k = 32 - count_leading_zeros(value - 1); // (1 << k) / 2 < value <= (1 << k)
    int64_t numerator = (((int64_t)1 << 30) + 1) << k;
    scale.multiplier = numerator / value; // (1 << 30) <= multiplier < (1 << 31)
    scale.shift = 30 + k;
    return scale;
}
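Combining reciprocal_scale with apply_scale_32 approximates an integer division with round-to-nearest. The sketch below is non-normative; count_leading_zeros is written as a portable loop, and the example divides by 3.

```cpp
#include <cassert>
#include <cstdint>

// Non-normative sketch of reciprocal_scale plus the apply_scale_32 step
// it feeds, used here to divide by a runtime value.
struct scale_t { int32_t multiplier; int8_t shift; };

static int32_t count_leading_zeros(uint32_t v) {
    int32_t n = 0;
    for (uint32_t mask = UINT32_C(1) << 31; mask != 0 && (v & mask) == 0; mask >>= 1) n++;
    return n; // returns 32 for v == 0
}

scale_t reciprocal_scale(uint32_t value) {
    assert(value > 0);
    scale_t scale;
    int32_t k = 32 - count_leading_zeros(value - 1); // (1 << k)/2 < value <= (1 << k)
    int64_t numerator = (((int64_t)1 << 30) + 1) << k;
    scale.multiplier = (int32_t)(numerator / value); // in [1 << 30, 1 << 31)
    scale.shift = (int8_t)(30 + k);
    return scale;
}

int32_t apply_scale_32(int32_t value, int32_t multiplier, int8_t shift) {
    int64_t round = (int64_t)1 << (shift - 1);
    return (int32_t)(((int64_t)value * multiplier + round) >> shift);
}
```

Here reciprocal_scale(3) yields shift = 32, so apply_scale_32(9, multiplier, 32) evaluates (9 * multiplier + 2^31) >> 32, which rounds 9/3 to 3.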
1.10.3. Integer Convolutions
For the convolution operators, the input is not required to be scaled. The integer versions of the convolution operators will subtract the zero point from the integer values as defined for each operator. The convolution produces an accumulator output of type int32_t or int48_t. This accumulator output is then scaled to the final output range using the RESCALE operator. The scale applied in the RESCALE operator should be set to multiplier and shift values such that: multiplier * 2^(-shift) = (input_scale * weight_scale) / output_scale. Here, input_scale, weight_scale and output_scale are the conversion factors from integer to floating-point for the input, weight and output tensor values respectively. If per-channel scaling is needed then the per-channel option of the RESCALE operation should be used.
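One way a lowering might derive the RESCALE parameters is to normalize the real-valued scale into a fixed-point multiplier with frexp. This is a non-normative sketch: the helper name quantize_scale is an illustrative assumption, and a real lowering must also keep the resulting shift inside the [2, 62] range required by apply_scale_32.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Non-normative sketch: split a real scale into (multiplier, shift) with
// multiplier * 2^(-shift) == scale, multiplier normalized to [2^30, 2^31).
// quantize_scale is an illustrative name, not part of the specification.
void quantize_scale(double scale, int32_t* multiplier, int8_t* shift) {
    assert(scale > 0.0);
    int exp;
    double fraction = std::frexp(scale, &exp); // scale = fraction * 2^exp, fraction in [0.5, 1)
    int64_t m = std::llround(fraction * (double)((int64_t)1 << 31));
    if (m == ((int64_t)1 << 31)) { // rounding hit 2^31; renormalize
        m >>= 1;
        exp += 1;
    }
    *multiplier = (int32_t)m;
    *shift = (int8_t)(31 - exp); // multiplier * 2^(-shift) == scale
}
```

For scale = 1.0 this reproduces the canonical pair from the previous section: multiplier = 1<<30, shift = 30.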
1.10.4. Integer Elementwise Operators
When two quantized tensors are used in an operation, they must represent the same numeric range for the result to be valid. In this case, TOSA expects that RESCALE operators will be used as necessary to generate 32-bit integer values in a common range. There are many valid choices for scale factors and options for the common range. TOSA does not impose a requirement on which scale factors and range should be used. Compilers generating TOSA sequences should choose a range that allows the operation to be computed without overflow, while allowing the highest possible accuracy of the output.
1.10.5. General Unary Functions
General unary functions such as sigmoid(), tanh(), exp() for integer inputs are expressed using a lookup table and interpolation to enable efficient implementation. This also allows for other operations with the addition of user-supplied tables (the TABLE operation). All table lookups are based on the following reference lookup function that takes as input a table of 513 entries of 16 bits each.
int32_t apply_lookup_s(int16_t *table, int32_t value)
{
    int16_t clipped_value = static_cast<int16_t>(apply_clip_s<int32_t>(value, -32768, +32767));
    int32_t index = (clipped_value + 32768) >> 7;
    int32_t fraction = clipped_value & 0x7f;
    int16_t base = table[index];
    int16_t next = table[index + 1];
    int32_t slope = next - base;
    REQUIRE(slope >= minimum<int16_t> && slope <= maximum<int16_t>);
    int32_t return_value = (base << 7) + slope * fraction;
    return return_value; // return interpolated value of 16 + 7 = 23 bits
}
Note that although the table lookup defined here has 16-bit precision, for 8-bit only operations an 8-bit table can be derived by applying the reference function to each of the possible 256 input values. The following code constructs a 513-entry table based on a reference function.
void generate_lookup_table(int16_t *table, int32_t (*reference)(int32_t))
{
    for (int i = -256; i <= 256; i++) {
        int32_t value = (*reference)(i);
        table[i + 256] = static_cast<int16_t>(apply_clip_s<int32_t>(value, -32768, +32767));
    }
}
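The two reference functions above can be exercised together: generate a 513-entry table from a reference function, then interpolate. In this non-normative C++ sketch, apply_clip_s is replaced by std::clamp, and the reference function identity_ref (an illustrative choice, not from the specification) scales the table index by 128 so that the composed lookup approximates the identity over the int16 input range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Non-normative sketch of the table lookup. The table holds 513 entries
// covering inputs -32768..32768 in steps of 128.
int32_t apply_lookup_s(const int16_t* table, int32_t value) {
    int16_t clipped_value = static_cast<int16_t>(std::clamp(value, -32768, 32767));
    int32_t index = (clipped_value + 32768) >> 7; // which 128-wide segment
    int32_t fraction = clipped_value & 0x7f;      // position inside the segment
    int32_t base = table[index];
    int32_t slope = table[index + 1] - base;
    // base * 128 has the same value as (base << 7) without shifting a
    // possibly negative operand.
    return base * 128 + slope * fraction; // 16 + 7 = 23 bits of result
}

void generate_lookup_table(int16_t* table, int32_t (*reference)(int32_t)) {
    for (int i = -256; i <= 256; i++) {
        int32_t value = reference(i);
        table[i + 256] = static_cast<int16_t>(std::clamp(value, -32768, 32767));
    }
}

// Illustrative reference function (not from the specification): scale the
// table index by 128 so the composed lookup approximates the identity.
int32_t identity_ref(int32_t i) { return i * 128; }
```

With this table, an input value v maps to approximately 128 * v, i.e. the 23-bit interpolated output reproduces the clipped input shifted up by 7 bits.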
1.11. Other publications
The following publications are referred to in this specification, or provide more information:
-
IEEE Std 754-2008, IEEE Standard for Floating-point Arithmetic, August 2008.
2. Operators
2.1. Operator Arguments
Operators process input arguments to produce output arguments. Their behavior can be configured using attribute arguments. Arguments may have one of the following types:
-
tensor_t<element_type>, abbreviated T<element_type>, represents a tensor whose elements are of type element_type, where element_type can be any of the data types supported in TOSA.
-
tensor_list_t represents a list of tensors. When lists are homogeneous, i.e. contain tensors of the same type, their type is further qualified as follows: tensor_list_t<T<element_type>>.
-
tosa_graph_t represents a TOSA graph (see Operator Graphs).
Arguments belong to one of three categories: Input, Output, or Attribute. The category to which an argument belongs further constrains its type:
-
An Input argument must be a tensor or a list of tensors used to provide the data read by the operation.
-
An Output argument must be a tensor or a list of tensors into which the data produced by the operation is written.
-
An Attribute argument is constant, i.e. its value is known at compilation time. It may have any data type supported by TOSA.
2.2. Operator Graphs
A TOSA graph is a collection of TOSA operators where:
-
The output of an operator in the graph may be connected to one or more inputs of other operators in the graph
-
When an output is connected to an input the tensor list shapes must match
-
The attributes of the operators are defined and considered part of the graph
-
The attributes must be in the valid range permitted for the operator
-
The tensor dimensions must be in the valid range permitted for the operator
Some operators, such as control flow operators, take a graph of other operators as an attribute. The type tosa_graph_t denotes a graph of operators, and the following functions define the tensor shape lists for the graph inputs and outputs:
shape_list_t tosa_input_shape(tosa_graph_t graph);
shape_list_t tosa_output_shape(tosa_graph_t graph);
Similarly, the type tensor_list_t is used for a list of tensors, and the following function returns the shape of a tensor list:
shape_list_t tensor_list_shape(tensor_list_t tensor_list);
The following function denotes the execution of a TOSA graph within a TOSA context, on an input tensor list, to produce an output tensor list. A TOSA context, represented by tosa_context_t, provides the environment in which a TOSA graph is executed. Any side-effects that result from the execution of a graph within a context are not observable by graphs executing in a different context. Operators are executed in an implementation-defined order that must be a topological ordering of the TOSA graph.
tosa_execute_graph(tosa_context_t context, tosa_graph_t graph, tensor_list_t input_list, tensor_list_t output_list, tosa_level_t level) {
    ERROR_IF(tensor_list_shape(input_list) != tosa_input_shape(graph));
    ERROR_IF(tensor_list_shape(output_list) != tosa_output_shape(graph));
    // Declare the global list for storing persistent variable tensors across multiple graphs
    if (!variable_tensors) {
        variable_tensors = list<tensor_t>();
    } else { // Clear the "seen" flag
        for (tensor_t var_tensor in variable_tensors) {
            var_tensor.seen = false;
        }
    }
    for_each(operator in graph order) {
        ERROR_IF(operator input tensors do not meet requirement of operator Arguments inputs)
        ERROR_IF(operator attributes do not meet requirement of operator Arguments attributes)
        ERROR_IF(operator output tensors do not meet requirement of operator Arguments outputs)
        ERROR_IF(operator data types do not meet requirement of operator Supported Data Types)
        // Execute the operator as defined by the operation function pseudo-code
        tosa_execute_operator(context, operator, level);
    }
}
2.3. Tensor Operators
2.3.1. ARGMAX
This returns the index with the largest value across the given axis of the input tensor.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input | shape1 | 1 to MAX_RANK | Input tensor |
Attribute | T<i32_t> | axis | - | 0 | Axis in range from 0 to rank(shape1) - 1 |
Output | T<out_t> | output | shape | 0 to MAX_RANK - 1 | Output tensor, with rank = rank(shape1) - 1 |
Supported Data Types:
Profile | Mode | in_t | out_t |
---|---|---|---|
Any | signed 8 | i8_t | i32_t |
Any | signed 16 | i16_t | i32_t |
MI, MT | fp16 | fp16_t | i32_t |
MI, MT | bf16 | bf16_t | i32_t |
MI, MT | fp32 | fp32_t | i32_t |
Operation Function:
LEVEL_CHECK(rank(shape1) <= MAX_RANK);
ERROR_IF(axis < 0 || axis >= rank(shape1));
if (axis == 0) {
    left_shape = [];
} else {
    left_shape = shape1[0:axis - 1];
}
if (axis == rank(shape1) - 1) {
    right_shape = [];
} else {
    right_shape = shape1[axis + 1:rank(shape1) - 1];
}
ERROR_IF(flatten(left_shape, right_shape) != shape);
for_each(left_index in left_shape) {
    for_each(right_index in right_shape) {
        in_t max_value = minimum_s<in_t>;
        out_t max_index = 0;
        for (i = 0; i < shape1[axis]; i++) {
            dim_t index = flatten(left_index, [i], right_index);
            in_t value = tensor_read<in_t>(input, shape1, index);
            if (apply_max_s<in_t>(value, max_value) != max_value) {
                max_value = value;
                max_index = i;
            }
        }
        dim_t index = flatten(left_index, right_index);
        tensor_write<out_t>(output, shape, index, max_index);
    }
}
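The index bookkeeping above reduces, for the common case of reducing along the last axis of a rank-2 tensor, to a pair of loops. The sketch below is non-normative (the helper name argmax_last_axis is illustrative) and keeps the same tie-breaking rule as the pseudocode: the index only advances on a strictly larger value, so the first maximum wins.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Non-normative sketch: ARGMAX along the last axis of a rows x cols
// matrix held in row-major order.
std::vector<int32_t> argmax_last_axis(const std::vector<int8_t>& input,
                                      int rows, int cols) {
    std::vector<int32_t> output(rows);
    for (int r = 0; r < rows; r++) {
        int8_t max_value = input[r * cols];
        int32_t max_index = 0;
        for (int c = 1; c < cols; c++) {
            int8_t value = input[r * cols + c];
            if (value > max_value) { // ties keep the earlier index
                max_value = value;
                max_index = c;
            }
        }
        output[r] = max_index;
    }
    return output;
}
```

On the row {1, 5, 5, 2} this returns 1, not 2: the duplicate maximum keeps the first index, matching the strict comparison in the reference loop.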
2.3.2. AVG_POOL2D
This performs an average pooling over the given input tensor. A sliding window of the size given by the kernel attribute is passed over the input tensor, with the mean value being written to the output tensor. When calculating the average, the divisor is the number of valid input values in the window; padding positions are not counted.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | [N,IH,IW,C] | 4 | Input tensor |
Attribute | T<i32_t> | kernel | [2] | 1 | [kernel_y, kernel_x] |
Attribute | T<i32_t> | stride | [2] | 1 | [stride_y, stride_x] |
Attribute | T<i32_t> | pad | [4] | 1 | [pad_top, pad_bottom, pad_left, pad_right] |
Attribute | T<acc_size_t> | acc_size | - | 0 | Enumerated type, must be one of INT32, FP16, FP32, as defined in the Supported Data Types table for this operation |
Attribute | T<in_out_t> | input_zp | - | 0 | Input tensor zero point. Must be zero for non-int8 types. |
Attribute | T<in_out_t> | output_zp | - | 0 | Output tensor zero point. Must be zero for non-int8 types. |
Output | T<in_out_t> | output | [N,OH,OW,C] | 4 | Output tensor 4D |
Supported Data Types:
Profile | Mode | in_out_t | acc_t |
---|---|---|---|
Any | signed 8 with int32 accumulate | i8_t | i32_t |
Any | signed 16 with int32 accumulate | i16_t | i32_t |
MI, MT | fp16 with fp16 accumulate | fp16_t | fp16_t |
MI, MT | fp16 with fp32 accumulate | fp16_t | fp32_t |
MI, MT | bf16 with fp32 accumulate | bf16_t | fp32_t |
MI, MT | fp32 with fp32 accumulate | fp32_t | fp32_t |
Operation Function:
LEVEL_CHECK(kernel_y <= MAX_KERNEL);
LEVEL_CHECK(kernel_x <= MAX_KERNEL);
LEVEL_CHECK(stride_y <= MAX_STRIDE);
LEVEL_CHECK(stride_x <= MAX_STRIDE);
LEVEL_CHECK(pad_top <= MAX_KERNEL);
LEVEL_CHECK(pad_bottom <= MAX_KERNEL);
LEVEL_CHECK(pad_left <= MAX_KERNEL);
LEVEL_CHECK(pad_right <= MAX_KERNEL);
ERROR_IF(in_out_t != i8_t && input_zp != 0); // Zero point only for int8_t
ERROR_IF(in_out_t != i8_t && output_zp != 0); // Zero point only for int8_t
ERROR_IF(kernel_y < 1 || kernel_x < 1); // kernel size must be >= 1
ERROR_IF(stride_y < 1 || stride_x < 1);
ERROR_IF(pad_top < 0 || pad_bottom < 0 || pad_left < 0 || pad_right < 0);
// Padding must be less than kernel size to avoid
// a divide-by-zero.
ERROR_IF(pad_right >= kernel_x || pad_left >= kernel_x);
ERROR_IF(pad_top >= kernel_y || pad_bottom >= kernel_y);
ERROR_IF(OH != idiv_check(IH + pad_top + pad_bottom - kernel_y, stride_y) + 1);
ERROR_IF(OW != idiv_check(IW + pad_left + pad_right - kernel_x, stride_x) + 1);
for_each(0 <= n < N, 0 <= oy < OH, 0 <= ox < OW, 0 <= c < C) {
    in_out_t output_val;
    acc_t acc = 0;
    int count = 0;
    index_t iy = oy * stride_y - pad_top;
    index_t ix = ox * stride_x - pad_left;
    for_each(0 <= ky < kernel_y, 0 <= kx < kernel_x) {
        index_t y = iy + ky;
        index_t x = ix + kx;
        // Only values from the input tensor are used to calculate the
        // average; padding does not count
        if (0 <= y < IH && 0 <= x < IW) {
            count++;
            acc_t value = sign_extend<acc_t>(tensor_read<in_out_t>(input, [N,IH,IW,C], [n,y,x,c]));
            value = apply_sub_s<acc_t>(value, sign_extend<acc_t>(input_zp));
            acc = apply_add_s<acc_t>(acc, value);
        }
    }
    if (is_float(in_out_t)) {
        output_val = acc / static_cast<in_out_t>(count);
    } else {
        scale_t scale = reciprocal_scale(count);
        acc = apply_scale_32(acc, scale.multiplier, scale.shift, false);
        acc = apply_add_s<acc_t>(acc, sign_extend<acc_t>(output_zp));
        acc = apply_clip_s<acc_t>(acc, minimum_s<in_out_t>, maximum_s<in_out_t>);
        output_val = static_cast<in_out_t>(acc);
    }
    tensor_write<in_out_t>(output, [N,OH,OW,C], [n,oy,ox,c], output_val);
}
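The role of count in the loop above is easiest to see on a padded example. This non-normative fp32 sketch (single batch, single channel, symmetric padding; avg_pool2d_1c is an illustrative name) divides by the number of in-bounds positions rather than by kernel_y * kernel_x.

```cpp
#include <cassert>
#include <vector>

// Non-normative fp32 sketch of the pooling loop for N=1, C=1. The divisor
// is the count of in-bounds positions, so windows overlapping padding
// average fewer values instead of dividing by the full kernel area.
// pad < kernel (enforced by the spec's ERROR_IF rules) keeps count > 0.
std::vector<float> avg_pool2d_1c(const std::vector<float>& in, int IH, int IW,
                                 int kernel, int stride, int pad) {
    int OH = (IH + 2 * pad - kernel) / stride + 1;
    int OW = (IW + 2 * pad - kernel) / stride + 1;
    std::vector<float> out(OH * OW);
    for (int oy = 0; oy < OH; oy++) {
        for (int ox = 0; ox < OW; ox++) {
            float acc = 0.0f;
            int count = 0; // padding positions are never counted
            for (int ky = 0; ky < kernel; ky++) {
                for (int kx = 0; kx < kernel; kx++) {
                    int y = oy * stride - pad + ky;
                    int x = ox * stride - pad + kx;
                    if (0 <= y && y < IH && 0 <= x && x < IW) {
                        acc += in[y * IW + x];
                        count++;
                    }
                }
            }
            out[oy * OW + ox] = acc / count;
        }
    }
    return out;
}
```

On a 2x2 input with kernel 2, stride 1, pad 1, the corner windows cover exactly one valid element, so each corner output equals that element rather than a quarter of it.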
2.3.3. CONV2D
Performs a 2D convolution over the given tensor input, using the weight tensor.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input | [N,IH,IW,IC] | 4 | Input tensor |
Input | T<weight_t> | weight | [OC,KH,KW,IC] | 4 | Weight kernel size KH x KW |
Input | T<out_t> | bias | [BC] | 1 | Per output channel bias data. |
Attribute | T<i32_t> | pad | [4] | 1 | [pad_top, pad_bottom, pad_left, pad_right] |
Attribute | T<i32_t> | stride | [2] | 1 | [stride_y, stride_x] |
Attribute | T<i32_t> | dilation | [2] | 1 | [dilation_y, dilation_x] |
Attribute | T<in_t> | input_zp | - | 0 | Input tensor zero point. Must be zero for non-int8 types. |
Attribute | T<weight_t> | weight_zp | - | 0 | Weight zero point. Must be zero for non-int8 types. |
Attribute | T<bool_t> | local_bound | - | 0 | This optional attribute affects the floating-point compliance error bound. The default of false allows for direct and transform based, fast convolution algorithms. Only set to true if direct dot-product calculation precision is required. |
Output | T<out_t> | output | [N,OH,OW,OC] | 4 | Output tensor |
Supported Data Types:
Profile | Mode | in_t | weight_t | out_t |
---|---|---|---|---|
Any | signed 8x8 with int32 accumulate | i8_t | i8_t | i32_t |
Any | signed 8x4 with int32 accumulate | i8_t | i4_t | i32_t |
Any | signed 16x8 with int48 accumulate | i16_t | i8_t | i48_t |
MI, MT | fp16 with fp16 accumulate | fp16_t | fp16_t | fp16_t |
MI, MT | fp16 with fp32 accumulate | fp16_t | fp16_t | fp32_t |
MI, MT | bf16 with fp32 accumulate | bf16_t | bf16_t | fp32_t |
MI, MT | fp32 with fp32 accumulate | fp32_t | fp32_t | fp32_t |
Operation Function:
LEVEL_CHECK(dilation_y * KH <= MAX_KERNEL);
LEVEL_CHECK(dilation_x * KW <= MAX_KERNEL);
LEVEL_CHECK(pad_top <= MAX_KERNEL);
LEVEL_CHECK(pad_bottom <= MAX_KERNEL);
LEVEL_CHECK(pad_left <= MAX_KERNEL);
LEVEL_CHECK(pad_right <= MAX_KERNEL);
LEVEL_CHECK(stride_y <= MAX_STRIDE);
LEVEL_CHECK(stride_x <= MAX_STRIDE);
ERROR_IF(in_t != i8_t && input_zp != 0); // Zero point only for int8_t
ERROR_IF(weight_t != i8_t && weight_zp != 0);
ERROR_IF(pad_top < 0 || pad_bottom < 0 || pad_left < 0 || pad_right < 0);
ERROR_IF(stride_y < 1 || stride_x < 1);
ERROR_IF(dilation_y < 1 || dilation_x < 1);
ERROR_IF(OH != idiv_check(IH - 1 + pad_top + pad_bottom - (KH - 1) * dilation_y, stride_y) + 1);
ERROR_IF(OW != idiv_check(IW - 1 + pad_left + pad_right - (KW - 1) * dilation_x, stride_x) + 1);
ERROR_IF(BC != OC && BC != 1);
for_each(0 <= n < N, 0 <= oy < OH, 0 <= ox < OW, 0 <= oc < OC) {
    out_t acc = 0;
    index_t iy = oy * stride_y - pad_top;
    index_t ix = ox * stride_x - pad_left;
    for_each(0 <= ky < KH, 0 <= kx < KW, 0 <= ic < IC) {
        index_t y = iy + ky * dilation_y;
        index_t x = ix + kx * dilation_x;
        if (0 <= y < IH && 0 <= x < IW) {
            out_t value = static_cast<out_t>(tensor_read<in_t>(input, [N,IH,IW,IC], [n,y,x,ic]));
            out_t weight = static_cast<out_t>(tensor_read<weight_t>(weight, [OC,KH,KW,IC], [oc,ky,kx,ic]));
            value = apply_sub_s<out_t>(value, static_cast<out_t>(input_zp));
            weight = apply_sub_s<out_t>(weight, static_cast<out_t>(weight_zp));
            acc = apply_add_s<out_t>(acc, apply_mul_s<out_t>(value, weight));
        }
    }
    acc = apply_add_s<out_t>(acc, bias[(BC == 1) ? 0 : oc]);
    tensor_write<out_t>(output, [N,OH,OW,OC], [n,oy,ox,oc], acc);
}
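The ERROR_IF output-size checks above require the division performed by idiv_check to be exact. A non-normative helper (conv_output_dim is an illustrative name) makes the relation concrete, returning -1 where the specification would raise an error.

```cpp
#include <cassert>

// Non-normative sketch of the output-size relation checked by the
// ERROR_IF lines above. idiv_check requires the division to be exact;
// this helper returns -1 when the parameters do not divide evenly.
int conv_output_dim(int in_dim, int pad_before, int pad_after,
                    int kernel, int dilation, int stride) {
    int numerator = in_dim - 1 + pad_before + pad_after - (kernel - 1) * dilation;
    if (numerator < 0 || numerator % stride != 0) return -1; // idiv_check fails
    return numerator / stride + 1;
}
```

For an 8-wide input with a 3-wide kernel and padding of 1 on each side, stride 1 preserves the width, while stride 2 does not divide evenly and is rejected; dilation 2 widens the effective kernel to 5 and shrinks the output to 6.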
2.3.4. CONV3D
Performs a 3D convolution over the given input tensor.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input | [N,ID,IH,IW,IC] | 5 | Input tensor |
Input | T<weight_t> | weight | [OC,KD,KH,KW,IC] | 5 | Weight kernel size KDxKHxKW |
Input | T<out_t> | bias | [BC] | 1 | Per output channel bias data. |
Attribute | T<i32_t> | pad | [6] | 1 | [pad_d0, pad_d1, pad_top, pad_bottom, pad_left, pad_right] |
Attribute | T<i32_t> | stride | [3] | 1 | [stride_d, stride_y, stride_x] |
Attribute | T<i32_t> | dilation | [3] | 1 | [dilation_d, dilation_y, dilation_x] |
Attribute | T<in_t> | input_zp | - | 0 | Input tensor zero point. Must be zero for non-int8 types. |
Attribute | T<weight_t> | weight_zp | - | 0 | Weight zero point. Must be zero for non-int8 types. |
Attribute | T<bool_t> | local_bound | - | 0 | This optional attribute affects the floating-point compliance error bound. The default of false allows for direct and transform based, fast convolution algorithms. Only set to true if direct dot-product calculation precision is required. |
Output | T<out_t> | output | [N,OD,OH,OW,OC] | 5 | Output tensor |
Supported Data Types:
Profile | Mode | in_t | weight_t | out_t |
---|---|---|---|---|
Any | signed 8x8 with int32 accumulate | i8_t | i8_t | i32_t |
Any | signed 8x4 with int32 accumulate | i8_t | i4_t | i32_t |
Any | signed 16x8 with int48 accumulate | i16_t | i8_t | i48_t |
MI, MT | fp16 with fp16 accumulate | fp16_t | fp16_t | fp16_t |
MI, MT | fp16 with fp32 accumulate | fp16_t | fp16_t | fp32_t |
MI, MT | bf16 with fp32 accumulate | bf16_t | bf16_t | fp32_t |
MI, MT | fp32 with fp32 accumulate | fp32_t | fp32_t | fp32_t |
Operation Function:
LEVEL_CHECK(dilation_d * KD <= MAX_KERNEL);
LEVEL_CHECK(dilation_y * KH <= MAX_KERNEL);
LEVEL_CHECK(dilation_x * KW <= MAX_KERNEL);
LEVEL_CHECK(pad_d0 <= MAX_KERNEL);
LEVEL_CHECK(pad_d1 <= MAX_KERNEL);
LEVEL_CHECK(pad_top <= MAX_KERNEL);
LEVEL_CHECK(pad_bottom <= MAX_KERNEL);
LEVEL_CHECK(pad_left <= MAX_KERNEL);
LEVEL_CHECK(pad_right <= MAX_KERNEL);
LEVEL_CHECK(stride_y <= MAX_STRIDE);
LEVEL_CHECK(stride_x <= MAX_STRIDE);
LEVEL_CHECK(stride_d <= MAX_STRIDE);
ERROR_IF(in_t != i8_t && input_zp != 0); // Zero point only for int8_t
ERROR_IF(weight_t != i8_t && weight_zp != 0);
ERROR_IF(pad_d0 < 0 || pad_d1 < 0 || pad_top < 0 || pad_bottom < 0 || pad_left < 0 || pad_right < 0);
ERROR_IF(stride_d < 1 || stride_y < 1 || stride_x < 1);
ERROR_IF(dilation_d < 1 || dilation_y < 1 || dilation_x < 1);
ERROR_IF(OD != idiv_check(ID - 1 + pad_d0 + pad_d1 - (KD - 1) * dilation_d, stride_d) + 1);
ERROR_IF(OH != idiv_check(IH - 1 + pad_top + pad_bottom - (KH - 1) * dilation_y, stride_y) + 1);
ERROR_IF(OW != idiv_check(IW - 1 + pad_left + pad_right - (KW - 1) * dilation_x, stride_x) + 1);
ERROR_IF(BC != OC && BC != 1);
for_each(0 <= n < N, 0 <= od < OD, 0 <= oy < OH, 0 <= ox < OW, 0 <= oc < OC) {
    out_t acc = 0;
    index_t id = od * stride_d - pad_d0;
    index_t iy = oy * stride_y - pad_top;
    index_t ix = ox * stride_x - pad_left;
    for_each(0 <= kd < KD, 0 <= ky < KH, 0 <= kx < KW, 0 <= ic < IC) {
        index_t d = id + kd * dilation_d;
        index_t y = iy + ky * dilation_y;
        index_t x = ix + kx * dilation_x;
        if (0 <= x < IW && 0 <= y < IH && 0 <= d < ID) {
            out_t value = static_cast<out_t>(tensor_read<in_t>(input, [N,ID,IH,IW,IC], [n,d,y,x,ic]));
            out_t weight = static_cast<out_t>(tensor_read<weight_t>(weight, [OC,KD,KH,KW,IC], [oc,kd,ky,kx,ic]));
            value = apply_sub_s<out_t>(value, static_cast<out_t>(input_zp));
            weight = apply_sub_s<out_t>(weight, static_cast<out_t>(weight_zp));
            acc = apply_add_s<out_t>(acc, apply_mul_s<out_t>(value, weight));
        }
    }
    acc = apply_add_s<out_t>(acc, bias[(BC == 1) ? 0 : oc]);
    tensor_write<out_t>(output, [N,OD,OH,OW,OC], [n,od,oy,ox,oc], acc);
}
2.3.5. DEPTHWISE_CONV2D
Performs 2D convolutions separately over each channel of the given tensor input, using the weight tensor.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input | [N,H,W,C] | 4 | Input tensor |
Input | T<weight_t> | weight | [KH,KW,C,M] | 4 | Weight kernel size KH x KW |
Input | T<out_t> | bias | [BC] | 1 | Per output channel bias data. |
Attribute | T<i32_t> | pad | [4] | 1 | [pad_top, pad_bottom, pad_left, pad_right] |
Attribute | T<i32_t> | stride | [2] | 1 | [stride_y, stride_x] |
Attribute | T<i32_t> | dilation | [2] | 1 | [dilation_y, dilation_x] |
Attribute | T<in_t> | input_zp | - | 0 | Input tensor zero point. Must be zero for non-int8 types. |
Attribute | T<weight_t> | weight_zp | - | 0 | Weight zero point. Must be zero for non-int8 types. |
Attribute | T<bool_t> | local_bound | - | 0 | This optional attribute affects the floating-point compliance error bound. The default of false allows for direct and transform based, fast convolution algorithms. Only set to true if direct dot-product calculation precision is required. |
Output | T<out_t> | output | [N,OH,OW,C*M] | 4 | Output tensor |
Supported Data Types:
Profile | Mode | in_t | weight_t | out_t |
---|---|---|---|---|
Any | signed 8x8 with int32 accumulate | i8_t | i8_t | i32_t |
Any | signed 8x4 with int32 accumulate | i8_t | i4_t | i32_t |
Any | signed 16x8 with int48 accumulate | i16_t | i8_t | i48_t |
MI, MT | fp16 with fp16 accumulate | fp16_t | fp16_t | fp16_t |
MI, MT | fp16 with fp32 accumulate | fp16_t | fp16_t | fp32_t |
MI, MT | bf16 with fp32 accumulate | bf16_t | bf16_t | fp32_t |
MI, MT | fp32 with fp32 accumulate | fp32_t | fp32_t | fp32_t |
Operation Function:
LEVEL_CHECK(dilation_y * KH <= MAX_KERNEL);
LEVEL_CHECK(dilation_x * KW <= MAX_KERNEL);
LEVEL_CHECK(pad_top <= MAX_KERNEL);
LEVEL_CHECK(pad_bottom <= MAX_KERNEL);
LEVEL_CHECK(pad_left <= MAX_KERNEL);
LEVEL_CHECK(pad_right <= MAX_KERNEL);
LEVEL_CHECK(stride_y <= MAX_STRIDE);
LEVEL_CHECK(stride_x <= MAX_STRIDE);
ERROR_IF(in_t != i8_t && input_zp != 0); // Zero point only for int8_t
ERROR_IF(weight_t != i8_t && weight_zp != 0);
ERROR_IF(pad_top < 0 || pad_bottom < 0 || pad_left < 0 || pad_right < 0);
ERROR_IF(stride_y < 1 || stride_x < 1);
ERROR_IF(dilation_y < 1 || dilation_x < 1);
ERROR_IF(OH != idiv_check(IH - 1 + pad_top + pad_bottom - (KH - 1) * dilation_y, stride_y) + 1);
ERROR_IF(OW != idiv_check(IW - 1 + pad_left + pad_right - (KW - 1) * dilation_x, stride_x) + 1);
ERROR_IF(BC != C*M && BC != 1);
for_each(0 <= n < N, 0 <= oy < OH, 0 <= ox < OW, 0 <= c < C, 0 <= m < M) {
    out_t acc = 0;
    index_t iy = oy * stride_y - pad_top;
    index_t ix = ox * stride_x - pad_left;
    for_each(0 <= ky < KH, 0 <= kx < KW) {
        index_t y = iy + ky * dilation_y;
        index_t x = ix + kx * dilation_x;
        if (0 <= y < IH && 0 <= x < IW) {
            out_t value = static_cast<out_t>(tensor_read<in_t>(input, [N,IH,IW,C], [n,y,x,c]));
            out_t weight = static_cast<out_t>(tensor_read<weight_t>(weight, [KH,KW,C,M], [ky,kx,c,m]));
            value = apply_sub_s<out_t>(value, static_cast<out_t>(input_zp));
            weight = apply_sub_s<out_t>(weight, static_cast<out_t>(weight_zp));
            acc = apply_add_s<out_t>(acc, apply_mul_s<out_t>(value, weight));
        }
    }
    acc = apply_add_s<out_t>(acc, bias[(BC == 1) ? 0 : (c * M) + m]);
    tensor_write<out_t>(output, [N,OH,OW,C * M], [n,oy,ox,c * M + m], acc);
}
2.3.6. FFT2D
Performs a batched complex 2D Fast Fourier Transform over the input. The complex input values are constructed from the corresponding values in the input_real and input_imag tensors. The resulting values in the output are split into the output_real and output_imag tensors. No normalization is applied on either the forward or inverse versions of the operation.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input_real | [N,H,W] | 3 | Real part of the complex input. H,W must be powers of two. |
Input | T<in_out_t> | input_imag | [N,H,W] | 3 | Imaginary part of the complex input. H,W must be powers of two. |
Attribute | T<bool_t> | inverse | - | 0 | false for forward FFT, true for inverse FFT |
Output | T<in_out_t> | output_real | [N,H,W] | 3 | Real part of the complex output. |
Attribute | T<bool_t> | local_bound | - | 0 | This optional attribute affects the floating-point compliance error bound. The default of false allows for direct and transform based, fast convolution algorithms. Only set to true if direct dot-product calculation precision is required. |
Output | T<in_out_t> | output_imag | [N,H,W] | 3 | Imaginary part of the complex output. |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(H <= MAX_KERNEL);
LEVEL_CHECK(W <= MAX_KERNEL);
ERROR_IF(!power_of_two(H));
ERROR_IF(!power_of_two(W));
float sign_val = 1.0;
if (inverse) {
    sign_val = -1.0;
}
for_each(0 <= n < N, 0 <= oy < H, 0 <= ox < W) {
    in_out_t sum_real = 0.0;
    in_out_t sum_imag = 0.0;
    for_each(0 <= iy < H, 0 <= ix < W) {
        in_out_t val_real = tensor_read<in_out_t>(input_real, [N,H,W], [n,iy,ix]);
        in_out_t val_imag = tensor_read<in_out_t>(input_imag, [N,H,W], [n,iy,ix]);
        float_t a = sign_val * 2 * pi() * ((iy * oy) / H + (ix * ox) / W);
        sum_real += val_real * cos(a) + val_imag * sin(a);
        sum_imag += -val_real * sin(a) + val_imag * cos(a);
    }
    tensor_write<in_out_t>(output_real, [N,H,W], [n,oy,ox], sum_real);
    tensor_write<in_out_t>(output_imag, [N,H,W], [n,oy,ox], sum_imag);
}
2.3.7. FULLY_CONNECTED
Performs a fully connected network layer.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input | [N,IC] | 2 | Input tensor |
Input | T<weight_t> | weight | [OC,IC] | 2 | Weights |
Input | T<out_t> | bias | [BC] | 1 | Per output channel bias data. |
Attribute | T<in_t> | input_zp | - | 0 | Input tensor zero point. Must be zero for non-int8 types. |
Attribute | T<weight_t> | weight_zp | - | 0 | Weight zero point. Must be zero for non-int8 types. |
Output | T<out_t> | output | [N,OC] | 2 | Output tensor |
Supported Data Types:
Profile | Mode | in_t | weight_t | out_t |
---|---|---|---|---|
Any | signed 8x8 with int32 accumulate | i8_t | i8_t | i32_t |
Any | signed 8x4 with int32 accumulate | i8_t | i4_t | i32_t |
Any | signed 16x8 with int48 accumulate | i16_t | i8_t | i48_t |
MI, MT | fp16 with fp16 accumulate | fp16_t | fp16_t | fp16_t |
MI, MT | fp16 with fp32 accumulate | fp16_t | fp16_t | fp32_t |
MI, MT | bf16 with fp32 accumulate | bf16_t | bf16_t | fp32_t |
MI, MT | fp32 with fp32 accumulate | fp32_t | fp32_t | fp32_t |
Operation Function:
ERROR_IF(in_t != i8_t && input_zp != 0); // Zero point only for int8_t
ERROR_IF(weight_t != i8_t && weight_zp != 0);
ERROR_IF(BC != OC && BC != 1);
for_each(0 <= n < N, 0 <= oc < OC) {
    out_t acc = 0;
    for_each(0 <= ic < IC) {
        out_t value = static_cast<out_t>(tensor_read<in_t>(input, [N,IC], [n,ic]));
        out_t weight = static_cast<out_t>(tensor_read<weight_t>(weight, [OC,IC], [oc,ic]));
        value = apply_sub_s<out_t>(value, static_cast<out_t>(input_zp));
        weight = apply_sub_s<out_t>(weight, static_cast<out_t>(weight_zp));
        acc = apply_add_s<out_t>(acc, apply_mul_s<out_t>(value, weight));
    }
    acc = apply_add_s<out_t>(acc, bias[(BC == 1) ? 0 : oc]);
    tensor_write<out_t>(output, [N,OC], [n,oc], acc);
}
2.3.8. MATMUL
Performs two dimensional matrix multiplications. This allows both inputs to be activations, rather than reserving weights as an attribute in the FULLY_CONNECTED operator.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | A | [N,H,C] | 3 | Input tensor A, N matrices of size HxC |
Input | T<in_t> | B | [N,C,W] | 3 | Input tensor B, N matrices of size CxW |
Attribute | T<in_t> | A_zp | - | 0 | Input tensor A zero point. Must be zero for non-int8 types. |
Attribute | T<in_t> | B_zp | - | 0 | Input tensor B zero point. Must be zero for non-int8 types. |
Output | T<out_t> | output | [N,H,W] | 3 | Output tensor, N matrices of size HxW |
Supported Data Types:
Profile | Mode | in_t | out_t |
---|---|---|---|
Any | signed 8x8 with int32 accumulate | i8_t | i32_t |
Any | signed 16x16 with int48 accumulate | i16_t | i48_t |
MI, MT | fp16 with fp16 accumulate | fp16_t | fp16_t |
MI, MT | fp16 with fp32 accumulate | fp16_t | fp32_t |
MI, MT | bf16 with fp32 accumulate | bf16_t | fp32_t |
MI, MT | fp32 with fp32 accumulate | fp32_t | fp32_t |
Operation Function:
ERROR_IF(in_t != i8_t && (A_zp != 0 || B_zp != 0)); // Zero point only for int8_t
for_each(0 <= n < N, 0 <= h < H, 0 <= w < W) {
    out_t acc = 0;
    for_each(0 <= c < C) {
        out_t value1 = static_cast<out_t>(tensor_read<in_t>(A, [N,H,C], [n,h,c]));
        out_t value2 = static_cast<out_t>(tensor_read<in_t>(B, [N,C,W], [n,c,w]));
        value1 = apply_sub_s<out_t>(value1, static_cast<out_t>(A_zp));
        value2 = apply_sub_s<out_t>(value2, static_cast<out_t>(B_zp));
        acc = apply_add_s<out_t>(acc, apply_mul_s<out_t>(value1, value2));
    }
    tensor_write<out_t>(output, [N,H,W], [n,h,w], acc);
}
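Stripped of the tensor-access helpers, one batch of the loop above is a plain zero-point-corrected matrix multiply. This non-normative sketch (matmul_i8 is an illustrative name) widens each int8 operand to int32 before subtracting its zero point, as the signed 8x8 mode requires.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Non-normative sketch of one batch of quantized MATMUL: A is H x C,
// B is C x W, both row-major; each operand is widened to int32 and has
// its zero point subtracted before the multiply-accumulate.
std::vector<int32_t> matmul_i8(const std::vector<int8_t>& A,
                               const std::vector<int8_t>& B,
                               int H, int C, int W,
                               int32_t A_zp, int32_t B_zp) {
    std::vector<int32_t> out(H * W);
    for (int h = 0; h < H; h++) {
        for (int w = 0; w < W; w++) {
            int32_t acc = 0;
            for (int c = 0; c < C; c++) {
                int32_t a = static_cast<int32_t>(A[h * C + c]) - A_zp;
                int32_t b = static_cast<int32_t>(B[c * W + w]) - B_zp;
                acc += a * b;
            }
            out[h * W + w] = acc;
        }
    }
    return out;
}
```

With both zero points at 0, a 1x2 by 2x1 product {1,2}·{3,4} accumulates 11; setting A_zp to 1 shifts the A operands to {0,1} and the accumulator to 4.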
2.3.9. MAX_POOL2D
This performs a max pooling over the given input tensor. A sliding window of the size given by the kernel attribute is passed over the input tensor, with the maximum value being written to the output tensor.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | [N,IH,IW,C] | 4 | Input tensor 4D |
Attribute | T<i32_t> | kernel | [2] | 1 | [kernel_y, kernel_x] |
Attribute | T<i32_t> | stride | [2] | 1 | [stride_y, stride_x] |
Attribute | T<i32_t> | pad | [4] | 1 | [pad_top, pad_bottom, pad_left, pad_right] |
Output | T<in_out_t> | output | [N,OH,OW,C] | 4 | Output tensor 4D |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(kernel_y <= MAX_KERNEL);
LEVEL_CHECK(kernel_x <= MAX_KERNEL);
LEVEL_CHECK(stride_y <= MAX_STRIDE);
LEVEL_CHECK(stride_x <= MAX_STRIDE);
LEVEL_CHECK(pad_top <= MAX_KERNEL);
LEVEL_CHECK(pad_bottom <= MAX_KERNEL);
LEVEL_CHECK(pad_left <= MAX_KERNEL);
LEVEL_CHECK(pad_right <= MAX_KERNEL);
ERROR_IF(kernel_y < 1 || kernel_x < 1); // kernel size must be >= 1
ERROR_IF(stride_y < 1 || stride_x < 1);
ERROR_IF(pad_top < 0 || pad_bottom < 0 || pad_left < 0 || pad_right < 0);
// Padding must be less than kernel size, otherwise no
// input values will be used.
ERROR_IF(pad_right >= kernel_x || pad_left >= kernel_x);
ERROR_IF(pad_top >= kernel_y || pad_bottom >= kernel_y);
ERROR_IF(OH != idiv_check(IH + pad_top + pad_bottom - kernel_y, stride_y) + 1);
ERROR_IF(OW != idiv_check(IW + pad_left + pad_right - kernel_x, stride_x) + 1);
for_each(0 <= n < N, 0 <= oy < OH, 0 <= ox < OW, 0 <= c < C) {
    in_out_t acc = minimum_s<in_out_t>;
    index_t iy = oy * stride_y - pad_top;
    index_t ix = ox * stride_x - pad_left;
    for_each(0 <= ky < kernel_y, 0 <= kx < kernel_x) {
        index_t y = iy + ky;
        index_t x = ix + kx;
        if (0 <= y < IH && 0 <= x < IW) {
            in_out_t value = tensor_read<in_out_t>(input, [N,IH,IW,C], [n,y,x,c]);
            acc = apply_max_s<in_out_t>(acc, value);
        }
    }
    tensor_write<in_out_t>(output, [N,OH,OW,C], [n,oy,ox,c], acc);
}
2.3.10. RFFT2D
Performs a batched 2D real-valued Fast Fourier Transform over the input where the input tensor consists of real values producing complex valued output. The complex output values will be split into the output_real and output_imag tensor arguments. RFFT2D takes advantage of Hermitian symmetry to only calculate the first half of the final output axis. Imaginary values with locations (0,0), (0,W/2), (H/2,0) and (H/2,W/2) are zero.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | [N,H,W] | 3 | Real input. H,W must be powers of two. |
Output | T<in_out_t> | output_real | [N,H,W/2 + 1] | 3 | Real part of the complex output |
Output | T<in_out_t> | output_imag | [N,H,W/2 + 1] | 3 | Imaginary part of the complex output. |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(H <= MAX_KERNEL);
LEVEL_CHECK(W <= MAX_KERNEL);
ERROR_IF(!power_of_two(H));
ERROR_IF(!power_of_two(W));
for_each(0 <= n < N, 0 <= oy < H, 0 <= ox < W/2 + 1) {
in_out_t sum_real = 0.0;
in_out_t sum_imag = 0.0;
for_each(0 <= iy < H, 0 <= ix < W) {
in_out_t val_real = tensor_read<in_out_t>(input, [N,H,W], [n,iy,ix]);
fp64_t a = 2 * pi() * ((iy * oy) / H + (ix * ox) / W);
sum_real += val_real * cos(a);
sum_imag += -val_real * sin(a);
}
tensor_write<in_out_t>(output_real, [N,H,W/2 + 1], [n,oy,ox], sum_real);
tensor_write<in_out_t>(output_imag, [N,H,W/2 + 1], [n,oy,ox], sum_imag);
}
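A minimal sketch of the RFFT2D loop above (naive direct DFT in plain Python, illustrative names, nothing here is normative): only the first W/2 + 1 output columns are produced, per the Hermitian symmetry noted earlier.

```python
import math

def rfft2d(x):
    # x: nested list [N][H][W] of real values; returns (out_real, out_imag),
    # each of shape [N][H][W//2 + 1], matching the pseudocode loop bounds.
    N, H, W = len(x), len(x[0]), len(x[0][0])
    out_re = [[[0.0] * (W // 2 + 1) for _ in range(H)] for _ in range(N)]
    out_im = [[[0.0] * (W // 2 + 1) for _ in range(H)] for _ in range(N)]
    for n in range(N):
        for oy in range(H):
            for ox in range(W // 2 + 1):
                sr = si = 0.0
                for iy in range(H):
                    for ix in range(W):
                        a = 2 * math.pi * ((iy * oy) / H + (ix * ox) / W)
                        v = x[n][iy][ix]
                        sr += v * math.cos(a)
                        si += -v * math.sin(a)
                out_re[n][oy][ox] = sr
                out_im[n][oy][ox] = si
    return out_re, out_im
```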
2.3.11. TRANSPOSE_CONV2D
Performs a 2D transposed convolution over the given tensor input, using the weights tensor.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input | [N,IH,IW,IC] | 4 | Input tensor |
Input | T<weight_t> | weight | [OC,KH,KW,IC] | 4 | Weight kernel size KH x KW |
Input | T<out_t> | bias | [BC] | 1 | Per output channel bias data. |
Attribute | T<i32_t> | out_pad | [4] | 1 | [out_pad_top, out_pad_bottom, out_pad_left, out_pad_right] |
Attribute | T<i32_t> | stride | [2] | 1 | [stride_y, stride_x] |
Attribute | T<i32_t> | out_shape | [4] | 1 | [N,OH,OW,OC] |
Attribute | T<in_t> | input_zp | - | 0 | Input tensor zero point. Must be zero for non-int8 types. |
Attribute | T<weight_t> | weight_zp | - | 0 | Weight zero point. Must be zero for non-int8 types. |
Attribute | T<bool_t> | local_bound | - | 0 | This optional attribute affects the floating-point compliance error bound. The default of false allows for direct and transform based, fast convolution algorithms. Only set to true if direct dot-product calculation precision is required. |
Output | T<out_t> | output | [N,OH,OW,OC] | 4 | Output tensor |
Supported Data Types:
Profile | Mode | in_t | weight_t | out_t |
---|---|---|---|---|
Any | signed 8x8 with int32 accumulate | i8_t | i8_t | i32_t |
Any | signed 8x4 with int32 accumulate | i8_t | i4_t | i32_t |
Any | signed 16x8 with int48 accumulate | i16_t | i8_t | i48_t |
MI, MT | fp16 with fp16 accumulate | fp16_t | fp16_t | fp16_t |
MI, MT | fp16 with fp32 accumulate | fp16_t | fp16_t | fp32_t |
MI, MT | bf16 with fp32 accumulate | bf16_t | bf16_t | fp32_t |
MI, MT | fp32 with fp32 accumulate | fp32_t | fp32_t | fp32_t |
Operation Function:
LEVEL_CHECK(KH <= MAX_KERNEL);
LEVEL_CHECK(KW <= MAX_KERNEL);
LEVEL_CHECK(out_pad_top <= MAX_KERNEL);
LEVEL_CHECK(out_pad_bottom <= MAX_KERNEL);
LEVEL_CHECK(out_pad_left <= MAX_KERNEL);
LEVEL_CHECK(out_pad_right <= MAX_KERNEL);
LEVEL_CHECK(stride_y <= MAX_STRIDE);
LEVEL_CHECK(stride_x <= MAX_STRIDE);
ERROR_IF(in_t != i8_t && input_zp != 0); // Zero point only allowed for int8_t
ERROR_IF(weight_t != i8_t && weight_zp != 0);
ERROR_IF(out_pad_top <= -KH || out_pad_bottom <= -KH);
ERROR_IF(out_pad_left <= -KW || out_pad_right <= -KW);
ERROR_IF(stride_y < 1 || stride_x < 1);
ERROR_IF(OH != (IH - 1) * stride_y + out_pad_top + out_pad_bottom + KH);
ERROR_IF(OW != (IW - 1) * stride_x + out_pad_left + out_pad_right + KW);
ERROR_IF(BC != OC && BC != 1);
for_each(index in out_shape) {
tensor_write<out_t>(output, [N,OH,OW,OC], index, bias[(BC == 1) ? 0 : index[3]])
}
for_each(0 <= n < N, 0 <= iy < IH, 0 <= ix < IW, 0 <= oc < OC,
0 <= ic < IC, 0 <= ky < KH, 0 <= kx < KW) {
index_t oy = iy * stride_y + out_pad_top + ky;
index_t ox = ix * stride_x + out_pad_left + kx;
if (oy >= 0 && oy < OH && ox >= 0 && ox < OW) {
out_t acc = static_cast<out_t>(tensor_read<out_t>(output, [N,OH,OW,OC], [n,oy,ox,oc]));
out_t value = static_cast<out_t>(tensor_read<in_t>(input, [N,IH,IW,IC], [n,iy,ix,ic]));
out_t weight = static_cast<out_t>(tensor_read<weight_t>(weight, [OC,KH,KW,IC], [oc,ky,kx,ic]));
value = apply_sub_s<out_t>(value, static_cast<out_t>(input_zp));
weight = apply_sub_s<out_t>(weight, static_cast<out_t>(weight_zp));
acc = apply_add_s<out_t>(acc, apply_mul_s<out_t>(value, weight));
tensor_write<out_t>(output, [N,OH,OW,OC], [n,oy,ox,oc], acc);
}
}
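The TRANSPOSE_CONV2D pseudocode above is scatter-based: the output is first filled with the bias, then each input element is multiplied by the kernel and accumulated at strided output positions. A rough Python sketch under those assumptions (illustrative names, zero points omitted):

```python
def transpose_conv2d(input, weight, bias, stride, out_pad, out_shape):
    # input: [N][IH][IW][IC]; weight: [OC][KH][KW][IC]; bias: [BC] with
    # BC == OC or BC == 1; out_shape: [N, OH, OW, OC]. Sketch only.
    N, IH = len(input), len(input[0])
    IW, IC = len(input[0][0]), len(input[0][0][0])
    OC, KH, KW = len(weight), len(weight[0]), len(weight[0][0])
    sy, sx = stride
    pt, _, pl, _ = out_pad
    _, OH, OW, _ = out_shape
    # Initialize every output element with the (possibly broadcast) bias.
    out = [[[[bias[oc if len(bias) > 1 else 0] for oc in range(OC)]
             for _ in range(OW)] for _ in range(OH)] for _ in range(N)]
    for n in range(N):
        for iy in range(IH):
            for ix in range(IW):
                for oc in range(OC):
                    for ic in range(IC):
                        for ky in range(KH):
                            for kx in range(KW):
                                oy = iy * sy + pt + ky
                                ox = ix * sx + pl + kx
                                if 0 <= oy < OH and 0 <= ox < OW:
                                    out[n][oy][ox][oc] += (
                                        input[n][iy][ix][ic] * weight[oc][ky][kx][ic])
    return out
```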
2.4. Activation Functions
2.4.1. CLAMP
Clamp to an arbitrary minimum and maximum value. Maximum and minimum values are specified as values in the range of the input type. No zero point subtraction is done to the values, thus to clamp to the zero point value, the zero point itself should be supplied as the minimum value.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape | 0 to MAX_RANK | Input tensor |
Attribute | T<in_out_t> | min_val | - | 0 | Minimum clip value |
Attribute | T<in_out_t> | max_val | - | 0 | Maximum clip value |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type and shape as input |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(max_val < min_val);
for_each(index in shape) {
in_out_t value = tensor_read<in_out_t>(input, shape, index);
value = apply_clip<in_out_t>(value, min_val, max_val);
tensor_write<in_out_t>(output, shape, index, value);
}
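The CLAMP operation above reduces to an elementwise min/max; a trivial sketch in plain Python (illustrative, not normative):

```python
def clamp(values, min_val, max_val):
    # Mirrors ERROR_IF(max_val < min_val) followed by apply_clip per element.
    assert not max_val < min_val
    return [min(max(v, min_val), max_val) for v in values]
```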
2.4.2. ERF
Error function:
For quantized integer data types, the TABLE operator should be used instead with the following definition.
The ERF table has 513 entries each of 16-bit precision and covering the input range -4.0 to +4.0 in steps of 1/64.
int16_t erf_reference(int16_t x) { // input x range is -256 to +256 inclusive
fp64_t v = (fp64_t)x / (fp64_t)64;
v = erf(v);
return round_to_nearest_int(32768.0 * v);
}
generate_lookup_table(&erf_table, &erf_reference);
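The table generation above can be reproduced in plain Python with `math.erf` (a sketch; note Python's `round` uses round-half-even, which may differ from the specification's round_to_nearest_int on exact ties):

```python
import math

def erf_reference(x):
    # x in [-256, 256] maps to the real range [-4.0, 4.0] in steps of 1/64,
    # scaled to a signed 16-bit-style fixed-point result.
    v = math.erf(x / 64.0)
    return round(32768.0 * v)  # tie-breaking may differ from the spec

erf_table = [erf_reference(x) for x in range(-256, 257)]  # 513 entries
```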
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type and shape as input |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
2.4.3. SIGMOID
Applies the sigmoid logistic function to each element of the input tensor.
For quantized integer data types, the TABLE operator should be used instead. Each implementation may choose an appropriate TABLE given the scale and zero point of the input data. Eight or sixteen bit precision tables may be used based on the input tensor to the sigmoid function. Below we give an example table generation for 16-bit sigmoid. This sigmoid table has 513 entries each of 16-bit precision and covering the input range -16.0 to +16.0 in steps of 1/16.
int16_t sigmoid_reference(int16_t x) { // input x range is -256 to + 256 inclusive
fp64_t v = (fp64_t)x / (fp64_t)16;
v = 1.0/(1.0 + exp(-v));
return round_to_nearest_int(32768.0 * v);
}
generate_lookup_table(&sigmoid_table, &sigmoid_reference);
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type and shape as input |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input, shape, index);
in_out_t value = sigmoid<in_out_t>(value1);
tensor_write<in_out_t>(output, shape, index, value);
}
2.4.4. TANH
Parameterized hyperbolic tangent.
For quantized integer data types, the TABLE operator should be used instead. Each implementation may choose an appropriate TABLE given the scale and zero point of the input data. Eight or sixteen bit precision tables may be used based on the input tensor to the tanh function. Below we give an example table generation for 16-bit hyperbolic tangent. This tanh_table has 513 entries each of 16-bit precision and covering the input range -8.0 to +8.0 in steps of 1/32.
int16_t tanh_reference(int16_t x) { // input x range is -256 to +256 inclusive
fp64_t v = (fp64_t)x/(fp64_t)32;
v = exp(-2.0*v);
v = (1.0-v)/(1.0+v);
return round_to_nearest_int(32768.0 * v);
}
generate_lookup_table(&tanh_table, &tanh_reference);
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type and shape as input |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input, shape, index);
in_out_t value = tanh<in_out_t>(value1);
tensor_write<in_out_t>(output, shape, index, value);
}
2.5. Elementwise Binary Operators
2.5.1. ADD
Elementwise addition of input1 and input2. Axis of size 1 will be broadcast, as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 32 | i32_t |
Any | shape | shape_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
if (in_out_t == shape_t) {
ERROR_IF(rank(shape) != 0 || rank(shape1) != 0 || rank(shape2) != 0);
shape_t value1 = tensor_read<shape_t>(input1, [], []);
shape_t value2 = tensor_read<shape_t>(input2, [], []);
shape_t result = apply_add_s<shape_t>(value1, value2);
tensor_write<shape_t>(output, [], [], result);
} else {
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = apply_add_s<in_out_t>(value1, value2);
tensor_write<in_out_t>(output, shape, index, result);
}
}
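The broadcast_shape/apply_broadcast helpers used by ADD (and the other elementwise binary operators below) can be sketched in plain Python as follows; ranks must match, and a size-1 axis always reads element 0 (illustrative names, not normative):

```python
def broadcast_shape(s1, s2):
    # Shapes of equal rank; each dimension pair must match or one must be 1.
    assert len(s1) == len(s2)
    out = []
    for a, b in zip(s1, s2):
        assert a == b or a == 1 or b == 1
        out.append(max(a, b))
    return out

def apply_broadcast(out_shape, in_shape, index):
    # Map an output index to an input index; out_shape is kept only to
    # mirror the pseudocode signature. Broadcast (size-1) axes read index 0.
    return [0 if d == 1 else i for d, i in zip(in_shape, index)]
```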
2.5.2. ARITHMETIC_RIGHT_SHIFT
Elementwise arithmetic right shift of input1 by the amount specified in input2. Axis of size 1 will be broadcast, as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Attribute | T<bool_t> | round | - | 0 | If true then the shift is rounded |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
// Ensure that shift amount is appropriate for the data type
REQUIRE((in_out_t == i32_t && 0 <= value2 && value2 <= 31) ||
(in_out_t == i16_t && 0 <= value2 && value2 <= 15) ||
(in_out_t == i8_t && 0 <= value2 && value2 <= 7));
in_out_t result = apply_arith_rshift<in_out_t>(value1, value2);
if (round == true && static_cast<int32_t>(value2) > 0 &&
((apply_arith_rshift<in_out_t>(value1, apply_sub_s<in_out_t>(value2, 1)) & 1) != 0)) {
result = result + 1;
}
result = apply_clip_s<in_out_t>(result, minimum_s<in_out_t>, maximum_s<in_out_t>);
tensor_write<in_out_t>(output, shape, index, result);
}
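The rounded arithmetic shift above inspects the last bit shifted out (bit value2 - 1) and, if set, adds one to the truncated result. Python integers already shift arithmetically for negative values, so a sketch is short (illustrative only):

```python
def arith_rshift_round(value1, value2, round_=False, bits=32):
    # Arithmetic right shift with the optional rounding step from the
    # pseudocode above. value2 must fit the data-type REQUIRE range.
    assert 0 <= value2 <= bits - 1
    result = value1 >> value2
    if round_ and value2 > 0 and ((value1 >> (value2 - 1)) & 1) != 0:
        result += 1
    return result
```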
2.5.3. BITWISE_AND
Elementwise bitwise AND of input1 and input2. Axis of size 1 will be broadcast as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = value1 & value2;
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.4. BITWISE_OR
Elementwise bitwise OR of input1 and input2. Axis of size 1 will be broadcast as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = value1 | value2;
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.5. BITWISE_XOR
Elementwise bitwise XOR of input1 and input2. Axis of size 1 will be broadcast as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = value1 ^ value2;
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.6. INTDIV
Elementwise integer divide of input1 by input2. The result of the divide is truncated towards zero. Expected use is for operations on non-scaled integers. Floating point divide should use RECIPROCAL and MUL. Quantized integer divide should use TABLE (for 1/x) and MUL.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 32 | i32_t |
Any | shape | shape_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
if (in_out_t == shape_t) {
ERROR_IF(rank(shape) != 0 || rank(shape1) != 0 || rank(shape2) != 0);
shape_t value1 = tensor_read<shape_t>(input1, [], []);
shape_t value2 = tensor_read<shape_t>(input2, [], []);
REQUIRE(value2 != 0);
shape_t result = value1 / value2;
tensor_write<shape_t>(output, [], [], result);
} else {
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
REQUIRE(value2 != 0);
// This catches the case where we divide minimum<in_out_t> by -1
// which is not representable in two's complement
REQUIRE(static_cast<int64_t>(value1) / static_cast<int64_t>(value2) <= maximum_s<in_out_t>);
in_out_t result = apply_intdiv_s<in_out_t>(value1, value2);
tensor_write<in_out_t>(output, shape, index, result);
}
}
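Note that INTDIV truncates toward zero, unlike Python's floor division. A sketch of the integer branch, including the overflow guard for INT_MIN / -1 (illustrative, not normative):

```python
def intdiv(value1, value2, int_min=-2**31, int_max=2**31 - 1):
    # Truncate-toward-zero division with the REQUIRE checks from the
    # pseudocode above (divide by zero, and the unrepresentable quotient).
    assert value2 != 0
    q = abs(value1) // abs(value2)
    result = -q if (value1 < 0) != (value2 < 0) else q
    assert int_min <= result <= int_max  # catches INT_MIN / -1
    return result
```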
2.5.7. LOGICAL_AND
Elementwise logical AND of input1 and input2. Axis of size 1 will be broadcast, as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = value1 && value2;
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.8. LOGICAL_LEFT_SHIFT
Elementwise logical left shift of input1 by the amount specified in input2. Axis of size 1 will be broadcast, as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
REQUIRE(0 <= value2 && value2 <= 31);
in_out_t result = value1 << value2;
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.9. LOGICAL_RIGHT_SHIFT
Elementwise logical right shift of input1 by the amount specified in input2. Axis of size 1 will be broadcast, as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
REQUIRE(0 <= static_cast<int32_t>(value2) && static_cast<int32_t>(value2) <= 31);
// Logical shifts happen as unsigned types internally
in_out_t result = apply_logical_rshift<in_out_t>(value1, value2);
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.10. LOGICAL_OR
Elementwise logical OR of input1 and input2. Axis of size 1 will be broadcast as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = value1 || value2;
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.11. LOGICAL_XOR
Elementwise logical XOR of input1 and input2. Axis of size 1 will be broadcast as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = value1 != value2;
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.12. MAXIMUM
Elementwise max of input1 and input2. Axis of size 1 will be broadcast, as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = apply_max_s<in_out_t>(value1, value2);
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.13. MINIMUM
Elementwise minimum of input1 and input2. Axis of size 1 will be broadcast, as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = apply_min_s<in_out_t>(value1, value2);
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.14. MUL
Elementwise multiplication (Hadamard product) of input1 and input2. Axis of size 1 will be broadcast, as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Input (MT profile) Attribute (BI/MI profiles) | T<i8_t> | shift | - | 0 | Result right shift (i32_t data type only) |
Output | T<out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_t | out_t |
---|---|---|---|
Any | signed 8 | i8_t | i32_t |
Any | signed 16 | i16_t | i32_t |
Any | signed 32 | i32_t | i32_t |
Any | shape | shape_t | shape_t |
MI, MT | fp16 | fp16_t | fp16_t |
MI, MT | bf16 | bf16_t | bf16_t |
MI, MT | fp32 | fp32_t | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
if (in_t == shape_t) {
ERROR_IF(rank(shape) != 0 || rank(shape1) != 0 || rank(shape2) != 0);
shape_t value1 = tensor_read<shape_t>(input1, [], []);
shape_t value2 = tensor_read<shape_t>(input2, [], []);
shape_t result = value1 * value2;
tensor_write<shape_t>(output, [], [], result);
} else {
REQUIRE(0 <= shift && shift <= 63);
REQUIRE(in_t == i32_t || shift == 0);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_t value1 = tensor_read<in_t>(input1, shape1, index1);
in_t value2 = tensor_read<in_t>(input2, shape2, index2);
out_t result;
if (in_t == i32_t && shift > 0) {
int64_t product = sign_extend<int64_t>(value1) * sign_extend<int64_t>(value2);
int64_t round = static_cast<int64_t>(1) << (shift - 1);
product = (product + round) >> shift;
REQUIRE(product >= minimum_s<i32_t> && product <= maximum_s<i32_t>);
result = product;
} else {
result = apply_mul_s(value1, value2); // low 32-bits of result for i32_t
}
tensor_write<out_t>(output, shape, index, result);
}
}
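The i32_t branch of MUL above computes a 64-bit product, adds half the rounding increment, then shifts right. A plain-Python sketch (Python integers are arbitrary precision, so the 64-bit intermediate is implicit; illustrative only):

```python
def mul_i32_shift(value1, value2, shift):
    # i32 multiply with rounding right shift, as in the i32_t branch above.
    assert 0 <= shift <= 63
    product = value1 * value2
    if shift > 0:
        product = (product + (1 << (shift - 1))) >> shift
    # Mirrors REQUIRE(product in i32 range); shift == 0 keeps low bits only
    # in the spec, which this sketch does not model.
    assert -2**31 <= product <= 2**31 - 1
    return product
```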
2.5.15. POW
Elementwise input1 value raised to the power of input2. Axis of size 1 will be broadcast, as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = apply_pow<in_out_t>(value1, value2);
tensor_write<in_out_t>(output, shape, index, result);
}
2.5.16. SUB
Elementwise subtraction of input1 and input2. Axis of size 1 will be broadcast as necessary. Rank of input tensors must match.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 32 | i32_t |
Any | shape | shape_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
if (in_out_t == shape_t) {
ERROR_IF(rank(shape) != 0 || rank(shape1) != 0 || rank(shape2) != 0);
shape_t value1 = tensor_read<shape_t>(input1, [], []);
shape_t value2 = tensor_read<shape_t>(input2, [], []);
shape_t result = apply_sub_s<shape_t>(value1, value2);
tensor_write<shape_t>(output, [], [], result);
} else {
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_out_t value1 = tensor_read<in_out_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t result = apply_sub_s<in_out_t>(value1, value2);
tensor_write<in_out_t>(output, shape, index, result);
}
}
2.5.17. TABLE
Table lookup operation. For int8_t TABLE operation, perform a 256 entry table lookup returning an int8_t value. For int16_t tables, the int16_t input is treated as a fixed-point 9.7 value. The most significant 9 bits are used to index into the table. The fractional 7 bits are used to interpolate based on table[index] and table[index+1]. For int16_t inputs, the TABLE operator returns a 16.7 interpolated value in an int32_t. This value can then be input to the RESCALE operator to scale to the required output data type. Note that int16_t table has 513 values to handle table[index+1] when index=511.
An int16_t to int16_t table lookup can be constructed in TOSA as follows:
-
Use the TABLE operator to produce a fixed point 16.7 interpolated result
-
Use RESCALE (in_t=int32_t, out_t=int16_t, scale=1<<14, shift=21) to scale the output to int16_t range (or alternate scale as required)
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input | shape | 0 to MAX_RANK | Input tensor |
Input (MT profile) Attribute (BI/MI profiles) | T<table_t> | table | [TABLE_SIZE] | 1 | Lookup table tensor |
Output | T<out_t> | output | shape | 0 to MAX_RANK | Output tensor |
Supported Data Types:
Profile | Mode | in_t | table_t | out_t | TABLE_SIZE |
---|---|---|---|---|---|
Any | signed 8 | i8_t | i8_t | i8_t | 256 |
Any | signed 16 | i16_t | i16_t | i32_t | 513 |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
REQUIRE(length(table) == TABLE_SIZE);
for_each(index in shape) {
in_t value = tensor_read<in_t>(input, shape, index);
out_t result;
if (in_t == i8_t) {
// value is a signed int, convert to a 0 based index
result = table[static_cast<int16_t>(value) + 128];
} else {
result = apply_lookup_s(table, static_cast<int16_t>(value));
}
tensor_write<out_t>(output, shape, index, result);
}
2.6. Elementwise Unary Operators
2.6.1. ABS
Elementwise absolute value operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
Floating-point behavior:
Input | -infinity | +infinity | -0 | +0 | NaN |
---|---|---|---|---|---|
Output | +infinity | +infinity | +0 | +0 | NaN |
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
if (is_floating_point(in_out_t) && value1 == -0.0) {
value1 = 0.0;
}
if (value1 < 0) {
value1 = apply_sub_s<in_out_t>(0, value1);
}
tensor_write<in_out_t>(output, shape, index, value1);
}
2.6.2. BITWISE_NOT
Elementwise bitwise NOT of input tensor.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
in_out_t result = ~value1;
tensor_write<in_out_t>(output, shape, index, result);
}
2.6.3. CEIL
Elementwise ceiling operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
Floating-point behavior:
Input | -infinity | +infinity | -0 | +0 | NaN |
---|---|---|---|---|---|
Output | -infinity | +infinity | -0 | +0 | NaN |
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
in_out_t result = apply_ceil<in_out_t>(value1);
tensor_write<in_out_t>(output, shape, index, result);
}
2.6.4. CLZ
Elementwise count leading zeros operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 32 | i32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
in_out_t result = count_leading_zeros(value1);
tensor_write<in_out_t>(output, shape, index, result);
}
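For reference, a bit-by-bit sketch of the count_leading_zeros helper on a 32-bit value (the helper name follows the pseudocode above; a real implementation would typically use a hardware instruction or compiler intrinsic):

```python
def count_leading_zeros(x):
    # Interpret x as a 32-bit two's-complement value; a set sign bit
    # (negative input) means zero leading zeros.
    x &= 0xFFFFFFFF
    count = 0
    for bit in range(31, -1, -1):
        if x & (1 << bit):
            break
        count += 1
    return count
```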
2.6.5. EXP
Elementwise e to the x operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
Floating-point behavior:
Input | -infinity | +infinity | -0 | +0 | NaN |
---|---|---|---|---|---|
Output | +0 | +infinity | 1 | 1 | NaN |
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
in_out_t result = apply_exp<in_out_t>(value1);
tensor_write<in_out_t>(output, shape, index, result);
}
2.6.6. FLOOR
Elementwise floor operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
Floating-point behavior:
Input | -infinity | +infinity | -0 | +0 | NaN |
---|---|---|---|---|---|
Output | -infinity | +infinity | -0 | +0 | NaN |
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
in_out_t result = apply_floor<in_out_t>(value1);
tensor_write<in_out_t>(output, shape, index, result);
}
2.6.7. LOG
Elementwise natural logarithm operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
Floating-point behavior:
Input | -infinity | +infinity | -0 | +0 | NaN |
---|---|---|---|---|---|
Output | NaN | +infinity | -infinity | -infinity | NaN |
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
in_out_t result = apply_log<in_out_t>(value1);
tensor_write<in_out_t>(output, shape, index, result);
}
2.6.8. LOGICAL_NOT
Elementwise logical NOT of input.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | Boolean | bool_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
in_out_t result = !value1;
tensor_write<in_out_t>(output, shape, index, result);
}
2.6.9. NEGATE
Elementwise negation operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Attribute | T<in_out_t> | input1_zp | - | 0 | Input 1 zero point. Must be zero for non-int8 types. |
Attribute | T<in_out_t> | output_zp | - | 0 | Output zero point. Must be zero for non-int8 types. |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t | acc_t |
---|---|---|---|
Any | signed 8 | i8_t | i32_t |
Any | signed 16 | i16_t | i32_t |
Any | signed 32 | i32_t | i32_t |
MI, MT | fp16 | fp16_t | fp16_t |
MI, MT | bf16 | bf16_t | bf16_t |
MI, MT | fp32 | fp32_t | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
Floating-point behavior:
Input | -infinity | +infinity | -0 | +0 | NaN |
---|---|---|---|---|---|
Output | +infinity | -infinity | +0 | -0 | NaN |
ERROR_IF(in_out_t != i8_t && input1_zp != 0) // Zero point only for int8_t
ERROR_IF(in_out_t != i8_t && output_zp != 0) // Zero point only for int8_t
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
acc_t value = apply_sub_s<acc_t>(sign_extend<acc_t>(value1),
sign_extend<acc_t>(input1_zp));
value = apply_sub_s<acc_t>(0, value);
value = apply_add_s<acc_t>(value, sign_extend<acc_t>(output_zp));
in_out_t result = truncate<in_out_t>(apply_clip_s<acc_t>(value,
minimum_s<in_out_t>,
maximum_s<in_out_t>));
tensor_write<in_out_t>(output, shape, index, result);
}
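The zero-point handling above can be sketched for the signed 8 mode as follows (a plain-Python sketch; the int8_t saturation bounds are written out where the pseudocode uses minimum_s and maximum_s):

```python
def negate_i8(value, input1_zp=0, output_zp=0):
    # acc_t is i32: remove the input zero point, negate, re-apply the
    # output zero point, then clip to the int8_t range.
    acc = value - input1_zp
    acc = 0 - acc
    acc = acc + output_zp
    return max(-128, min(127, acc))
```

Note the clip matters at the asymmetric edge of the range: negating -128 saturates to 127.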
2.6.10. RECIPROCAL
Elementwise reciprocal operation. For integer operations, a TABLE should be used with the appropriate ranges.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
Floating-point behavior:
Input | -infinity | +infinity | -0 | +0 | NaN |
---|---|---|---|---|---|
Output | -0 | +0 | -infinity | +infinity | NaN |
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
in_out_t result = 1.0 / value1;
tensor_write<in_out_t>(output, shape, index, result);
}
2.6.11. RSQRT
Elementwise reciprocal square root operation. For integer operations, a TABLE should be used with the appropriate ranges.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
Floating-point behavior:
Input | -infinity | +infinity | -0 | +0 | NaN |
---|---|---|---|---|---|
Output | NaN | +0 | -infinity | +infinity | NaN |
for_each(index in shape) {
in_out_t value1 = tensor_read<in_out_t>(input1, shape, index);
in_out_t result;
if (value1 < 0) {
result = NaN;
}
else {
result = 1.0 / apply_sqrt<in_out_t>(value1);
}
tensor_write<in_out_t>(output, shape, index, result);
}
2.7. Elementwise Ternary Operators
2.7.1. SELECT
Elementwise select of the output based on a condition.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<bool_t> | input1 | shape1 | 0 to MAX_RANK | Input selector tensor |
Input | T<in_out_t> | input2 | shape2 | 0 to MAX_RANK | Input value tensor if input1 is True |
Input | T<in_out_t> | input3 | shape3 | 0 to MAX_RANK | Input value tensor if input1 is False |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of same type as input2 and input3, with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | Boolean | bool_t |
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(broadcast_shape(shape1, shape2), shape3));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
dim_t index3 = apply_broadcast(shape, shape3, index);
bool_t value1 = tensor_read<bool_t>(input1, shape1, index1);
in_out_t value2 = tensor_read<in_out_t>(input2, shape2, index2);
in_out_t value3 = tensor_read<in_out_t>(input3, shape3, index3);
in_out_t result;
if (value1) {
result = value2;
} else {
result = value3;
}
tensor_write<in_out_t>(output, shape, index, result);
}
2.8. Comparison Operators
2.8.1. EQUAL
Elementwise equality comparison operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_t | out_t |
---|---|---|---|
Any | signed 32 | i32_t | bool_t |
MI, MT | fp16 | fp16_t | bool_t |
MI, MT | bf16 | bf16_t | bool_t |
MI, MT | fp32 | fp32_t | bool_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_t value1 = tensor_read<in_t>(input1, shape1, index1);
in_t value2 = tensor_read<in_t>(input2, shape2, index2);
out_t result;
if (isNaN(value1) || isNaN(value2))
result = False;
else
result = (value1 == value2) ? True : False;
tensor_write<out_t>(output, shape, index, result);
}
2.8.2. GREATER
Elementwise greater than comparison operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_t | out_t |
---|---|---|---|
Any | signed 32 | i32_t | bool_t |
MI, MT | fp16 | fp16_t | bool_t |
MI, MT | bf16 | bf16_t | bool_t |
MI, MT | fp32 | fp32_t | bool_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_t value1 = tensor_read<in_t>(input1, shape1, index1);
in_t value2 = tensor_read<in_t>(input2, shape2, index2);
out_t result;
if (isNaN(value1) || isNaN(value2))
result = False;
else
result = (value1 > value2) ? True : False;
tensor_write<out_t>(output, shape, index, result);
}
2.8.3. GREATER_EQUAL
Elementwise greater than or equal comparison operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input1 | shape1 | 0 to MAX_RANK | Input tensor |
Input | T<in_t> | input2 | shape2 | 0 to MAX_RANK | Input tensor with the same rank as input1 |
Output | T<out_t> | output | shape | 0 to MAX_RANK | Output tensor with broadcast shape if necessary |
Supported Data Types:
Profile | Mode | in_t | out_t |
---|---|---|---|
Any | signed 32 | i32_t | bool_t |
MI, MT | fp16 | fp16_t | bool_t |
MI, MT | bf16 | bf16_t | bool_t |
MI, MT | fp32 | fp32_t | bool_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(shape != broadcast_shape(shape1, shape2));
for_each(index in shape) {
dim_t index1 = apply_broadcast(shape, shape1, index);
dim_t index2 = apply_broadcast(shape, shape2, index);
in_t value1 = tensor_read<in_t>(input1, shape1, index1);
in_t value2 = tensor_read<in_t>(input2, shape2, index2);
out_t result;
if (isNaN(value1) || isNaN(value2))
result = False;
else
result = (value1 >= value2) ? True : False;
tensor_write<out_t>(output, shape, index, result);
}
2.9. Reduction Operators
2.9.1. REDUCE_ALL
Reduce a tensor along the given axis with a logical AND operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape1 | 1 to MAX_RANK | Input tensor |
Attribute | T<i32_t> | axis | - | 0 | Axis to reduce, in range from 0 to rank(shape1)-1 |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor. Same rank as the input tensor. |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Operation Function:
ERROR_IF(axis < 0 || axis >= rank(shape1));
ERROR_IF(shape[axis] != 1);
// Initialize output state to true
for_each(index in shape) {
tensor_write<in_out_t>(output, shape, index, true);
}
for_each(index in shape1) {
dim_t out_index = index;
out_index[axis] = 0;
in_out_t value = tensor_read<in_out_t>(input, shape1, index);
in_out_t state = tensor_read<in_out_t>(output, shape, out_index);
state = state && value;
tensor_write<in_out_t>(output, shape, out_index, state);
}
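The initialize-then-accumulate pattern above can be sketched for a rank-2 boolean input (a hypothetical rank-2-only helper, not the general operator; the same structure applies to the other reduction operators with a different identity value and accumulator):

```python
def reduce_all_2d(input1, axis):
    # Output keeps the input rank, with size 1 along the reduced axis.
    H, W = len(input1), len(input1[0])
    out_h, out_w = (1, W) if axis == 0 else (H, 1)
    # Initialize output state to true (the AND identity)
    output = [[True] * out_w for _ in range(out_h)]
    for i in range(H):
        for j in range(W):
            # Fold each input element into the output position whose
            # coordinate along the reduced axis is 0.
            oi = 0 if axis == 0 else i
            oj = 0 if axis == 1 else j
            output[oi][oj] = output[oi][oj] and input1[i][j]
    return output
```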
2.9.2. REDUCE_ANY
Reduce a tensor along the given axis with a logical OR operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape1 | 1 to MAX_RANK | Input tensor |
Attribute | T<i32_t> | axis | - | 0 | Axis to reduce, in range from 0 to rank(shape1)-1 |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor. Same rank as the input tensor. |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Operation Function:
ERROR_IF(axis < 0 || axis >= rank(shape1));
ERROR_IF(shape[axis] != 1);
// Initialize output state to false
for_each(index in shape) {
tensor_write<in_out_t>(output, shape, index, false);
}
for_each(index in shape1) {
dim_t out_index = index;
out_index[axis] = 0;
in_out_t value = tensor_read<in_out_t>(input, shape1, index);
in_out_t state = tensor_read<in_out_t>(output, shape, out_index);
state = state || value;
tensor_write<in_out_t>(output, shape, out_index, state);
}
2.9.3. REDUCE_MAX
Reduce a tensor along the given axis with a maximum operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape1 | 1 to MAX_RANK | Input tensor |
Attribute | T<i32_t> | axis | - | 0 | Axis to reduce, in range from 0 to rank(shape1)-1 |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor. Same rank as the input tensor. |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
ERROR_IF(axis < 0 || axis >= rank(shape1));
ERROR_IF(shape[axis] != 1);
for_each(index in shape) {
tensor_write<in_out_t>(output, shape, index, minimum<in_out_t>);
}
for_each(index in shape1) {
dim_t out_index = index;
out_index[axis] = 0;
in_out_t value = tensor_read<in_out_t>(input, shape1, index);
in_out_t state = tensor_read<in_out_t>(output, shape, out_index);
state = apply_max_s<in_out_t>(state, value);
tensor_write<in_out_t>(output, shape, out_index, state);
}
2.9.4. REDUCE_MIN
Reduce a tensor along the given axis with a minimum operation
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape1 | 1 to MAX_RANK | Input tensor |
Attribute | T<i32_t> | axis | - | 0 | Axis to reduce, in range from 0 to rank(shape1)-1 |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor. Same rank as the input tensor. |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
ERROR_IF(axis < 0 || axis >= rank(shape1));
ERROR_IF(shape[axis] != 1);
for_each(index in shape) {
tensor_write<in_out_t>(output, shape, index, maximum<in_out_t>);
}
for_each(index in shape1) {
dim_t out_index = index;
out_index[axis] = 0;
in_out_t value = tensor_read<in_out_t>(input, shape1, index);
in_out_t state = tensor_read<in_out_t>(output, shape, out_index);
state = apply_min_s<in_out_t>(state, value);
tensor_write<in_out_t>(output, shape, out_index, state);
}
2.9.5. REDUCE_PRODUCT
Reduce a tensor along the given axis by computing the product of the axis.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape1 | 1 to MAX_RANK | Input tensor |
Attribute | T<i32_t> | axis | - | 0 | Axis to reduce, in range from 0 to rank(shape1)-1 |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor. Same rank as the input tensor. |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
ERROR_IF(axis < 0 || axis >= rank(shape1));
ERROR_IF(shape[axis] != 1);
for_each(index in shape) {
tensor_write<in_out_t>(output, shape, index, 1.0);
}
for_each(index in shape1) {
dim_t out_index = index;
out_index[axis] = 0;
in_out_t value = tensor_read<in_out_t>(input, shape1, index);
in_out_t state = tensor_read<in_out_t>(output, shape, out_index);
state = apply_mul_s<in_out_t>(state, value);
tensor_write<in_out_t>(output, shape, out_index, state);
}
2.9.6. REDUCE_SUM
Reduce a tensor along the given axis by computing the sum of the axis.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape1 | 1 to MAX_RANK | Input tensor |
Attribute | T<i32_t> | axis | - | 0 | Axis to reduce, in range from 0 to rank(shape1)-1 |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor. Same rank as the input tensor. |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
ERROR_IF(axis < 0 || axis >= rank(shape1));
ERROR_IF(shape[axis] != 1);
for_each(index in shape) {
tensor_write<in_out_t>(output, shape, index, 0);
}
for_each(index in shape1) {
dim_t out_index = index;
out_index[axis] = 0;
in_out_t value = tensor_read<in_out_t>(input, shape1, index);
in_out_t state = tensor_read<in_out_t>(output, shape, out_index);
state = apply_add_s<in_out_t>(state, value);
tensor_write<in_out_t>(output, shape, out_index, state);
}
2.10. Data Layout
2.10.1. CONCAT
Concatenate a list of tensors along a given axis. No data conversion happens during a concat operation.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | tensor_list_t<T<in_out_t>> | input1 | shapes1 | 0 to MAX_RANK | List of input tensors. All inputs must have the same rank and data type |
Attribute | T<i32_t> | axis | - | 0 | Axis along which concatenation is to occur, in range from 0 to rank(shape)-1 |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
Any | shape | shape_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(axis < 0 || axis >= max(1,rank(shapes1[0])));
ERROR_IF(shape[axis] != sum(shape_dim(shapes1[k], axis) for all k));
ERROR_IF(in_out_t == shape_t && rank(shape) > 1);
// The following checks ensure all inputs are compatible for concatenation
for_each(input_shape in shapes1) {
ERROR_IF(rank(input_shape) != rank(shapes1[0]));
for_each(index in input_shape) {
ERROR_IF(index != axis && input_shape[index] != shapes1[0][index]);
}
}
for_each(index1 in shape) {
dim_t index2 = index1;
for (tensor t = 0; t < length(input1); t++) {
// Continue to concatenate along axis from each tensor
// For each output location, we are looking for the
// appropriate input tensor
if (index2[axis] >= 0 && index2[axis] < shape_dim(shapes1[t], axis)) {
in_out_t value = tensor_read<in_out_t>(input1[t], shapes1[t], index2);
tensor_write<in_out_t>(output, shape, index1, value);
}
index2[axis] = index2[axis] - shape_dim(shapes1[t], axis);
}
}
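The index walk in the loop above, which locates for each output position the input tensor that covers it along the axis, can be sketched in one dimension (hypothetical helper):

```python
def concat_1d(tensors):
    # For each output position, walk through the inputs, decrementing the
    # axis coordinate by each tensor's extent until it lands in range.
    total = sum(len(t) for t in tensors)
    output = [None] * total
    for index1 in range(total):
        index2 = index1
        for t in tensors:
            if 0 <= index2 < len(t):
                output[index1] = t[index2]
            index2 -= len(t)
    return output
```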
2.10.2. PAD
Pads a tensor along the borders of each dimension with a supplied value. Returns a new tensor with the padding included. The pad_const value includes the zero point if the tensor uses a zero point.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 1 to MAX_RANK | Input tensor |
Input | T<shape_t> | padding | [rank(shape1),2] | 2 | Number of pad elements at the start and end of each dimension |
Attribute | T<in_out_t> | pad_const | - | 0 | Constant value to be used as padding |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor of same type as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
// Check output shape matches the padded input shape
ERROR_IF(rank(shape) != rank(shape1));
for (i = 0; i < rank(shape); i++) {
ERROR_IF(padding[i,0] < 0 || padding[i,1] < 0);
ERROR_IF(shape[i] != padding[i, 0] + shape1[i] + padding[i, 1]);
}
for_each(index in shape) {
dim_t index1 = index;
bool_t is_pad = false;
for(i = 0; i < rank(shape); i++) {
index1[i] = index1[i] - padding[i,0];
if (index1[i] < 0 || index1[i] >= shape1[i]) {
is_pad = true;
}
}
in_out_t value = is_pad ? pad_const : tensor_read<in_out_t>(input1, shape1, index1);
tensor_write<in_out_t>(output, shape, index, value);
}
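A rank-2 sketch of the padding index math (hypothetical helper; padding is given as [[before, after], ...] per dimension, matching the [rank,2] layout above):

```python
def pad_2d(input1, padding, pad_const):
    H, W = len(input1), len(input1[0])
    out_h = padding[0][0] + H + padding[0][1]
    out_w = padding[1][0] + W + padding[1][1]
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Map the output index back into the input; out-of-range
            # positions take the pad constant.
            i1, j1 = i - padding[0][0], j - padding[1][0]
            in_bounds = 0 <= i1 < H and 0 <= j1 < W
            row.append(input1[i1][j1] if in_bounds else pad_const)
        output.append(row)
    return output
```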
2.10.3. DIM
Returns a rank 0 tensor of the size of the input tensor for the given axis.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input1 | shape | 1 to MAX_RANK | Input tensor |
Attribute | T<i32_t> | axis | - | 0 | Axis in range from 0 to rank(shape) - 1 |
Output | T<shape_t> | output | - | 0 | Output rank 0 tensor giving the size of the shape for the given axis |
Supported Data Types:
Profile | Mode | in_t |
---|---|---|
Any | boolean | bool_t |
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(axis < 0 || axis >= rank(shape));
tensor_write<shape_t>(output, [], [], shape_dim(shape, axis));
2.10.4. RESHAPE
Returns a tensor with the same type/values as the input, with a new shape specified by the shape argument. Reshape may operate on tensors of any rank. No data conversion happens during a reshape operation.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 1 to MAX_RANK | Input tensor |
Input | T<shape_t> | shape | [rank(shape)] | 1 | 1D shape tensor giving the new shape. |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor of same type, size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape1) <= MAX_RANK);
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(tensor_size(shape1) != tensor_size(shape));
for_each(index in shape) {
// Calculate flattened index for the output location (index)
size_t offset = tensor_index_to_offset(shape, index);
// Now convert to the location in the input
dim_t tmp_index = tensor_offset_to_index(shape1, offset);
// Now read/write the value
in_out_t val = tensor_read<in_out_t>(input1, shape1, tmp_index);
tensor_write<in_out_t>(output, shape, index, val);
}
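The flattened-offset round trip that RESHAPE relies on can be written out as follows (possible definitions of the tensor_index_to_offset and tensor_offset_to_index helpers, assuming row-major layout):

```python
def tensor_index_to_offset(shape, index):
    # Row-major flattening: offset = (...(i0 * d1 + i1) * d2 + i2)...
    offset = 0
    for dim, i in zip(shape, index):
        offset = offset * dim + i
    return offset

def tensor_offset_to_index(shape, offset):
    # Inverse of the above: peel dimensions off from the innermost.
    index = []
    for dim in reversed(shape):
        index.append(offset % dim)
        offset //= dim
    return list(reversed(index))
```

Because both tensors flatten to the same element count, converting an output index to an offset and back through the input shape yields the matching input element.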
2.10.5. REVERSE
Returns a tensor with the same type/values as the input, with the data reversed along the given axis. No data conversion happens during a reverse operation.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input | shape | 1 to MAX_RANK | Input tensor |
Attribute | T<i32_t> | axis | - | 0 | Axis to reverse, in range from 0 to rank(shape)-1 |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor. Same shape as input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
Any | shape | shape_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(axis < 0 || axis >= rank(shape));
for_each(index in shape) {
dim_t tmp_index = index;
tmp_index[axis] = shape[axis] - 1 - index[axis];
in_out_t value = tensor_read<in_out_t>(input, shape, tmp_index);
tensor_write<in_out_t>(output, shape, index, value);
}
2.10.6. SLICE
Extracts a slice of input1, beginning at the start coordinates, and extending for size elements in each direction. No data conversion happens during a slice operation.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 1 to MAX_RANK | Input tensor |
Attribute | T<index_t> | start | [rank(shape1)] | 1 | List of integer coordinates, of length equal to the rank of input1. Start coordinate for slicing. |
Attribute | T<index_t> | size | [rank(shape1)] | 1 | List of integer size values, of length equal to the rank of input1. Size of the input to be used. |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor of same type as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(rank(shape1) != length(start) || rank(shape1) != length(size));
ERROR_IF(rank(shape1) != rank(shape));
// Sanity check the given coordinates, ensure start and end are
// within tensor bounds
for_each(index in rank(shape1)) {
ERROR_IF(start[index] < 0);
ERROR_IF(size[index] <= 0); //Output must be positive size
ERROR_IF(start[index] + size[index] > shape1[index]);
ERROR_IF(shape[index] != size[index]);
}
for_each(index in shape) {
dim_t tmp_index = index;
for(i = 0; i < rank(shape); i++) {
tmp_index[i] = index[i] + start[i];
}
in_out_t value = tensor_read<in_out_t>(input1, shape1, tmp_index);
tensor_write<in_out_t>(output, shape, index, value);
}
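A rank-2 sketch of the slice index shift (hypothetical helper):

```python
def slice_2d(input1, start, size):
    # Each output index reads the input at index + start, per dimension.
    return [[input1[i + start[0]][j + start[1]] for j in range(size[1])]
            for i in range(size[0])]
```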
2.10.7. TILE
Replicates input1 multiples times along each dimension.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 1 to MAX_RANK | Input tensor |
Input | T<shape_t> | multiples | [rank(shape1)] | 1 | Number of times to replicate input1 in each dimension |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor of same type, rank as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(rank(shape1) != rank(shape));
for_each(index in shape) {
dim_t tmp_index = index;
for(i = 0; i < rank(shape); i++) {
ERROR_IF(shape1[i] * multiples[i] != shape[i]);
tmp_index[i] = index[i] % shape1[i];
}
in_out_t value = tensor_read<in_out_t>(input1, shape1, tmp_index);
tensor_write<in_out_t>(output, shape, index, value);
}
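The modulo mapping above, sketched in one dimension (hypothetical helper):

```python
def tile_1d(input1, multiple):
    # Each output element maps back to input1 via index % len(input1).
    n = len(input1)
    return [input1[i % n] for i in range(n * multiple)]
```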
2.10.8. TRANSPOSE
Permutes the dimensions of the input tensor input1 based on the perms argument. Each value in the perms list must be a valid dimension of the input tensor and may not be repeated.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape1 | 1 to MAX_RANK | Input tensor |
Attribute | T<i32_t> | perms | [rank(shape1)] | 1 | List of integers of length equal to the rank of input1. Values must be valid dimensions within shape1, and may not be repeated. |
Output | T<in_out_t> | output | shape | 1 to MAX_RANK | Output tensor of same type, rank as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | boolean | bool_t |
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
ERROR_IF(rank(shape1) != rank(shape));
ERROR_IF(tensor_size(shape1) != tensor_size(shape));
for_each(index in perms) {
// Ensure each perms value is a valid value
ERROR_IF(index >= rank(shape1));
ERROR_IF(index < 0);
// Ensure no dimension is repeated
ERROR_IF(indexes_used[index] == true);
indexes_used[index] = true;
}
// Ensure that the output shape is the input shape
// permuted by perms
for(i = 0; i < rank(shape); i++) {
ERROR_IF(shape1[perms[i]] != shape[i]);
}
for_each(index in shape) {
dim_t tmp_index = index;
for(i = 0; i < rank(shape); i++) {
tmp_index[perms[i]] = index[i];
}
in_out_t value = tensor_read<in_out_t>(input1, shape1, tmp_index);
tensor_write<in_out_t>(output, shape, index, value);
}
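The permuted-index mapping above can be sketched in Python (illustrative only, not part of the specification):

```python
import itertools

# Non-normative sketch of TRANSPOSE: output index `index` reads the input
# position tmp_index where tmp_index[perms[i]] = index[i].
def transpose(input1, shape1, perms):
    # perms must list each dimension exactly once
    assert sorted(perms) == list(range(len(shape1)))
    shape = [shape1[p] for p in perms]
    output = {}
    for index in itertools.product(*(range(d) for d in shape)):
        tmp_index = [0] * len(shape)
        for i in range(len(shape)):
            tmp_index[perms[i]] = index[i]
        output[index] = input1[tuple(tmp_index)]
    return shape, output
```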
2.11. Scatter/Gather Operators
2.11.1. GATHER
Generate a tensor for which each element in the output is a subtensor of the values tensor based on the indices. N is the number of batches, W the number of indices in each batch, K the range of each index, and C the number of data channels for each index.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | values | [N,K,C] | 3 | 3D value tensor |
Input | T<index_t> | indices | [N,W] | 2 | 2D index tensor |
Output | T<in_out_t> | output | [N,W,C] | 3 | 3D output tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
for_each(0 <= n < N, 0 <= w < W, 0 <= c < C) {
index_t k = tensor_read<index_t>(indices, [N,W], [n,w]);
REQUIRE(0 <= k && k < K);
in_out_t value = tensor_read<in_out_t>(values, [N,K,C], [n,k,c]);
tensor_write<in_out_t>(output, [N,W,C], [n,w,c], value);
}
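The gather step can be sketched in Python with nested lists (illustrative only, not part of the specification):

```python
# Non-normative sketch of GATHER: output[n][w][c] = values[n][indices[n][w]][c].
def gather(values, indices):
    N, K, C = len(values), len(values[0]), len(values[0][0])
    W = len(indices[0])
    out = [[[None] * C for _ in range(W)] for _ in range(N)]
    for n in range(N):
        for w in range(W):
            k = indices[n][w]
            assert 0 <= k < K  # corresponds to the REQUIRE in the pseudocode
            for c in range(C):
                out[n][w][c] = values[n][k][c]
    return out
```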
2.11.2. SCATTER
The values_out tensor is set to the values_in tensor with data modified as follows: data from the input tensor is inserted at the positions specified by the indices tensor. N is the number of batches, W the number of indices in each batch, K the range of each index, and C the number of data channels for each index. It is not permitted to repeat the same output index within a single SCATTER operation, and so each output index occurs at most once. In use cases that require multiple updates to the same output position, these must be decomposed into multiple SCATTER operations.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | values_in | [N,K,C] | 3 | 3D values in tensor |
Input | T<index_t> | indices | [N,W] | 2 | 2D index tensor |
Input | T<in_out_t> | input | [N,W,C] | 3 | 3D input tensor |
Output | T<in_out_t> | values_out | [N,K,C] | 3 | 3D output tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | signed 8 | i8_t |
Any | signed 16 | i16_t |
Any | signed 32 | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
// The following array is used to check compliance that an output position
// is modified at most once.
bool_t output_modified[N,K,C];
// Copy the values_in tensor to the values_out tensor.
// Values not written by the scatter operation are unchanged in the output.
for_each(0 <= n < N, 0 <= k < K, 0 <= c < C) {
in_out_t value = tensor_read<in_out_t>(values_in, [N,K,C], [n,k,c]);
tensor_write<in_out_t>(values_out, [N,K,C], [n, k, c], value);
output_modified[n,k,c]=false;
}
// Now perform the SCATTER operation, modifying the positions from the indices tensor
for_each(0 <= n < N, 0 <= w < W, 0 <= c < C) {
index_t k = tensor_read<index_t>(indices, [N,W], [n,w]);
REQUIRE(0 <= k && k < K);
REQUIRE(output_modified[n,k,c] == false);
in_out_t value = tensor_read<in_out_t>(input, [N,W,C], [n,w,c]);
tensor_write<in_out_t>(values_out, [N,K,C], [n, k, c], value);
output_modified[n,k,c] = true;
}
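The copy-then-update structure above can be sketched in Python (illustrative only, not part of the specification); the per-(n,k) set stands in for the output_modified array, since all channels of a position are written together:

```python
import copy

# Non-normative sketch of SCATTER: values_out starts as a copy of values_in,
# and each indexed position is overwritten at most once from `input`.
def scatter(values_in, indices, input):
    N, K, C = len(values_in), len(values_in[0]), len(values_in[0][0])
    W = len(indices[0])
    values_out = copy.deepcopy(values_in)
    modified = set()  # tracks positions written by this operation
    for n in range(N):
        for w in range(W):
            k = indices[n][w]
            assert 0 <= k < K
            assert (n, k) not in modified  # no repeated output index
            modified.add((n, k))
            for c in range(C):
                values_out[n][k][c] = input[n][w][c]
    return values_out
```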
2.12. Image Operators
2.12.1. RESIZE
Resizes a tensor. Resize is only allowed in the H and W dimensions.
The height dimension is scaled by factor (scale_y_n/scale_y_d). The width dimension is scaled by factor (scale_x_n/scale_x_d).
The NEAREST_NEIGHBOR mode returns the value of the input tensor closest to the calculated sample position for both floating-point and integer data formats.
Floating-point BILINEAR mode returns a bilinearly interpolated output value based on the four closest input sample positions.
For integer BILINEAR interpolation mode, the output value must be scaled by 1/(scale_y_n * scale_x_n) in a following operation to complete the interpolation (for example with a RESCALE operator).
The following examples show practical uses of the parameters:
-
For approximate uniform input sampling between (0, 0) and (IH - 1, IW - 1) set
-
scale_y_n/scale_y_d = (OH - 1)/(IH - 1) as integer ratios
-
scale_x_n/scale_x_d = (OW - 1)/(IW - 1) as integer ratios
-
offset_x = 0, offset_y = 0, border_x = 0, border_y = 0
-
-
For power of two upscale [OH - 1,OW - 1] = (1 << k) * [IH - 1, IW - 1], sampling between (0,0) and (IH - 1,IW - 1), set:
-
scale_y_n = (1 << k), scale_y_d = 1, offset_y = 0, border_y = 0
-
scale_x_n = (1 << k), scale_x_d = 1, offset_x = 0, border_x = 0
-
-
For power of two upscale [OH,OW] = (1 << k) * [IH,IW], sampling range approximately (-0.5, -0.5) to (IH - 0.5, IW - 0.5), set:
-
scale_y_n = 2 << k, scale_y_d = 2, offset_y = -(1 << k) + 1, border_y = (1 << k) - 1
-
scale_x_n = 2 << k, scale_x_d = 2, offset_x = -(1 << k) + 1, border_x = (1 << k) - 1
-
The output dimensions can be derived from the input dimensions by inverting the scale as described in the pseudocode. The [border_y, border_x] values adjust the output size to allow fractional sampling beyond integer input position (IH - 1,IW - 1).
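The output-size relation can be sketched in Python (illustrative only, not part of the specification), mirroring OH = idiv_check((IH - 1) * scale_y_n - offset_y + border_y, scale_y_d) + 1:

```python
# Non-normative sketch: derive one output dimension of RESIZE from the
# scale, offset, and border parameters. idiv_check requires exact division.
def resize_output_dim(in_dim, scale_n, scale_d, offset, border):
    numerator = (in_dim - 1) * scale_n - offset + border
    assert numerator % scale_d == 0  # idiv_check: division must be exact
    return numerator // scale_d + 1
```

For example, with IH = 5 and the power-of-two upscale parameters for k = 1 above, the first parameter set gives OH = 9 and the second gives OH = 10.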
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input | [N,IH,IW,C] | 4 | Input tensor |
Input | T<shape_t> | scale | [4] | 1 | [scale_y_n, scale_y_d, scale_x_n, scale_x_d] |
Input | T<shape_t> | offset | [2] | 1 | [offset_y, offset_x] |
Input | T<shape_t> | border | [2] | 1 | [border_y, border_x] |
Attribute | T<resize_mode_t> | mode | - | 0 | BILINEAR or NEAREST |
Output | T<out_t> | output | [N,OH,OW,C] | 4 | Output tensor |
Supported Data Types:
Profile | Mode | resize_t | in_t | out_t |
---|---|---|---|---|
Any | signed 8, bilinear | i16_t | i8_t | i32_t |
Any | signed 8, nearest | i16_t | i8_t | i8_t |
Any | signed 16, bilinear | i16_t | i16_t | i48_t |
Any | signed 16, nearest | i16_t | i16_t | i16_t |
MI, MT | fp16 | fp16_t | fp16_t | fp16_t |
MI, MT | bf16 | bf16_t | bf16_t | bf16_t |
MI, MT | fp32 | fp32_t | fp32_t | fp32_t |
Operation Function:
LEVEL_CHECK(scale_y_n/scale_y_d <= MAX_SCALE);
LEVEL_CHECK(scale_x_n/scale_x_d <= MAX_SCALE);
Resize Modes:
Mode | Description |
---|---|
NEAREST | Nearest Neighbor |
BILINEAR | Bilinear interpolation |
// Ensure the image size is supported by GPU APIs and that for integer
// implementations, position * stride does not overflow int32_t.
ERROR_IF(max(OH,OW,IH,IW) >= 16384);
ERROR_IF(scale_y_n <= 0 || scale_y_d <= 0 || scale_x_n <= 0 || scale_x_d <= 0);
// if in_t=int8_t ensure that an int32_t accumulator can be used
ERROR_IF(scale_y_n > (1 << 11) || scale_x_n > (1 << 11));
// set a consistent lower limit of 1/16 downscale to simplify implementations
ERROR_IF(scale_y_d >= 16 * scale_y_n || scale_x_d >= 16 * scale_x_n);
ERROR_IF(offset_y < -scale_y_n || offset_y >= 16 * scale_y_n);
ERROR_IF(offset_x < -scale_x_n || offset_x >= 16 * scale_x_n);
ERROR_IF(border_y < -16 * scale_y_n || border_y >= scale_y_n);
ERROR_IF(border_x < -16 * scale_x_n || border_x >= scale_x_n);
ERROR_IF(OH != idiv_check((IH - 1) * scale_y_n - offset_y + border_y, scale_y_d) + 1);
ERROR_IF(OW != idiv_check((IW - 1) * scale_x_n - offset_x + border_x, scale_x_d) + 1);
for_each(0 <= n < N, 0 <= oy < OH, 0 <= ox < OW, 0 <= c < C) {
out_t acc;
resize_t dx, dy;
resize_t unit_x, unit_y;
unit_x = (is_floating_point(resize_t)) ? 1.0 : scale_x_n;
unit_y = (is_floating_point(resize_t)) ? 1.0 : scale_y_n;
int32_t y = oy * scale_y_d + offset_y;
int32_t x = ox * scale_x_d + offset_x;
int16_t iy = floor(y / scale_y_n);
int16_t ix = floor(x / scale_x_n);
int16_t ry = y - iy * scale_y_n; // (y % scale_y_n)
int16_t rx = x - ix * scale_x_n; // (x % scale_x_n)
if (is_floating_point(resize_t)) {
dy = static_cast<resize_t>(ry) / static_cast<resize_t>(scale_y_n);
dx = static_cast<resize_t>(rx) / static_cast<resize_t>(scale_x_n);
} else {
dy = ry;
dx = rx;
}
// Note that -1 <= iy < IH and -1 <= ix < IW
int16_t iy0 = apply_max_s(iy, 0);
int16_t iy1 = apply_min_s(iy + 1, IH - 1);
int16_t ix0 = apply_max_s(ix, 0);
int16_t ix1 = apply_min_s(ix + 1, IW - 1);
if (mode==BILINEAR) {
using in_s_t = make_signed(in_t); // Use signed calculations for i8/i16
in_s_t v00 = static_cast<in_s_t>(tensor_read<in_t>(input, [N,IH,IW,C], [n,iy0,ix0,c]));
in_s_t v01 = static_cast<in_s_t>(tensor_read<in_t>(input, [N,IH,IW,C], [n,iy0,ix1,c]));
in_s_t v10 = static_cast<in_s_t>(tensor_read<in_t>(input, [N,IH,IW,C], [n,iy1,ix0,c]));
in_s_t v11 = static_cast<in_s_t>(tensor_read<in_t>(input, [N,IH,IW,C], [n,iy1,ix1,c]));
acc = v00 * (unit_y - dy) * (unit_x - dx);
acc += v01 * (unit_y - dy) * dx;
acc += v10 * dy * (unit_x - dx);
acc += v11 * dy * dx;
tensor_write<out_t>(output, [N,OH,OW,C], [n,oy,ox,c], acc);
} else if (mode==NEAREST) {
int32_t iy, ix;
if (is_floating_point(resize_t)) {
iy = (dy >= 0.5) ? iy1 : iy0;
ix = (dx >= 0.5) ? ix1 : ix0;
} else {
iy = (2 * dy >= scale_y_n) ? iy1 : iy0;
ix = (2 * dx >= scale_x_n) ? ix1 : ix0;
}
in_t v = tensor_read<in_t>(input, [N,IH,IW,C], [n,iy,ix,c]);
tensor_write<out_t>(output, [N,OH,OW,C], [n,oy,ox,c], v);
}
}
2.13. Type Conversion
2.13.1. CAST
Casts a tensor from one data type to another.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input | shape | 0 to MAX_RANK | Input tensor |
Output | T<out_t> | output | shape | 0 to MAX_RANK | Output tensor |
Supported Data Types:
Profile | Mode | in_t | out_t |
---|---|---|---|
Any | bool to signed 8 | bool_t | i8_t |
Any | bool to signed 16 | bool_t | i16_t |
Any | bool to signed 32 | bool_t | i32_t |
Any | signed 8 to bool | i8_t | bool_t |
Any | signed 8 to signed 16 | i8_t | i16_t |
Any | signed 8 to signed 32 | i8_t | i32_t |
MI, MT | signed 8 to fp16 | i8_t | fp16_t |
MI, MT | signed 8 to bf16 | i8_t | bf16_t |
MI, MT | signed 8 to fp32 | i8_t | fp32_t |
Any | signed 16 to bool | i16_t | bool_t |
Any | signed 16 to signed 8 | i16_t | i8_t |
Any | signed 16 to signed 32 | i16_t | i32_t |
MI, MT | signed 16 to fp16 | i16_t | fp16_t |
MI, MT | signed 16 to bf16 | i16_t | bf16_t |
MI, MT | signed 16 to fp32 | i16_t | fp32_t |
Any | signed 32 to bool | i32_t | bool_t |
Any | signed 32 to signed 8 | i32_t | i8_t |
Any | signed 32 to signed 16 | i32_t | i16_t |
MI, MT | signed 32 to fp16 | i32_t | fp16_t |
MI, MT | signed 32 to bf16 | i32_t | bf16_t |
MI, MT | signed 32 to fp32 | i32_t | fp32_t |
MI, MT | bf16 to signed 8 | bf16_t | i8_t |
MI, MT | bf16 to signed 16 | bf16_t | i16_t |
MI, MT | bf16 to signed 32 | bf16_t | i32_t |
MI, MT | bf16 to fp32 | bf16_t | fp32_t |
MI, MT | fp16 to signed 8 | fp16_t | i8_t |
MI, MT | fp16 to signed 16 | fp16_t | i16_t |
MI, MT | fp16 to signed 32 | fp16_t | i32_t |
MI, MT | fp16 to fp32 | fp16_t | fp32_t |
MI, MT | fp32 to signed 8 | fp32_t | i8_t |
MI, MT | fp32 to signed 16 | fp32_t | i16_t |
MI, MT | fp32 to signed 32 | fp32_t | i32_t |
MI, MT | fp32 to bf16 | fp32_t | bf16_t |
MI, MT | fp32 to fp16 | fp32_t | fp16_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
for_each(index in shape) {
in_t in = tensor_read<in_t>(input, shape, index);
out_t out;
if (out_t == bool_t) {
out = (in != 0) ? true : false;
} else if (in_t == bool_t) {
out = (in) ? 1 : 0;
} else if (out_t == fp16_t || out_t == bf16_t || out_t == fp32_t) {
out = round_to_nearest_float(in);
} else if (in_t == fp16_t || in_t == bf16_t || in_t == fp32_t) {
out = apply_clip<out_t>(round_to_nearest_int(in), minimum<out_t>, maximum<out_t>);
} else if (sizeof(out_t) >= sizeof(in_t)) {
out = sign_extend<out_t>(in);
} else {
out = truncate(in);
}
tensor_write<out_t>(output, shape, index, out);
}
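The integer branches of the cast rules can be sketched in Python (illustrative only, not part of the specification; the specification's round_to_nearest_int tie-breaking is defined elsewhere, and Python's round, used here, rounds ties to even):

```python
# Non-normative sketch of CAST for integer outputs:
# float -> int rounds to nearest then saturates (apply_clip);
# wider int -> narrower int truncates (two's-complement wrap).
def cast_int(value, out_bits, from_float=False):
    lo = -(1 << (out_bits - 1))
    hi = (1 << (out_bits - 1)) - 1
    if from_float:
        return min(max(round(value), lo), hi)  # round then clip
    # keep the low out_bits bits, reinterpret as signed
    return ((value - lo) % (1 << out_bits)) + lo
```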
2.13.2. RESCALE
Rescale quantized values into a new domain. This function scales by the factor multiplier * 2^(-shift).
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_t> | input | shape | 0 to MAX_RANK | Input tensor |
Output | T<out_t> | output | shape | 0 to MAX_RANK | Output tensor with the same shape as input |
Attribute | T<in_t> | input_zp | - | 0 | Input tensor zero point. int8/uint8 can have zero point within their valid range. uint16 zero point must be either 0 or 32768. All other types must have zero point equal to 0. |
Attribute | T<out_t> | output_zp | - | 0 | Output tensor zero point. int8/uint8 can have zero point within their valid range. uint16 zero point must be either 0 or 32768. All other types must have zero point equal to 0. |
Input (MT profile) Attribute (BI/MI profiles) | T<mul_t> | multiplier | [NC] | 1 | Scaling multiplier array |
Input (MT profile) Attribute (BI/MI profiles) | T<i8_t> | shift | [NC] | 1 | Scaling shift array |
Attribute | T<bool_t> | scale32 | - | 0 | if (scale32) mul_t=i32_t else mul_t=i16_t |
Attribute | T<bool_t> | double_round | - | 0 | Select double round mode |
Attribute | T<bool_t> | per_channel | - | 0 | if (per_channel) NC=shape[rank(shape)-1] else NC=1 |
Attribute | T<bool_t> | input_unsigned | - | 0 | If True, treat the input values as unsigned. |
Attribute | T<bool_t> | output_unsigned | - | 0 | If True, treat the output values as unsigned. |
Supported Data Types:
Profile | Mode | in_t | out_t |
---|---|---|---|
Any | 8-bit to 8-bit | i8_t | i8_t |
Any | 8-bit to 16-bit | i8_t | i16_t |
Any | 8-bit to 32-bit | i8_t | i32_t |
Any | 16-bit to 8-bit | i16_t | i8_t |
Any | 16-bit to 16-bit | i16_t | i16_t |
Any | 16-bit to 32-bit | i16_t | i32_t |
Any | 32-bit to 8-bit | i32_t | i8_t |
Any | 32-bit to 16-bit | i32_t | i16_t |
Any | 32-bit to 32-bit | i32_t | i32_t |
Any | 48-bit to 8-bit | i48_t | i8_t |
Any | 48-bit to 16-bit | i48_t | i16_t |
Any | 48-bit to 32-bit | i48_t | i32_t |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
for_each(index in shape) {
// uint16 values can have zero_point 0 or 32768
// int8/uint8 can have zero point within their valid range
// No other types can have zero point != 0
ERROR_IF(in_t != i8_t &&
(in_t != i16_t || input_unsigned == False) && input_zp != 0);
ERROR_IF(out_t != i8_t &&
(out_t != i16_t || output_unsigned == False) && output_zp != 0);
ERROR_IF(in_t == i16_t && input_unsigned == True && input_zp != 0 && input_zp != 32768);
ERROR_IF(out_t == i16_t && output_unsigned == True && output_zp != 0 && output_zp != 32768);
ERROR_IF(scale32 && in_t == i48_t);
ERROR_IF(!scale32 && double_round);
ERROR_IF(in_t == i16_t && out_t == i32_t && input_unsigned);
ERROR_IF(in_t == i32_t && out_t == i16_t && output_unsigned);
in_t in_value = tensor_read<in_t>(input, shape, index);
int48_t value, extended_in_zp;
if (input_unsigned) {
value = zero_extend<int48_t>(in_value);
extended_in_zp = zero_extend<int48_t>(input_zp);
}
else {
value = sign_extend<int48_t>(in_value);
extended_in_zp = sign_extend<int48_t>(input_zp);
}
value = value - extended_in_zp;
int c = (per_channel) ? index[rank(input) - 1] : 0;
int32_t result = (scale32) ?
apply_scale_32(value, multiplier[c], shift[c], double_round) :
apply_scale_16(value, multiplier[c], shift[c]);
out_t out;
if (output_unsigned) {
int32_t extended_out_zp = zero_extend<int32_t>(output_zp);
result = apply_add_s<int32_t>(result, extended_out_zp);
out = static_cast<out_t>(apply_clip<int32_t>(result,
minimum_u<out_t>,
maximum_u<out_t>));
}
else {
int32_t extended_out_zp = sign_extend<int32_t>(output_zp);
result = apply_add_s<int32_t>(result, extended_out_zp);
out = static_cast<out_t>(apply_clip<int32_t>(result,
minimum_s<out_t>,
maximum_s<out_t>));
}
tensor_write<out_t>(output, shape, index, out);
}
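The signed value flow above can be sketched in Python (illustrative only, not part of the specification). apply_scale_32/apply_scale_16 and the double_round option are defined elsewhere in the specification; the sketch approximates single-round behavior with round-half-up:

```python
# Non-normative sketch of RESCALE for signed types: subtract input_zp,
# scale by multiplier * 2^(-shift) with rounding, add output_zp, clip.
def rescale(value, input_zp, output_zp, multiplier, shift, out_bits):
    v = value - input_zp
    # approximate apply_scale: multiply, add half, arithmetic right shift
    scaled = (v * multiplier + (1 << (shift - 1))) >> shift
    result = scaled + output_zp
    lo = -(1 << (out_bits - 1))
    hi = (1 << (out_bits - 1)) - 1
    return min(max(result, lo), hi)
```

For example, multiplier = 1 << 30 with shift = 31 represents a scale factor of 0.5.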
2.14. Data Nodes
2.14.1. CONST
A node containing constant data for use as the input to an operation. May hold data in any of the supported data formats.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Attribute | T<out_t> | values | shape | 0 to MAX_RANK | Constant values |
Output | T<out_t> | output | shape | 0 to MAX_RANK | Output tensor of the same type and size as the values attribute |
Supported Data Types:
Profile | Mode | out_t |
---|---|---|
Any | Boolean | bool_t |
Any | 4-bit | i4_t |
Any | 8-bit | i8_t |
Any | 16-bit | i16_t |
Any | 32-bit | i32_t |
Any | 48-bit | i48_t |
Any | shape | shape_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
2.14.2. IDENTITY
Returns a tensor with the same shape, type, and contents as the input.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<in_out_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Output | T<in_out_t> | output | shape | 0 to MAX_RANK | Output tensor of the same type and size as the input tensor |
Supported Data Types:
Profile | Mode | in_out_t |
---|---|---|
Any | Boolean | bool_t |
Any | 8-bit | i8_t |
Any | 16-bit | i16_t |
Any | 32-bit | i32_t |
MI, MT | fp16 | fp16_t |
MI, MT | bf16 | bf16_t |
MI, MT | fp32 | fp32_t |
Operation Function:
2.15. Custom Operators
Hardware implementing TOSA may choose to add additional custom operators that are not expressed in the existing TOSA operations. These operators are not expected to be portable across TOSA implementations. The input and output signatures must be expressed in the corresponding TOSA node.
2.15.1. CUSTOM
Runs an implementation-defined custom operator. CUSTOM operators are not tested in the conformance suite as results will be implementation defined. The domain attribute should be unique to each implementation. To achieve this, using a domain name as the domain attribute is recommended.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | tensor_list_t | input_list | - | - | List of input tensors |
Attribute | String | operator | - | - | String which tells the backend which custom operator is being called |
Attribute | String | domain | - | - | String identifier which can help avoid name collisions on the operator field. Different implementations of a given operator would be in different domains. Implementations can choose which domains they want to support. |
Attribute | String | implementation_attrs | - | - | String value containing implementation specific attributes which apply to the operation |
Output | tensor_list_t | output_list | - | - | List of output tensors |
Operation Function:
// Implementation defined behavior
2.16. Control Flow Operators
TOSA implements two control flow operators, for conditional branching and loop-based control. Both have attributes that are TOSA sub-graphs.
2.16.1. COND_IF
Evaluates a Boolean condition and then takes one of two distinct execution paths. This implements the semantic if-then-else structure.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | T<bool_t> | condition | shape | 1 to MAX_RANK | Input condition as a size 1 tensor |
Input | tensor_list_t | input_list | - | - | List of input tensors |
Attribute | tosa_graph_t | then_graph | - | - | TOSA graph to execute if condition is true |
Attribute | tosa_graph_t | else_graph | - | - | TOSA graph to execute if condition is false |
Output | tensor_list_t | output_list | - | - | List of output tensors |
Operation Function:
ERROR_IF(tosa_nesting_depth >= MAX_NESTING);
ERROR_IF(tensor_list_shape(input_list) != tosa_input_shape(then_graph));
ERROR_IF(tensor_list_shape(input_list) != tosa_input_shape(else_graph));
ERROR_IF(tensor_list_shape(output_list) != tosa_output_shape(then_graph));
ERROR_IF(tensor_list_shape(output_list) != tosa_output_shape(else_graph));
ERROR_IF(tensor_size(shape) != 1);
tosa_nesting_depth++;
if (condition[0]) {
tosa_execute_graph(then_graph, input_list, output_list);
} else {
tosa_execute_graph(else_graph, input_list, output_list);
}
tosa_nesting_depth--;
2.16.2. WHILE_LOOP
Generates and evaluates a Bool condition and either executes a loop body or exits the loop. This action is performed repeatedly after updating and re-evaluating the Boolean condition every iteration. This implements the semantic foreach or while iterative loop structure.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Input | tensor_list_t | input_list | - | - | List of input tensors |
Attribute | tosa_graph_t | cond_graph | - | - | TOSA graph to evaluate the condition |
Attribute | tosa_graph_t | body_graph | - | - | TOSA graph to execute the loop body |
Output | tensor_list_t | output_list | - | - | List of output tensors |
Operation Function:
ERROR_IF(tosa_nesting_depth >= MAX_NESTING);
ERROR_IF(tensor_list_shape(input_list) != tosa_list_shape(output_list));
ERROR_IF(tensor_list_shape(input_list) != tosa_input_shape(cond_graph));
ERROR_IF(tensor_list_shape(input_list) != tosa_input_shape(body_graph));
ERROR_IF(tensor_list_shape(input_list) != tosa_output_shape(body_graph));
// Condition graph output must be a single element tensor with a single bool value
ERROR_IF(tensor_size(tosa_output_shape(cond_graph)) != 1);
ERROR_IF(tosa_output_type(cond_graph) != bool_t);
// The iteration number 'i' is included to give unique names to variables
// in each iteration of the loop and is not required by implementations
int32_t i=0; // iteration number
tensor_list_t list[]; // array of tensor lists indexed by iteration
bool_t *condition[]; // array of condition tensors indexed by iteration
list[i] = input_list; // copy input data as list[0]
tosa_nesting_depth++;
tosa_execute_graph(cond_graph, list[i], [ condition[i] ]); // initial condition
while (condition[i][0]) {
tosa_execute_graph(body_graph, list[i], list[i+1]);
i = i+1;
tosa_execute_graph(cond_graph, list[i], [ condition[i] ]);
}
tosa_nesting_depth--;
output_list = list[i];
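The iteration structure above can be sketched in Python (illustrative only, not part of the specification), modeling the condition and body graphs as callables:

```python
# Non-normative sketch of WHILE_LOOP: evaluate the condition graph, then
# alternate body and condition until the condition is false.
def while_loop(cond_graph, body_graph, input_list):
    tensors = list(input_list)       # list[0] = input_list
    condition = cond_graph(tensors)  # initial condition
    while condition:
        tensors = body_graph(tensors)      # list[i+1] from list[i]
        condition = cond_graph(tensors)    # re-evaluate each iteration
    return tensors                   # output_list = final list[i]
```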
2.17. Variable Operators
TOSA implements three variable operators for expressing persistent mutable values across multiple TOSA graph invocations.
2.17.1. VARIABLE
Defines a new TOSA variable. This is a persistent mutable value across multiple TOSA graph invocations. Modifications are expressed using read/write semantics.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Attribute | T<i32_t> | uid | - | 0 | Globally unique identifier for the declared variable tensor. |
Attribute | T<index_t> | var_shape | var_shape | 1 | The variable tensor shape |
Attribute | T<var_t> | type | - | 0 | Type of the tensor variable elements. |
Attribute | T<in_t> | initial_value | shape | 0 to MAX_RANK | Initial value of the variable tensor. This argument is optional with default value NULL. |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
tensor_t var_tensor = variable_tensor_lookup(uid);
// Invocation for the first time
if (var_tensor == NULL) {
// Allocate the persistent mutable memory for the variable tensor
tensor_t var_tensor = variable_tensor_allocate<var_t>(var_shape, uid);
if (initial_value != NULL) {
REQUIRE(var_t == in_t);
REQUIRE(var_shape == shape);
for_each (index in shape) {
// Copy data from initial_value to var_tensor
in_t value = tensor_read<in_t>(initial_value, shape, index);
tensor_write<in_t>(var_tensor.data, var_shape, index, value);
}
var_tensor.is_written = true;
}
} else { // Variable tensor has already been declared
// It's invalid to declare the second variable with the same uid in a single graph execution,
REQUIRE(!var_tensor.seen);
}
var_tensor.seen = true;
2.17.2. VARIABLE_WRITE
Assigns a value to the pseudo-buffer resource holding a persistent mutable tensor.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Attribute | T<i32_t> | uid | - | 0 | Globally unique identifier of the variable tensor that is being written to |
Input | T<in_t> | input1 | shape | 0 to MAX_RANK | Input tensor |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
tensor_t variable_tensor = variable_tensor_lookup(uid);
// Check this variable tensor has been declared
REQUIRE(variable_tensor != NULL);
// The tensor must have been declared (seen) before it can be written to
// The seen variable is cleared before each graph execution and set in declaration
REQUIRE(variable_tensor.seen);
// Input tensor's shape and variable_tensor's shape have to match
REQUIRE(variable_tensor.shape == shape);
// Input tensor's type and variable_tensor's type have to match
REQUIRE(variable_tensor.type == in_t);
for_each (index in shape) {
// Write data from the input to the pseudo-buffer resource
in_t value = tensor_read<in_t>(input1, shape, index);
tensor_write<in_t>(variable_tensor.data, variable_tensor.shape, index, value);
}
variable_tensor.is_written = true;
2.17.3. VARIABLE_READ
Reads the value from a pseudo-buffer resource holding a persistent mutable tensor.
Arguments:
Argument | Type | Name | Shape | Rank | Description |
---|---|---|---|---|---|
Attribute | T<i32_t> | uid | - | 0 | Globally unique identifier of the variable tensor that is being read from |
Output | T<out_t> | output1 | shape | 0 to MAX_RANK | Output tensor |
Operation Function:
LEVEL_CHECK(rank(shape) <= MAX_RANK);
tensor_t variable_tensor = variable_tensor_lookup(uid);
// Check this variable tensor has been declared
REQUIRE(variable_tensor != NULL);
// Check this variable tensor has been written
REQUIRE(variable_tensor.is_written);
// Output tensor's shape and variable_tensor's shape have to match
REQUIRE(variable_tensor.shape == shape);
// Output tensor's type and variable_tensor's type have to match
REQUIRE(variable_tensor.type == out_t);
for_each (index in shape) {
// Read data from pseudo-buffer resource to the output
out_t value = tensor_read<out_t>(variable_tensor.data, variable_tensor.shape, index);
tensor_write<out_t>(output1, shape, index, value);
}
3. Enumerations
Where enumerated types are specified for an operator, the provided value must be a valid enumerant for that type. The included tables provide reference values for the enumerations. Implementations do not need to use these values; they may substitute other values as long as they are functionally equivalent.
3.1. resize_mode_t
Valid resize types
Name | Value | Description |
---|---|---|
NEAREST_NEIGHBOR | 0 | Nearest neighbor resize |
BILINEAR | 1 | Bilinear resize |
3.2. acc_size_t
Allowed accumulator sizes
Name | Value | Description |
---|---|---|
INT32 | 0 | 32-bit integer |
FP16 | 1 | 16-bit floating-point |
FP32 | 2 | 32-bit floating-point |
3.3. var_t
Variable tensor data type
Name | Value | Description |
---|---|---|
BOOLEAN | 0 | Boolean |
INT8 | 1 | 8-bit integer |
INT16 | 2 | 16-bit integer |
INT32 | 3 | 32-bit integer |
FP16 | 4 | 16-bit floating-point |
BF16 | 5 | 16-bit brain floating-point |
FP32 | 6 | 32-bit floating-point |
4. TOSA Pseudocode
The TOSA pseudocode provides precise descriptions of TOSA operations. Each operator contains pseudocode describing the operator’s functionality. This section contains pseudocode functions shared across multiple operators in the specification.
4.1. Operator Validation Helpers
The following functions are used to define the valid conditions for TOSA operators.
The REQUIRE function defines the conditions required by the TOSA operator. If the conditions are not met then the result of the TOSA graph is marked as unpredictable. Once the tosa_graph_result is set to tosa_unpredictable, the whole graph is considered unpredictable.
The ERROR_IF function defines a condition that must set an error if the condition holds and the graph is not unpredictable. Note that if a graph contains both unpredictable and error statements then the result of tosa_execute_graph() is tosa_unpredictable. This condition is captured in the ERROR_IF function.
Implementation Notes
-
An implementation is not required to detect unpredictable behavior. If tosa_execute_graph() returns tosa_unpredictable then the tosa_test_compliance() function does not require any specific output from an implementation.
-
An implementation is required to detect errors in a graph that does not have unpredictable behavior (see tosa_test_compliance).
-
An acceptable implementation is to stop and report an error on the first ERROR_IF condition that occurs. This satisfies tosa_test_compliance() even if the tosa_execute_graph() was tosa_unpredictable.
-
If the tosa_execute_graph() result is tosa_unpredictable or tosa_error, then there is no requirement on the implementation to execute any portion of the TOSA graph.
void REQUIRE(condition) {
// Unpredictable overrides any previous result
if (!(condition)) {
tosa_graph_result = tosa_unpredictable;
}
}
void ERROR_IF(condition) {
// Error encodes a predictable error state and so is not registered
// if the graph is marked as unpredictable.
if (tosa_graph_result != tosa_unpredictable && condition) {
tosa_graph_result = tosa_error;
}
}
void LEVEL_CHECK(condition) {
// If a level is specified and the level condition fails then
// the result is unpredictable.
REQUIRE(condition);
}
4.2. Tensor Access Helpers
4.2.1. Tensor Utilities
// Convert tensor index coordinates to an element offset
size_t tensor_index_to_offset(dim_t shape, dim_t index) {
size_t size = tensor_size(shape); // check tensor shape is valid
size_t offset = 0;
for (int32_t i = 0; i < rank(shape); i++) {
REQUIRE(index[i] >= 0 && index[i] < shape[i]);
offset = offset * shape[i] + index[i];
}
return offset;
}
// Convert an element offset to tensor index coordinates
dim_t tensor_offset_to_index(dim_t shape, size_t offset) {
size_t size = tensor_size(shape); // check tensor shape is valid
REQUIRE(offset < size);
dim_t index(rank(shape)); // index has rank(shape) indices
for(int32_t i = rank(shape) - 1; i >= 0; i--) {
index[i] = offset % shape[i];
offset /= shape[i];
}
return index;
}
// Check the tensor shape is valid and return the tensor size in elements
size_t tensor_size(dim_t shape) {
size_t size = 1;
for (int32_t i = 0; i < rank(shape); i++) {
REQUIRE(1 <= shape[i] && shape[i] <= maximum<size_t> / size);
size *= shape[i];
}
return size;
}
// Return the size of the tensor in the given axis
// For a rank=0 tensor, returns 1 for all axes
size_t shape_dim(dim_t shape, int axis) {
return (axis >= rank(shape)) ? 1 : shape[axis];
}
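The row-major offset/index round trip defined by tensor_index_to_offset and tensor_offset_to_index can be sketched in Python (illustrative only, not part of the specification):

```python
# Non-normative sketch of the row-major index <-> offset conversions above.
def index_to_offset(shape, index):
    offset = 0
    for dim, i in zip(shape, index):
        assert 0 <= i < dim  # corresponds to the REQUIRE in the pseudocode
        offset = offset * dim + i
    return offset

def offset_to_index(shape, offset):
    index = []
    for dim in reversed(shape):  # innermost dimension varies fastest
        index.append(offset % dim)
        offset //= dim
    return list(reversed(index))
```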
4.2.2. Tensor Read
tensor_read reads a single data value out of the given tensor. The shape argument contains the shape of the tensor. Index is the coordinates within the tensor of the value to be read.
in_t tensor_read<in_t>(in_t *address, dim_t shape, dim_t index) {
size_t offset = tensor_index_to_offset(shape, index);
return address[offset];
}
4.2.3. Tensor Write
tensor_write writes a single data value into the given tensor. The shape argument contains the shape of the tensor. Index is the coordinates within the tensor of the value to be written. value is the value to be written to the given coordinate.
void tensor_write<type>(<type> *address, dim_t shape, dim_t index, <type> value) {
size_t offset = tensor_index_to_offset(shape, index);
address[offset] = value;
}
4.2.4. Variable Tensor Allocate
variable_tensor_allocate allocates the mutable persistent memory block for storing variable tensors. The shape argument contains the shape of the allocated memory block for the variable_tensor. The uid argument is a globally unique identifier for variable tensors.
tensor_t* variable_tensor_allocate<in_t>(dim_t shape, int32_t uid) {
size_t size = tensor_size(shape);
tensor_t *allocated_tensor = new tensor_t;
allocated_tensor->data = new in_t[size];
allocated_tensor->uid = uid;
allocated_tensor->is_written = false;
allocated_tensor->shape = shape;
allocated_tensor->type = in_t;
return allocated_tensor;
}
4.2.5. Variable Tensor Lookup
variable_tensor_lookup returns the variable tensor that was allocated with the given uid, or NULL if no such tensor exists. The uid argument is a globally unique identifier for variable tensors.
tensor_t* variable_tensor_lookup(int32_t uid) {
// The global list all_allocated_variable_tensors is instantiated the
// first time the tosa graph is executed
for_each(tensor_t *allocated_tensor in all_allocated_variable_tensors) {
if (allocated_tensor->uid == uid) {
return allocated_tensor;
}
}
return NULL;
}
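The allocate/lookup pair can be sketched as a simple registry in Python; dictionary fields stand in for the tensor_t structure, and the element-type field is omitted for brevity:

```python
# Global registry of variable tensors, mirroring
# all_allocated_variable_tensors in the pseudocode.
all_allocated_variable_tensors = []

def variable_tensor_allocate(shape, uid):
    # Allocate the persistent storage for a variable tensor.
    size = 1
    for dim in shape:
        size *= dim
    tensor = {
        "data": [0] * size,   # freshly allocated storage
        "uid": uid,           # globally unique identifier
        "is_written": False,  # no value stored yet
        "shape": list(shape),
    }
    all_allocated_variable_tensors.append(tensor)
    return tensor

def variable_tensor_lookup(uid):
    # Return the allocated tensor with the given uid, or None.
    for tensor in all_allocated_variable_tensors:
        if tensor["uid"] == uid:
            return tensor
    return None
```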
4.2.6. Broadcast Helpers
The following function derives the broadcast output shape from the input shapes.
dim_t broadcast_shape(dim_t shape1, dim_t shape2) {
ERROR_IF(rank(shape1) != rank(shape2));
dim_t shape = shape1;
for (int32_t i = 0; i < rank(shape); i++) {
if (shape[i] == 1) {
shape[i] = shape2[i];
} else {
ERROR_IF(shape2[i] != 1 && shape2[i] != shape[i]);
}
}
return shape;
}
The following function maps an index in the output tensor to an index in the input tensor.
// The index argument should be a valid location within out_shape.
// The function returns the location within in_shape that contributes
// to the output based on broadcasting rules.
dim_t apply_broadcast(dim_t out_shape, dim_t in_shape, dim_t index) {
ERROR_IF(rank(out_shape) != rank(in_shape));
ERROR_IF(rank(out_shape) != rank(index));
for (int32_t i = 0; i < rank(out_shape); i++) {
if (out_shape[i] != in_shape[i]) {
ERROR_IF(in_shape[i] != 1);
index[i] = 0;
}
}
return index;
}
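The two broadcast helpers above can be sketched in Python; size-1 axes broadcast against the other input, and a broadcast axis always reads element 0 of the input:

```python
def broadcast_shape(shape1, shape2):
    # Ranks must match; a dimension of 1 takes its size from the other input.
    assert len(shape1) == len(shape2)
    shape = list(shape1)
    for i in range(len(shape)):
        if shape[i] == 1:
            shape[i] = shape2[i]
        else:
            assert shape2[i] == 1 or shape2[i] == shape[i]
    return shape

def apply_broadcast(out_shape, in_shape, index):
    # Map an output coordinate to the input coordinate that feeds it:
    # broadcast (size-1) axes always read element 0 of the input.
    assert len(out_shape) == len(in_shape) == len(index)
    index = list(index)
    for i in range(len(out_shape)):
        if out_shape[i] != in_shape[i]:
            assert in_shape[i] == 1
            index[i] = 0
    return index
```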
4.3. General Pseudocode Helpers
This section contains general pseudocode utility functions used throughout the specification.
4.3.1. Arithmetic Helpers
The following functions provide arithmetic while defining requirements such that values stay in the valid range.
in_t apply_add_s<in_t>(in_t a, in_t b) {
if (is_floating_point(in_t)) return a + b;
int64_t c = sign_extend<int64_t>(a) + sign_extend<int64_t>(b);
REQUIRE(c >= minimum_s<in_t> && c <= maximum_s<in_t>);
return static_cast<in_t>(c);
}
in_t apply_add_u<in_t>(in_t a, in_t b) {
if (is_floating_point(in_t)) return a + b;
uint64_t c = zero_extend<uint64_t>(a) + zero_extend<uint64_t>(b);
REQUIRE(c >= minimum_u<in_t> && c <= maximum_u<in_t>);
return truncate<in_t>(c);
}
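As a worked illustration of the signed helper, a minimal Python sketch for i32 elements follows. Python integers are exact, so they stand in for the int64_t intermediate; the `_i32` name is illustrative, not part of the specification:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def apply_add_s_i32(a, b):
    # Signed addition with the spec's REQUIRE range check:
    # the exact sum must stay representable in the element type.
    c = a + b
    assert INT32_MIN <= c <= INT32_MAX, "REQUIRE failed: result out of range"
    return c
```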
in_t apply_arith_rshift<in_t>(in_t a, in_t b) {
int32_t c = sign_extend<int32_t>(a) >> sign_extend<int32_t>(b);
return static_cast<in_t>(c);
}
in_t apply_intdiv_s<in_t>(in_t a, in_t b) {
int64_t c = sign_extend<int64_t>(a) / sign_extend<int64_t>(b);
REQUIRE(c >= minimum_s<in_t> && c <= maximum_s<in_t>);
return static_cast<in_t>(c);
}
in_t apply_ceil<in_t>(in_t input) {
return input value rounded up to nearest integer
}
in_t apply_clip_s<in_t>(in_t value, in_t min_val, in_t max_val) {
if (is_floating_point(in_t)) {
REQUIRE(min_val <= max_val);
}
else {
REQUIRE(sign_extend<int64_t>(min_val) <= sign_extend<int64_t>(max_val));
}
value = apply_max_s<in_t>(value, min_val);
value = apply_min_s<in_t>(value, max_val);
return value;
}
in_t apply_exp<in_t>(in_t input) {
return e to the power input
}
in_t apply_floor<in_t>(in_t input) {
return input value rounded down to nearest integer
}
in_t apply_log<in_t>(in_t input) {
if (input == 0) {
return -INFINITY;
}
else if (input < 0) {
return NaN;
}
return the natural logarithm of input
}
in_t apply_logical_rshift<in_t>(in_t a, in_t b) {
uint32_t c = zero_extend<uint32_t>(a) >> zero_extend<uint32_t>(b);
return static_cast<in_t>(c);
}
in_t apply_max_s<in_t>(in_t a, in_t b) {
if (is_floating_point(in_t)) {
if (isNaN(a) || isNaN(b)) {
return NaN;
}
if (a >= b) return a; else return b;
}
// Integer version
if (sign_extend<int64_t>(a) >= sign_extend<int64_t>(b)) return a; else return b;
}
in_t apply_min_s<in_t>(in_t a, in_t b) {
if (is_floating_point(in_t)) {
if (isNaN(a) || isNaN(b)) {
return NaN;
}
if (a < b) return a; else return b;
}
// Integer version
if (sign_extend<int64_t>(a) < sign_extend<int64_t>(b)) return a; else return b;
}
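A Python sketch of the floating-point branch of apply_max_s, showing the required NaN propagation (this differs from Python's built-in max, whose result with a NaN operand depends on argument order):

```python
import math

def apply_max_s(a, b):
    # Floating-point maximum that propagates NaN, as required above.
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return a if a >= b else b
```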
in_t apply_mul_s<in_t>(in_t a, in_t b) {
if (is_floating_point(in_t)) return a * b;
int64_t c = sign_extend<int64_t>(a) * sign_extend<int64_t>(b);
return static_cast<in_t>(c);
}
in_t apply_pow<in_t>(in_t a, in_t b) {
return a ** b; // a raised to the power b
}
in_t apply_sqrt<in_t>(in_t input) {
return the square root of input
}
in_t apply_sub_s<in_t>(in_t a, in_t b) {
if (is_floating_point(in_t)) return a - b;
int64_t c = sign_extend<int64_t>(a) - sign_extend<int64_t>(b);
REQUIRE(c >= minimum_s<in_t> && c <= maximum_s<in_t>);
return static_cast<in_t>(c);
}
in_t apply_sub_u<in_t>(in_t a, in_t b) {
uint64_t c = zero_extend<uint64_t>(a) - zero_extend<uint64_t>(b);
REQUIRE(c >= minimum_u<in_t> && c <= maximum_u<in_t>);
return truncate<in_t>(c);
}
int32_t count_leading_zeros(int32_t a) {
int32_t acc = 32;
if (a != 0) {
uint32_t mask;
mask = 1 << (32 - 1); // width of int32_t - 1
acc = 0;
while ((mask & a) == 0) {
mask = mask >> 1;
acc = acc + 1;
}
}
return acc;
}
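A Python sketch of this bit scan; the input is masked to its 32-bit pattern so that negative inputs behave like their two's-complement representation:

```python
def count_leading_zeros(a):
    # Number of leading zero bits in the 32-bit representation of a.
    a &= 0xFFFFFFFF  # treat the input as a 32-bit pattern
    acc = 32
    if a != 0:
        mask = 1 << 31  # width of int32_t - 1
        acc = 0
        while (mask & a) == 0:
            mask >>= 1
            acc += 1
    return acc
```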
4.3.2. Type Conversion Helpers
The following definitions indicate the type to be used when the given parameters are provided.
// Returns a signed version of the given type
// A no-op for floating-point types
Type make_signed(Type in_t)
{
switch(in_t) {
case bool_t:
return bool_t;
case i8_t:
return int8_t;
case i16_t:
return int16_t;
case i32_t:
return int32_t;
case i48_t:
return int48_t;
case fp16_t:
return fp16_t;
case bf16_t:
return bf16_t;
case fp32_t:
return fp32_t;
}
}
// Returns the unsigned version of the given type
// Error to call this with anything but i8_t or i16_t
Type make_unsigned(Type in_t)
{
ERROR_IF(in_t != i8_t && in_t != i16_t);
switch(in_t) {
case i8_t:
return uint8_t;
case i16_t:
return uint16_t;
}
}
out_t static_cast<out_t>(in_t value)
{
// Operates similarly to the C++ standard static_cast
// Limited to simple numeric conversion for TOSA.
// Sign extends signed integer input types if needed
// Zero extends unsigned integer input types if needed
// Truncates when converting to a smaller width data type
// Conversion from integer to floating-point is exact if possible
// If converting between signless integer types, treated as signed integer
}
out_t bitcast<out_t>(in_t value)
{
// Treats the bits of value as if they were of type out_t
// Only supported for integer types of the same bit width
}
4.3.3. Numeric Conversion Helpers
The following definitions are used in pseudocode to do numeric conversions. Where the float_t type is used, it represents all of the floating-point data types supported by the given profile. See [Number formats] for details on the floating-point formats.
int round_to_nearest_int(float_t f)
Converts the floating-point value f to an integer, rounding to the nearest integer value.
For the required precision see the section: Main inference precision requirements.
float_t round_to_nearest_float(in_t f)
Converts the input value into floating-point, rounding to the nearest representable value.
For the required precision see the section: Main inference precision requirements.
out_t sign_extend<out_t>(in_t input)
Floating point values are unchanged.
For two's complement integer values where out_t has more bits than in_t, replicate the top bit of input for all bits between the top bit of input and the top bit of output.
out_t zero_extend<out_t>(in_t input)
Floating point values are unchanged.
For two's complement integer values where out_t has more bits than in_t, insert zero values for all bits between the top bit of input and the top bit of output.
out_t truncate<out_t>(in_t input)
output is the least significant bits of input, to the bit width of out_t.
No-op for floating-point types.
The following definition is used to flatten a list of lists into a single list.
in_t* flatten(in_t lists[]) {
in_t output = [];
for_each(list in lists) {
for_each(element in list) {
output.append(element);
}
}
return output;
}
Generic helper functions used to keep the pseudocode concise.
bool_t is_floating_point(type) {
if (type == fp16_t || type == fp32_t || type == bf16_t)
return true;
return false;
}
int32_t idiv(int32_t input1, int32_t input2) {
return input1 / input2; // Integer divide that truncates towards zero
}
// Integer division that checks input1 is a multiple of input2
int32_t idiv_check(int32_t input1, int32_t input2) {
ERROR_IF(input1 % input2 != 0); // input1 must be a multiple of input2
return input1 / input2; // exact quotient without rounding
}
int32_t length(in_t input)
return number of elements in input list
int32_t rank(in_t input)
return rank of an input tensor
int32_t sum(in_t input[])
return the sum of values of an input list
bool isNaN(float input)
return True if floating-point input value is NaN
float_t pi()
returns value of pi
float_t sin(angle)
return sine of angle given in radians
float_t cos(angle)
return cosine of angle given in radians
bool power_of_two(int32_t value)
return true if value is a power of two, false otherwise
in_out_t maximum_s<Type T>
return the maximum value when interpreting type T as a signed value as returned by the make_signed helper.
in_out_t minimum_s<Type T>
return the minimum value when interpreting type T as a signed value as returned by the make_signed helper.
in_out_t maximum_u<Type T>
return the maximum value when interpreting type T as an unsigned value as returned by the make_unsigned helper.
in_out_t minimum_u<Type T>
return the minimum value when interpreting type T as an unsigned value as returned by the make_unsigned helper.
5. Appendix A
Note: This appendix is at an early stage of development.
5.1. Random data generation
The following function generates a pseudo-random floating-point value in the range -1.0 to +1.0 for use as test data. It uses a modulo (1<<32) recurrent sequence with multiplier derived from "TOSASETS" and the set number.
float set_data(uint32_t set, uint32_t index)
{
uint32_t m = (8*set + 1) * 0x705A5E75; // mod (1<<32) calculation
uint32_t r = m + 1; // mod (1<<32) calculation
for (uint32_t i = 0; i < index; i++) {
r = r * m + 1; // mod (1<<32) calculation
}
float sign = (r>>31)==0 ? +1 : -1;
return sign * (float)(r & 0x7FFFFFFF) / (float)(0x7FFFFFFF);
}
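A direct Python port of this generator can be sketched as follows; Python integers are unbounded, so each step is masked to 32 bits to reproduce the mod (1<<32) recurrence:

```python
def set_data(set_number, index):
    # Pseudo-random float in [-1.0, +1.0] from a modulo 2**32
    # recurrence, following the pseudocode above.
    M32 = 0xFFFFFFFF
    m = ((8 * set_number + 1) * 0x705A5E75) & M32  # mod (1<<32)
    r = (m + 1) & M32                              # mod (1<<32)
    for _ in range(index):
        r = (r * m + 1) & M32                      # mod (1<<32)
    sign = 1.0 if (r >> 31) == 0 else -1.0
    return sign * (r & 0x7FFFFFFF) / float(0x7FFFFFFF)
```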
5.2. Main Inference test data generator
This section describes the function tosa_mi_data(S, KS, p, k, i) that generates test data for main inference compliance. This function takes the following arguments:
- S is the test set number which identifies which generator is used
- KS is the kernel size
- p is the parameter number:
  - 0 for the first input (usually data)
  - 1 for the second input (usually weights)
  - 2 for the third input if present (usually bias)
- k is the index within the kernel in the range 0 <= k < KS
- i is the index within the tensor to write
Some test data values are scaled by the bound parameter B, which is defined in the table below. B is set to the largest value that is both representable by the input type and such that B*B does not overflow the accumulator precision.
inputs type | accumulator type | B value |
fp16 | fp16 | (1<<8) - (1/8) = 255.875 |
fp16 | fp32 | (1<<16) - (1<<5) = 65504 |
bf16 | fp32 | (1<<64) - (1<<56) |
fp32 | fp32 | (1<<64) - (1<<40) |
5.2.1. Test set S=0 generator
The aim of this generator is to check that a sum of products with zero gives a zero result.
p | tosa_mi_data(S, KS, p, k, i) = |
0 | set_data(2*S, i) < 0 ? 0.0 : set_data(2*S+1, i) |
1 | set_data(2*S, i) < 0 ? set_data(2*S+1, i) : 0.0 |
2 | 0.0 |
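The structure of this set guarantees that every elementwise product is zero: for each index, either the first or the second operand is forced to 0.0, depending on the sign of set_data(2*S, i). A Python sketch of the table above (mi_data_s0 is an illustrative name; it reuses a port of set_data from section 5.1, and the unused k argument is omitted):

```python
def set_data(set_number, index):
    # Python port of the set_data generator from section 5.1;
    # all arithmetic is reduced mod 2**32 as in the pseudocode.
    M32 = 0xFFFFFFFF
    m = ((8 * set_number + 1) * 0x705A5E75) & M32
    r = (m + 1) & M32
    for _ in range(index):
        r = (r * m + 1) & M32
    sign = 1.0 if (r >> 31) == 0 else -1.0
    return sign * (r & 0x7FFFFFFF) / float(0x7FFFFFFF)

S = 0  # test set number for this generator

def mi_data_s0(p, i):
    # Entries from the S=0 table: at most one of p=0/p=1 is non-zero.
    if p == 0:
        return 0.0 if set_data(2 * S, i) < 0 else set_data(2 * S + 1, i)
    if p == 1:
        return set_data(2 * S + 1, i) if set_data(2 * S, i) < 0 else 0.0
    return 0.0  # bias
```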
5.2.2. Test set S=1
The aim of this test set is to check values with large exponents.
p | tosa_mi_data(S, KS, p, k, i) = |
0 | (B/sqrt(KS+1))*(0.75 + 0.25*set_data(3*S+0, i)) |
1 | (B/sqrt(KS+1))*(0.75 + 0.25*set_data(3*S+1, i)) |
2 | (B*B/(KS+1))*(0.75 + 0.25*set_data(3*S+2, i)) |
5.2.3. Test set S=2
The aim of this test set is to check rounding error when accumulating small values onto a large value. In this case the small values are of similar magnitude. If the implementation changes the order of the sum, then the test data must also be reordered so that the largest values occur first in the sum.
p | tosa_mi_data(S, KS, p, k, i) = |
0 | (k==0) ? 1.0 : set_data(2*S+0, i)/sqrt(KS) |
1 | (k==0) ? 1.0 : set_data(2*S+1, i)/sqrt(KS) |
2 | 0.0 |
5.2.4. Test set S=3
The aim of this test set is to check rounding error when accumulating small values onto a large value. In this case the small values are of varying magnitude. If the implementation changes the order of the sum, then the test data must also be reordered so that the largest values occur first in the sum.
p | tosa_mi_data(S, KS, p, k, i) = |
0 | (k==0) ? 16.0 : exp(2*set_data(2*S+0, 2*i+0)) * set_data(2*S+0, 2*i+1) |
1 | (k==0) ? 16.0 : exp(2*set_data(2*S+1, 2*i+0)) * set_data(2*S+1, 2*i+1) |
2 | 0.0 |
5.2.5. Test set S=4
The aim of this test set is to check a mixture of zero and non-zero products.
p | tosa_mi_data(S, KS, p, k, i) = |
0 | (k==KS/2) ? +0.5 : (set_data(2*S, i) < 0 ? 0.0 : (B/sqrt(KS))*set_data(2*S+1, i)) |
1 | (k==KS/2) ? -0.5 : (set_data(2*S, i) < 0 ? (B/sqrt(KS))*set_data(2*S+1, i) : 0.0) |
2 | 0.0 |
5.2.6. Test set S=5
The aim of this test set is to check signed inputs of large range.
p | tosa_mi_data(S, KS, p, k, i) = |
0 | (B/sqrt(KS+1))*set_data(3*S+0, i) |
1 | (B/sqrt(KS+1))*set_data(3*S+1, i) |
2 | (B*B/(KS+1))*set_data(3*S+2, i) |
5.3. Main Inference operator test data
For each operator, this section defines how to generate test data for test set S. For the results to be statistically significant, the operation must calculate at least MIN_DOT_PRODUCTS dot products. For most operations this means that the output tensor must have at least MIN_DOT_PRODUCTS output values; where necessary, the batch size can be increased so that this holds. For this version of the specification, MIN_DOT_PRODUCTS is set to 1000.
5.3.1. CONV2D
The following generates input test data for test set S. For a compliant implementation, the test must pass whenever the attributes satisfy: N*OH*OW*OC >= MIN_DOT_PRODUCTS
KS = KW*KH*IC;
for (0 <= n < N, 0 <= iy < IH, 0 <= ix < IW, 0 <= ic < IC) {
input [ n, iy, ix, ic] = tosa_mi_data(S, KS, 0, ((iy % KH)*KW+(ix % KW))*IC+ic, ((n*IH+iy)*IW+ix)*IC+ic);
}
for (0 <= oc < OC, 0 <= ky < KH, 0 <= kx < KW, 0 <= ic < IC) {
weight[oc, ky, kx, ic] = tosa_mi_data(S, KS, 1, (ky*KW+kx)*IC+ic, ((oc*KH+ky)*KW+kx)*IC+ic);
}
for (0 <= oc < OC) {
bias[oc] = tosa_mi_data(S, KS, 2, 0, oc);
}
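The index mapping in these loops can be exercised with a small Python sketch. Here tosa_mi_data is stubbed to record its (p, k, i) arguments rather than generate data, and conv2d_test_data is an illustrative name, not part of the specification:

```python
def tosa_mi_data(S, KS, p, k, i):
    return (p, k, i)  # stub: record the generator indices

def conv2d_test_data(S, N, IH, IW, IC, OC, KH, KW):
    # Reproduces the CONV2D loop nests above, collecting results in dicts.
    KS = KW * KH * IC
    input_, weight, bias = {}, {}, {}
    for n in range(N):
        for iy in range(IH):
            for ix in range(IW):
                for ic in range(IC):
                    k = ((iy % KH) * KW + (ix % KW)) * IC + ic
                    i = ((n * IH + iy) * IW + ix) * IC + ic
                    input_[(n, iy, ix, ic)] = tosa_mi_data(S, KS, 0, k, i)
    for oc in range(OC):
        for ky in range(KH):
            for kx in range(KW):
                for ic in range(IC):
                    k = (ky * KW + kx) * IC + ic
                    i = ((oc * KH + ky) * KW + kx) * IC + ic
                    weight[(oc, ky, kx, ic)] = tosa_mi_data(S, KS, 1, k, i)
    for oc in range(OC):
        # bias entries use kernel index 0
        bias[oc] = tosa_mi_data(S, KS, 2, 0, oc)
    return input_, weight, bias
```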
5.3.2. CONV3D
The following generates input test data for test set S. For a compliant implementation, the test must pass whenever the attributes satisfy: N*OD*OH*OW*OC >= MIN_DOT_PRODUCTS
KS = KD*KW*KH*IC;
for (0 <= n < N, 0 <= id < ID, 0 <= iy < IH, 0 <= ix < IW, 0 <= ic < IC) {
input [ n, id, iy, ix, ic] = tosa_mi_data(S, KS, 0, (((id % KD)*KH+(iy % KH))*KW+(ix % KW))*IC+ic, (((n*ID+id)*IH+iy)*IW+ix)*IC+ic);
}
for (0 <= oc < OC, 0 <= kd < KD, 0 <= ky < KH, 0 <= kx < KW, 0 <= ic < IC) {
weight[oc, kd, ky, kx, ic] = tosa_mi_data(S, KS, 1, ((kd*KH+ky)*KW+kx)*IC+ic, (((oc*KD+kd)*KH+ky)*KW+kx)*IC+ic);
}
for (0 <= oc < OC) {
bias[oc] = tosa_mi_data(S, KS, 2, 0, oc);
}
5.3.3. DEPTHWISE_CONV2D
The following generates input test data for test set S. For a compliant implementation, the test must pass whenever the attributes satisfy: N*OH*OW*C*M >= MIN_DOT_PRODUCTS
KS = KW*KH;
for (0 <= n < N, 0 <= iy < IH, 0 <= ix < IW, 0 <= c < C) {
input [ n, iy, ix, c] = tosa_mi_data(S, KS, 0, (iy % KH)*KW+(ix % KW), ((n*IH+iy)*IW+ix)*C+c);
}
for (0 <= ky < KH, 0 <= kx < KW, 0 <= c < C, 0 <= m < M) {
weight[ky, kx, c, m] = tosa_mi_data(S, KS, 1, (ky*KW+kx), ((ky*KW+kx)*C+c)*M+m);
}
for (0 <= oc < C*M) {
bias[oc] = tosa_mi_data(S, KS, 2, 0, oc);
}
5.3.4. FULLY_CONNECTED
The following generates input test data for test set S. For a compliant implementation, the test must pass whenever the attributes satisfy: N*OC >= MIN_DOT_PRODUCTS
KS = IC;
for (0 <= n < N, 0 <= ic < IC) {
input [ n, ic] = tosa_mi_data(S, KS, 0, ic, n*IC+ic);
}
for (0 <= oc < OC, 0 <= ic < IC) {
weight[oc, ic] = tosa_mi_data(S, KS, 1, ic, oc*IC+ic);
}
for (0 <= oc < OC) {
bias[oc] = tosa_mi_data(S, KS, 2, 0, oc);
}
5.3.5. MATMUL
The following generates input test data for test set S. For a compliant implementation, the test must pass whenever the attributes satisfy: N*H*W >= MIN_DOT_PRODUCTS
KS = C;
for (0 <= n < N, 0 <= y < H, 0 <= c < C) {
A[n, y, c] = tosa_mi_data(S, KS, 0, c, (n*H+y)*C+c);
}
for (0 <= n < N, 0 <= c < C, 0 <= x < W) {
B[n, c, x] = tosa_mi_data(S, KS, 1, c, (n*C+c)*W+x);
}
5.3.6. TRANSPOSE_CONV2D
The following generates input test data for test set S. For a compliant implementation, the test must pass whenever the attributes satisfy: N*OH*OW*OC >= MIN_DOT_PRODUCTS
KS = KW*KH*IC;
for (0 <= n < N, 0 <= iy < IH, 0 <= ix < IW, 0 <= ic < IC) {
input [ n, iy, ix, ic] = tosa_mi_data(S, KS, 0, ((iy % KH)*KW+(ix % KW))*IC+ic, ((n*IH+iy)*IW+ix)*IC+ic);
}
for (0 <= oc < OC, 0 <= ky < KH, 0 <= kx < KW, 0 <= ic < IC) {
weight[oc, ky, kx, ic] = tosa_mi_data(S, KS, 1, (ky*KW+kx)*IC+ic, ((oc*KH+ky)*KW+kx)*IC+ic);
}
for (0 <= oc < OC) {
bias[oc] = tosa_mi_data(S, KS, 2, 0, oc);
}
5.3.7. FFT2D
The following generates input test data for test set S. For a compliant implementation, the test must pass whenever the attributes satisfy: N*H*W >= MIN_DOT_PRODUCTS
KS = 2*H*W;
for (0 <= n < N, 0 <= y < H, 0 <= x < W) {
input_real[n, y, x] = tosa_mi_data(S, KS, 0, y*W+x, ((0*N+n)*H+y)*W+x);
input_imag[n, y, x] = tosa_mi_data(S, KS, 0, y*W+x, ((1*N+n)*H+y)*W+x);
}
for (0 <= y < H, 0 <= x < W, 0 <= m < H, 0 <= n < W) {
weight_real[y, x, m, n] = real(exp(2*pi*i*((m*y/H) + (n*x/W))));
weight_imag[y, x, m, n] = imag(exp(2*pi*i*((m*y/H) + (n*x/W))));
}
5.3.8. REDUCE_SUM
The following generates input test data for test set S. For a compliant implementation, the test must pass whenever the attributes satisfy: tensor_size(shape) >= MIN_DOT_PRODUCTS
KS = shape1[axis];
for (index in shape1) {
input[index] = tosa_mi_data(S, KS, 0, index[axis], tensor_index_to_offset(shape1, index));
}
for (0 <= c < KS) {
weight[c] = 1;
}
5.3.9. AVG_POOL2D
The following generates input test data for test set S. For a compliant implementation, the test must pass whenever the attributes satisfy: N*OH*OW*C >= MIN_DOT_PRODUCTS
KX = kernel_x;
KY = kernel_y;
KS = KX*KY;
for (0 <= n < N, 0 <= iy < IH, 0 <= ix < IW, 0 <= c < C) {
input [ n, iy, ix, c] = tosa_mi_data(S, KS, 0, ((iy % KY)*KX+(ix % KX))*C+c, ((n*IH+iy)*IW+ix)*C+c);
}
for (0 <= ky < KY, 0 <= kx < KX) {
weight[ky, kx] = 1/KS;
}