

# A Fast and Compact RISC-V Accelerator for Ascon and Friends

**Stefan Steinegger and Robert Primas** 

CARDIS 2020, 19th Smart Card Research and Advanced Application Conference

- Reduce area overhead compared to dedicated co-processors
- Select versatile building block
  - Usable for authenticated encryption, hashing, PRNG,...
- Versatile building block ASCON-p
  - Used by ASCON and ISAP
- Accelerator with ISAP mode
  - AEAD with hardening against implementation attacks
  - Desired feature in NIST LWC competition
  - Covers both power analysis and fault attacks
  - No need for protected hardware building blocks

# Background

#### Ascon

- Sponge-based AEAD scheme
- First choice in CAESAR's lightweight category
- Also competes in NIST LWC competition
- 320-bit state in $5 \times 64$-bit lanes
- Rate of 64 or 128 bits
- Permutation function ASCON-*p*
- Authenticated encryption, hashing, PRNG

#### **Ascon Encryption**

- ASCON-*p* is applied with $a = 12$ rounds ($p^a$) and with $b = 6$ or $8$ rounds ($p^b$)
- Hashing works in a similar manner
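
As a rough illustration of how these calls add up, the sketch below counts the ASCON-*p* rounds needed for one encryption. It assumes the block and padding rules of the ASCON v1.2 specification (the plaintext is always padded, associated data is absorbed only when non-empty). The function name `ascon_round_count` and its parameters are illustrative and do not come from the talk.

```python
def ascon_round_count(msg_len, ad_len=0, rate=8, a=12, b=6):
    """Count ASCON-p rounds for encrypting msg_len bytes (illustrative sketch).

    rate is the rate in bytes (8 or 16), a and b are the round numbers
    of p^a and p^b.
    """
    # Associated data is padded and absorbed only when it is non-empty.
    ad_blocks = ad_len // rate + 1 if ad_len > 0 else 0
    # The plaintext is always padded, so there is at least one block.
    msg_blocks = msg_len // rate + 1

    rounds = a                       # initialization: p^a
    rounds += b * ad_blocks          # p^b after every associated-data block
    rounds += b * (msg_blocks - 1)   # p^b between plaintext blocks
    rounds += a                      # finalization: p^a
    return rounds


if __name__ == "__main__":
    for n in (0, 64, 1536):
        print(n, "bytes:", ascon_round_count(n), "rounds")
```

With a rate of 8 bytes and $b = 6$, the per-byte cost for long messages approaches $6/8 = 0.75$ rounds, which is why a fast single-round instruction pays off directly.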



#### ISAP

- Mode for authenticated encryption
- Mode-level hardening against various physical attacks
  - DPA, DFA, SFA, SIFA
- Two-pass scheme
- Currently in the 2<sup>nd</sup> round of the NIST LWC competition
- 2 of 4 parametrizations are based on ASCON-*p*

#### ASCON-*p*

- Transforms state in 3 steps:
  - Round constant addition
  - Substitution layer
  - Linear layer
- Versatile and flexible:
  - ISAP-A-128a, ISAP-A-128
  - ASCON, ASCON-HASH, ASCON-XOF
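
The three steps can be sketched in software as follows. This is a plain Python model written from the public ASCON specification (round constants, bitsliced S-box, rotation amounts), not the hardware description from the talk. The function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def rotr(x, n):
    """Rotate a 64-bit word right by n bits."""
    return ((x >> n) | (x << (64 - n))) & MASK64


def round_constants(rounds):
    """Constants for the last `rounds` rounds of the 12-round schedule."""
    return [0xF0 - r * 0x10 + r for r in range(12 - rounds, 12)]


def sbox_layer(x):
    """Bitsliced 5-bit S-box applied to all 64 columns at once."""
    s = list(x)
    s[0] ^= s[4]
    s[4] ^= s[3]
    s[2] ^= s[1]
    t = [(~s[i] & MASK64) & s[(i + 1) % 5] for i in range(5)]
    for i in range(5):
        s[i] ^= t[(i + 1) % 5]
    s[1] ^= s[0]
    s[0] ^= s[4]
    s[3] ^= s[2]
    s[2] ^= MASK64
    return s


def linear_layer(x):
    """Per-lane diffusion: each lane is XORed with two rotations of itself."""
    rot = [(19, 28), (61, 39), (1, 6), (10, 17), (7, 41)]
    return [w ^ rotr(w, a) ^ rotr(w, b) for w, (a, b) in zip(x, rot)]


def ascon_p(state, rounds):
    """Apply `rounds` rounds of ASCON-p to a list of five 64-bit lanes."""
    s = list(state)
    for c in round_constants(rounds):
        s[2] ^= c            # round constant addition
        s = sbox_layer(s)    # substitution layer
        s = linear_layer(s)  # linear layer
    return s
```

Because every mode listed above only calls `ascon_p` with a different round count, accelerating this one function covers all of them.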

#### Building an Accelerator

- Typically co-processor designs
  - Loosely coupled and connected over a bus (e.g. AXI)
  - Internal sets of registers (more area)
  - Great for lots of processing on very little data
  - Less flexible
- Our approach:
  - Reuse parts of register file, implement only the combinational logic of ASCON-*p*
  - 320 bits $\rightarrow$ ten 32-bit or five 64-bit registers
  - ARMv7 and RISC-V RV32E: 16 registers
  - State fits in CPU registers (less area)
  - Integrate tightly as new custom instruction
  - Mode remains entirely in software
  - Basic building block for ASCON and ISAP

#### Building an Accelerator cont.

- Add ASCON-p on RI5CY/CV32E40P core
- 32-bit RISC-V CPU implementing RV32IM[F]C ISA
  - 32 general purpose registers
  - Additional instructions for DSP
  - 4-stage pipeline
- Necessary changes:
  - Instruction encoding
  - Register file adaptations
  - Decode stage adaptations

#### Instruction encoding

- I-type instruction
- 12-bit immediate encodes:
  - Round constant (8 bit)
  - Number of rounds (3 bit)
  - Endianness (1 bit)
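
A minimal sketch of how such an immediate could be packed into a RISC-V I-type word. The I-type layout (`imm[11:0] | rs1 | funct3 | rd | opcode`) is the standard one, but the order of the three fields inside the immediate, the use of the `custom-0` opcode, and the zeroed `rs1`/`funct3`/`rd` fields are assumptions made for illustration. The talk does not specify them.

```python
# Assumption: the standard RISC-V "custom-0" major opcode is used.
CUSTOM0_OPCODE = 0x0B


def pack_imm(round_const, rounds, big_endian):
    """Pack the three fields into a 12-bit immediate (assumed bit layout).

    Assumed layout: bits [7:0] round constant, [10:8] rounds, [11] endianness.
    """
    if not 0 <= round_const < (1 << 8):
        raise ValueError("round constant must fit in 8 bits")
    if not 0 <= rounds < (1 << 3):
        raise ValueError("number of rounds must fit in 3 bits")
    return (int(bool(big_endian)) << 11) | (rounds << 8) | round_const


def unpack_imm(imm):
    """Inverse of pack_imm: returns (round_const, rounds, big_endian)."""
    return imm & 0xFF, (imm >> 8) & 0x7, bool((imm >> 11) & 1)


def encode_itype(imm, rs1=0, funct3=0, rd=0, opcode=CUSTOM0_OPCODE):
    """Assemble a 32-bit I-type instruction word."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


if __name__ == "__main__":
    imm = pack_imm(round_const=0xF0, rounds=1, big_endian=False)
    print(hex(encode_itype(imm)))
```

Carrying the round constant in the immediate keeps the hardware free of a round counter; the software loop supplies the constant for each call.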



#### Register file adaptations

- ASCON-*p* requires ten 32-bit registers
- Solution:
  - Dynamic register selection not strictly needed
  - Define fixed subset as ASCON-*p* registers
  - Toggle between r/w port and accelerator
- Each 64-bit ASCON-*p* lane is split across two 32-bit registers
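
The mapping between the five 64-bit lanes and ten 32-bit registers can be sketched as below. The ordering (low half first, lanes in consecutive register pairs) is an assumption for illustration; the talk does not state which registers hold which half.

```python
MASK32 = 0xFFFFFFFF


def lanes_to_regs(lanes):
    """Split 64-bit lanes into 32-bit register values (assumed: low half first)."""
    regs = []
    for lane in lanes:
        regs.append(lane & MASK32)          # low 32 bits
        regs.append((lane >> 32) & MASK32)  # high 32 bits
    return regs


def regs_to_lanes(regs):
    """Recombine pairs of 32-bit register values into 64-bit lanes."""
    return [regs[i] | (regs[i + 1] << 32) for i in range(0, len(regs), 2)]


if __name__ == "__main__":
    state = [0x0123456789ABCDEF, 0, 1, 2, 3]
    regs = lanes_to_regs(state)
    print(len(regs), [hex(r) for r in regs[:2]])
```

With a fixed assignment like this, the accelerator can be wired to the ten registers directly and needs no register-select logic of its own.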



#### Decode stage adaptations

- Add instruction to decoder
- Add single cycle ASCON-*p* accelerator
- ASCON-*p* instruction:
  - Enables accelerator
  - Toggles register inputs to accelerator output
- Limitations:
  - ALU and load-store unit forward results to next instruction
  - $\rightarrow$ ASCON-*p* registers must not be altered in the cycle before
  - $\rightarrow$ ASCON-*p* registers must not be the target of a load two cycles before
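
These two restrictions could be checked on an instruction sequence with a small script like the one below. It is only a sketch under several assumptions: one instruction issues per cycle, a load is treated as unsafe in either of the two preceding slots, and the ten state registers are a made-up contiguous range. None of these details are given in the talk.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    kind: str                  # "alu", "load", "ascon_p", or "other"
    rd: Optional[int] = None   # destination register index, if any


# Hypothetical choice of ten state registers; the talk only says the subset is fixed.
STATE_REGS = frozenset(range(8, 18))


def find_hazards(program):
    """Return (writer_index, ascon_index) pairs that violate the restrictions."""
    hazards = []
    for i, ins in enumerate(program):
        if ins.kind != "ascon_p":
            continue
        for dist in (1, 2):
            j = i - dist
            if j < 0:
                continue
            prev = program[j]
            writes_state = prev.rd in STATE_REGS
            alu_hazard = prev.kind == "alu" and dist == 1  # forwarded ALU result
            load_hazard = prev.kind == "load"              # load within two slots
            if writes_state and (alu_hazard or load_hazard):
                hazards.append((j, i))
    return hazards


if __name__ == "__main__":
    prog = [Instr("load", 8), Instr("other"), Instr("ascon_p")]
    print(find_hazards(prog))
```

In hand-written assembly the same effect is achieved by scheduling unrelated instructions between state updates and the ASCON-*p* instruction.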



- Main data processing loop

[Figure: main data processing loop; only the label "Plaintext" is recoverable]


### Performance

**Table 1:** RISC-V RI5CY: Runtime and code size comparison of ASCON and ISAP, with/without 1-round ASCON-*p* hardware acceleration (HW-A)

| Implementation         | Cycles/Byte (64 B) | Cycles/Byte (1536 B) | Cycles/Byte (long) | Binary Size (B) |
|------------------------|--------------------|----------------------|--------------------|-----------------|
| Ascon-C (-O3)          | 164.3              | 110.6                | 108.3              | 11716           |
| Ascon-C (-Os)          | 269.7              | 187.1                | 183.5              | 2104            |
| Ascon-ASM + HW-A       | 4.2                | 2.2                  | 2.1                | 888             |
| AsconHash-ASM + HW-A   | 4.6                | 2.6                  | 2.5                | 484             |
| ISAP-A-128a-C (-O3)    | 1184.3             | 386.9                | 352.3              | 11052           |
| ISAP-A-128a-ASM + HW-A | 29.1               | 5.2                  | 4.2                | 1844            |
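
To make the gap in Table 1 easier to read, the snippet below derives speed-up factors from the cycles/byte figures. The numbers are copied from the table; the dictionary and the `speedup` helper are illustrative names.

```python
# Cycles/byte from Table 1 for 64 B, 1536 B and long messages.
cycles_per_byte = {
    "Ascon-C (-O3)": (164.3, 110.6, 108.3),
    "Ascon-ASM + HW-A": (4.2, 2.2, 2.1),
    "ISAP-A-128a-C (-O3)": (1184.3, 386.9, 352.3),
    "ISAP-A-128a-ASM + HW-A": (29.1, 5.2, 4.2),
}


def speedup(software, accelerated):
    """Ratio of software to accelerated cycles/byte, per message length."""
    sw = cycles_per_byte[software]
    hw = cycles_per_byte[accelerated]
    return tuple(round(s / h, 1) for s, h in zip(sw, hw))


ascon_speedup = speedup("Ascon-C (-O3)", "Ascon-ASM + HW-A")
isap_speedup = speedup("ISAP-A-128a-C (-O3)", "ISAP-A-128a-ASM + HW-A")

if __name__ == "__main__":
    print("ASCON speed-up (64 B, 1536 B, long):", ascon_speedup)
    print("ISAP speed-up  (64 B, 1536 B, long):", isap_speedup)
```

For long messages this works out to a factor of roughly 50 for ASCON and roughly 80 for ISAP relative to the `-O3` C implementations.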

**Table 2:** Area comparison of the RISC-V RI5CY core and various co-processor designs

| Design                                 | Standalone (kGE) | Integration (kGE) |
|----------------------------------------|------------------|-------------------|
| RI5CY base design                      | 45.6             | -                 |
| This work                              | 4.2              | 0.5               |
| ASCON co-processor [Gro+15]            | 7.1              | ?                 |
| ASCON co-processor [Gro]               | 9.4              | ?                 |
| ISAP co-processor (estimated) [Dob+19] | $\leq$ 12.8      | ?                 |

### Implementation Security of ISAP

- ISAP offers protection/hardening against DPA, DFA, SFA, SIFA
  - See specification for details
- SPA is possible but limited by highly parallel computation in the accelerator
- With the accelerator, state setup is the most lucrative target
- Localized EM analysis could allow successful SPA
  - Add e.g. shuffling
- Analyzed in detail in the paper

## Summary

- ASCON-*p* presents a versatile building block
- Acceleration of one block can be used for a broad range of cryptographic primitives
  - With and without implementation security
- One can get more performance *and* better protection from implementation attacks

# Thank you!

- [Dob+19] C. Dobraunig, M. Eichlseder, S. Mangard, F. Mendel, B. Mennink, R. Primas, and T. Unterluggauer. ISAP v2.0. Submission to the NIST Lightweight Crypto Competition. https://csrc.nist.gov/CSRC/media/Projects/lightweight-cryptography/documents/round-2/spec-doc-rnd2/isap-spec-round2.pdf. 2019.
- [Gro] H. Groß. CAESAR Hardware API reference implementation. https://github.com/IAIK/ascon_hardware/tree/master/caesar_hardware_api_v_1_0_3/ASCON_ASCON (accessed 12/2019).
- [Gro+15] H. Groß, E. Wenger, C. Dobraunig, and C. Ehrenhöfer. Suit up! Made-to-Measure Hardware Implementations of ASCON. In: DSD. IEEE Computer Society, 2015, pp. 645–652.