Acceleration of Image Processing with SHA-3 (Keccak) Algorithm using FPGA

In our digital world, the transmission of images between people plays an essential part in everyday communication. As a result, procedures that ensure the integrity and accuracy of the communicated data are required. Today, hashing is the most popular and secure way to achieve this. This article focuses on using SHA-3 to hash images of 256 × 256 pixels with our custom FPGA implementations written in the Very High Speed Integrated Circuit Hardware Description Language (VHDL). We perform our experiments on the Intel Arria 10 GX FPGA with the Nios II processor. Our experiments, which calculate metrics such as entropy, NPCR and UACI, show that SHA-3 is secure, reliable and has high application potential for hashing images. We propose designs that improve throughput, security and efficiency. We strengthened our design using the Floating Point Hardware 2 (FPH-2) IP block. Experiments with the proposed implementation show that the throughput of the SHA-3 algorithm increases by 14.38% and its efficiency by 13.95%. Finally, we compare our findings with existing optimization methodologies from other researchers, providing data that demonstrate the strengths of our approach.


Introduction
As with any other transmitted information, the integrity of a transferred image is ensured by cryptographic hash functions. These functions play an essential role in today's world of digital transmissions: they are a key technology for protecting the integrity of information transmitted over a network. Nowadays, image information security is crucial, mainly in the military, meteorology, medicine, intelligent robotics, commerce, etc. As a result, the creation of image hash functions has become a mission of the cryptographic community [1]-[3].
Watermarking is a technique for protecting digital images and video against alteration or corruption. Hash functions can be used successfully in authentication and image watermarking applications [4]. In addition, an image hashing procedure would significantly simplify investigations. Moreover, hashing is used for comparisons in very large databases, in which many similar versions of an image can exist [5].
In this paper, we developed and implemented the well-known Keccak (SHA3-256) algorithm on the Intel Arria 10 GX FPGA board. We used SHA-3 with a 256-bit output size because it provides high security and maintains the original image quality during the hashing process. Our tailor-made design includes an FPH-2-based approach. We compare the two designs we developed with other similar models using standard evaluation criteria (entropy, Unified Averaged Changed Intensity (UACI), Number of Pixel Changing Rate (NPCR), efficiency and throughput).
The main contributions of our work are:
• We propose a novel two-stage pipelined design for the SHA-3 algorithm with a 256-bit output length for 256 × 256 pixel images, optimizing acceleration and performance on FPGA devices. We used SHA-3 with a 256-bit output length because it provides high security.
• We contribute an innovative approach based on the FPH-2 component in our design, which delivers a lower cycle count. We analysed the optimisation plan to maximise the throughput and efficiency measures while the SHA-3 algorithm preserves the actual image quality.
algorithm. The objective of all these models is to enhance performance while at the same time reducing power consumption and area on the FPGA board. An efficient and secure image encryption algorithm is proposed in [6], which combines the SHA-3 hash function with two-dimensional Arnold chaotic maps. In the permutation step, a conventional encryption technique is described with four random shuffling rules to avoid the time cost of the pixel position index sorting phase. Numerical results show that the proposed encryption technique can improve security and speed up digital image transmission.
In [7], the authors focus on 256 × 256 grayscale image encryption. The implementation was done using VHDL. The results show that the proposed architecture for the SHA3-256 algorithm achieved a throughput of 35.593 Gbps, a maximum frequency of 458 MHz, an area of 2,984 slices and an efficiency of 11.92 Mbps/Slices. The authors in [8] propose a new chaotic image encryption algorithm based on dual chaotic maps. SHA-3 and an auto-updating system calculate the hash values used to construct the control parameter and initial condition of a Logistic map. After that, permutations are performed on the rows and columns of the image to exchange pixels. As a result, the presented algorithm can resist known-plaintext attacks efficiently.
In [9], the authors evaluate all candidates in the SHA-3 competition in terms of their area efficiency (slices). Their research was conducted with Virtex-5 and Virtex-6 FPGA devices, and the implementation was done using VHDL. Their architecture for the SHA3-256 algorithm achieved its best results on the Virtex-6 device, with a throughput of 1.071 Gbps, a maximum frequency of 197 MHz, an area of 397 slices and an efficiency of 2.69 Mbps/Slices. The authors in [10] focus on FPGA implementations of all SHA-3 finalists. The main goal of the research is to analyze the performance of all candidates in terms of throughput and area. They used Virtex-5 and Virtex-6 devices from Xilinx and Stratix III and Stratix IV devices from Intel, and the implementation was done using VHDL. The results show that the proposed architecture for the SHA3-256 (Keccak) algorithm achieves its best results on the Virtex-6 device: a throughput of 16.236 Gbps, an area of 1,446 slices and an efficiency of 11.23 Mbps/Slices.
In [11], the authors assess all SHA-3 finalists on FPGA devices. The primary goal of their work is to compare all candidates using the evaluation criteria of throughput, clock frequency and area. Their research was conducted with Virtex-5, Virtex-6 and Virtex-7 FPGA devices, and the implementation was done using VHDL. The results show that the proposed design for the SHA-3 algorithm achieves better results on the Virtex-5 device in clock frequency, area and performance than the other candidates.
The authors in [12] deal with high-performance implementations of all SHA-3 finalists on FPGAs. The main goal of the research is to provide a fair and comprehensive evaluation of all candidates in terms of throughput and area. Their work used Xilinx Virtex-5 and Virtex-6 devices, and the implementation was done using VHDL. Their architecture for the SHA3-256 (Keccak) algorithm on the Virtex-6 device achieved a throughput of 12.817 Gbps, an efficiency of 10.08 Mbps/Slices, a maximum frequency of 282.7 MHz and an area of 1,272 slices.
In [13], the authors investigated the computational efficiency of all SHA-3 finalists on FPGA devices. The primary purpose of this study is to compare the designs in terms of hashing performance per unit area. The work was done using a Virtex-5 FPGA chip with VHDL as the implementation language. According to the results, the suggested design for the SHA3-256 (Keccak) algorithm requires an area of 1,117 slices, reaches a maximum frequency of 189 MHz, and has a throughput of 6.263 Gbps and an efficiency of 3.17 Mbps/Slices.
The authors in [14] deal with the efficient implementation of all SHA-3 finalists on FPGAs. The main goal of the research is to provide a basic comparison between all candidates in terms of clock frequency, throughput and area. They used a Xilinx FPGA device in their work, and the implementation was done using VHDL. Their architecture for the SHA3-256 algorithm achieved a throughput of 11.9 Gbps, a maximum frequency of 215 MHz and an area of 4,745 slices.
In [15], the authors proposed a pipelined architecture for the SHA-3 algorithm in order to raise its efficiency and throughput. The proposed architectures were implemented on Virtex-2, Spartan-3 and Virtex-4 FPGAs using Verilog. According to the experimental findings, the suggested designs provide excellent performance on the Virtex-4 device in terms of total area, maximum frequency, throughput and throughput/area.
All of these works, and many more [16]-[25], have as their main goal increasing the throughput and efficiency of the SHA-3 (Keccak) algorithm. However, improved architectures are always needed to enhance throughput and efficiency further. In contrast to previous works, we designed and implemented two SHA-3 designs using the Nios II/f processor. Our first design applies a two-stage pipelined architecture. The second adds a method based on the FPH-2 component to the two-stage pipelined design. Both approaches deliver a secure SHA-3 implementation with 256-bit hashing. The proposed design that combines the FPH-2 component with the two-stage pipelined architecture outperforms existing implementations.

Implementation for Image Hashing
This section describes all the design components we implemented for SHA3-256 (Keccak). In our experiments, we used Quartus II Standard Edition (SE) ver. 18.3 and the DE5a-Net board. Table 1 lists the specifications of the Terasic DE5a-Net board.

Nios II -Soft-Core Embedded Processor
The Nios II is a soft-core processor implemented entirely in the FPGA fabric. It is suitable for most embedded applications and provides flexibility for real-time and cost-sensitive functionality [26]. Nios II is offered in three configurations: fast, standard and economy. The fast core is optimized for the highest performance, which can be increased further using custom instructions, hardware accelerators and a high-bandwidth switch fabric. The standard core offers a balance between performance and size, and the economy core is appropriate for applications with modest performance requirements [27]. In our experiments, we used the Nios II/f processor, as shown in Figure 1. Its main characteristics are a 6-stage pipeline, support for an external interrupt controller, custom instructions, the highest DMIPS/MHz of the three cores, and an optional hardware multiplier to improve arithmetic performance [28,29].

Nios II Custom Instruction Implementation
Custom instructions give us the ability to tailor the Nios II processor to the requirements of an application. A custom instruction logic block interfaces with the Nios II processor through three ports: dataa, datab and result. The custom instruction receives its inputs on the dataa and datab ports and drives its output on the result port. A conduit interface to external logic provides a custom interface to system resources outside the Nios II processor. A combinational custom instruction completes its logic function in a single clock cycle. Multicycle custom instructions require two or more clock cycles to complete. An extended custom instruction allows several different operations to be implemented in one block. An internal register file custom instruction allows the custom logic to use its own internal registers for input, output or both [30]. Figure 2 shows a block diagram with all the ports of a Nios II custom instruction.

Floating Point Hardware 2 (FPH-2)
The floating-point divider takes more resources than the other instructions, so we may choose to omit it if the Nios II application does not use floating-point division. In some cases, the code can be rearranged to reduce or even eliminate division operations. Minimum, maximum, negate, absolute value and comparison operations are all provided by the custom instruction implementations. FPH-2 is preferred over the legacy FPH-1 because it has a lower clock cycle count, better acceleration and a smaller area. In addition, the FPH-2 component supports the FPH-1 operations and uses a rounding scheme that is not an IEEE 754-defined rounding mode [30]. The floating-point operations performed by each custom instruction are listed in Table 2.

System Design of the SHA3-256 Core
FPH-2 is supported by the Nios II architecture and makes low cycle count implementations possible. The most common floating-point custom instructions are addition, subtraction, multiplication, division, square root, integer-to-float conversion and float-to-integer conversion. The proposed SHA3-256 system is depicted in Figure 3. The signal from the control unit enables the counter. The Keccak round, including the round constant (RC), is described in the following subsection.
Figure 3: The design of the whole system for the SHA3-256 Core.

The SHA-3 Pipelined Design
SHA-3 has 24 rounds, each of which is made up of five steps: θ, ρ, π, χ and ι, referred to as theta, rho, pi, chi and iota, respectively. In each step, SHA-3 (Keccak) takes the state array and produces an updated state array by applying the corresponding step function. Figure 4 shows the two-stage pipelined design of the Keccak round. A 2-to-1 multiplexer at the start of the round handles the round's feedback. In each round, we apply pipelining by inserting two registers. The first register is placed between the θ and ρ steps in order to cut the critical path nearly in half. The second register is placed at the end of the round, just before the feedback unit. The clock and reset are the control signals of the two registers. The round constant (RC) used in the ι step is produced by the RC generator and is shown in Table 3.
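The round structure described above can be illustrated with a small software reference model. The sketch below is not the VHDL design from this work; it is a plain Python model of the standard Keccak-f[1600] round (θ, ρ, π, χ, ι) with an LFSR-based round-constant generator, and it assumes the standard SHA3-256 parameters (rate of 136 bytes, domain-separation byte 0x06). All function names are illustrative. The comment inside `keccak_round` only marks where the text places the first pipeline register; the software itself is not pipelined.

```python
import hashlib

MASK64 = (1 << 64) - 1
RATE_BYTES = 136  # SHA3-256 rate r = 1088 bits (standard value, assumed here)


def rol64(value, shift):
    """Rotate a 64-bit lane left by `shift` bits."""
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def make_round_constants():
    """Generate the 24 round constants with the 8-bit LFSR of the Keccak spec."""
    constants, lfsr = [], 0x01
    for _ in range(24):
        rc = 0
        for j in range(7):
            if lfsr & 1:
                rc ^= 1 << ((1 << j) - 1)  # set bit position 2^j - 1
            # Step the LFSR (feedback polynomial x^8 + x^6 + x^5 + x^4 + 1).
            lfsr = ((lfsr << 1) ^ (0x71 if lfsr & 0x80 else 0)) & 0xFF
        constants.append(rc)
    return constants


def make_rotation_offsets():
    """Compute the rho rotation offsets for each lane (x, y)."""
    offsets = [[0] * 5 for _ in range(5)]
    x, y = 1, 0
    for t in range(24):
        offsets[x][y] = ((t + 1) * (t + 2) // 2) % 64
        x, y = y, (2 * x + 3 * y) % 5
    return offsets


RC = make_round_constants()
ROT = make_rotation_offsets()


def keccak_round(a, rc):
    """Apply one round to a 5x5 state a[x][y] of 64-bit lanes."""
    # theta: mix each lane with the parities of two neighbouring columns
    c = [a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4] for x in range(5)]
    d = [c[(x - 1) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
    a = [[a[x][y] ^ d[x] for y in range(5)] for x in range(5)]
    # (the text places the first pipeline register here, between theta and rho)
    # rho and pi: rotate every lane, then move it to its new position
    b = [[0] * 5 for _ in range(5)]
    for x in range(5):
        for y in range(5):
            b[y][(2 * x + 3 * y) % 5] = rol64(a[x][y], ROT[x][y])
    # chi: non-linear step along each row
    a = [[b[x][y] ^ (~b[(x + 1) % 5][y] & b[(x + 2) % 5][y]) for y in range(5)]
         for x in range(5)]
    # iota: add the round constant to lane (0, 0)
    a[0][0] ^= rc
    return a


def sha3_256(message):
    """Software SHA3-256 built from the round model above."""
    padded = bytearray(message) + b"\x06"
    padded += b"\x00" * (-len(padded) % RATE_BYTES)
    padded[-1] |= 0x80
    state = [[0] * 5 for _ in range(5)]
    for offset in range(0, len(padded), RATE_BYTES):
        for i in range(RATE_BYTES // 8):
            start = offset + 8 * i
            state[i % 5][i // 5] ^= int.from_bytes(padded[start:start + 8], "little")
        for rc in RC:
            state = keccak_round(state, rc)
    return b"".join(state[x][0].to_bytes(8, "little") for x in range(4))


def matches_hashlib(message):
    """Cross-check the model against Python's built-in SHA3-256."""
    return sha3_256(message) == hashlib.sha3_256(message).digest()
```

Calling `matches_hashlib(image_bytes)` on the 65,536 bytes of a 256 × 256 grayscale image is one way to obtain a golden reference digest against which a hardware simulation could be compared.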

System Integration
The original grayscale image has a resolution of 256 × 256 pixels. The input block is stored in SDRAM memory and is then fed into the SHA-3 core as input. The output block of the SHA-3 core is saved back to SDRAM memory. All components were implemented in VHDL. Using a variety of test benches, we checked each VHDL file to ensure its correctness and functionality. All tests on each VHDL file were run in the ModelSim 10.6d simulator, using the valid test vectors provided by NIST for the SHA-3 algorithm in [31].
In addition, we used ModelSim 10.6d to simulate the top module using the valid example inputs for the SHA-3 algorithm provided by NIST in [32]. After the simulation results in ModelSim 10.6d had been verified as correct, we moved on to the design of the Nios II system.
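The verification flow above compares simulation output with known-answer data. As a hedged illustration of the software side of such a check, the sketch below uses Python's standard `hashlib` to confirm two widely published SHA3-256 digests (the empty message and "abc") and to count how many rate-sized blocks a 256 × 256 8-bit image occupies. The helper names and the choice of these two vectors are assumptions for illustration; they are not taken from the test files cited in [31] and [32].

```python
import hashlib

RATE_BYTES = 136  # SHA3-256 absorbs 1088 bits (136 bytes) per block

# Well-known SHA3-256 digests used here as known-answer checks.
KNOWN_ANSWERS = {
    b"": "a7ffc6f8bf1ed76651c14756a061d662f580ff4de43b49fa82d80a4b80f8434a",
    b"abc": "3a985da74fe225b2045c172d6bd390bd855f086e3e9d525b46bfe24511431532",
}


def check_known_answers():
    """Return True if every known-answer vector reproduces its digest."""
    return all(hashlib.sha3_256(msg).hexdigest() == digest
               for msg, digest in KNOWN_ANSWERS.items())


def absorbed_blocks(message_length_bytes):
    """Number of rate-sized blocks after padding (padding adds at least 1 byte)."""
    return message_length_bytes // RATE_BYTES + 1


def reference_digest(image_bytes):
    """Golden digest that a simulated or hardware core should reproduce."""
    return hashlib.sha3_256(image_bytes).hexdigest()
```

For a 256 × 256 grayscale image (65,536 bytes), `absorbed_blocks(256 * 256)` gives 482, so the core processes 482 blocks of 24 rounds each to hash one image.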
Table 3 (excerpt): Round constants (RC) used in the ι step.
RC 9   0x0000000000000088    RC 21  0x8000000000008080
RC 10  0x0000000080008009    RC 22  0x0000000080000001
RC 11  0x000000008000000A    RC 23  0x8000000080008008

Platform Designer was used to create the Nios II processor system. We used the Nios II fast soft-core, which offers high performance and maximises the fMAX of the processor core. The components implemented in the Nios II system include the clock, on-chip RAM, SDRAM controller, performance counter, PLL, System ID peripheral, JTAG-UART and the custom SHA3-256 component. The on-chip RAM serves as the working memory of the Nios II CPU. As shown in Figure 5, all data are sent from the Nios II to the SHA-3 core via the Avalon Switch Fabric. Figure 6 shows the complete structure of the architecture that we built around the Nios II soft-core.

Experimental Results
The tests were carried out on the Arria 10 GX FPGA. We designed two architectures: a two-stage pipelined design, and a novel two-stage pipelined design with the FPH-2 component. Figure 7 (b) shows the histograms of the classical test images ("Lena", "Camera man" and "Pepper") in Figure 7 (a). In each histogram, the horizontal axis denotes the gray level and the vertical axis denotes the number of pixels at each gray level. After processing by the SHA3-256 (Keccak) algorithm, the histogram of the cipher-image is uniform and clearly different from that of the plain-image, as shown in Figure 7 (d).

Entropy Analysis
The entropy of an image is a statistical metric of how random a coded image is. It also describes the average information content of an image source. The entropy E(X) is calculated as in (1), where X represents the test image, x_i denotes a gray value in X, and Pr(x_i) is the probability that X = x_i.

$$E(X) = -\sum_{i} \Pr(x_i)\,\log_2 \Pr(x_i) \qquad (1)$$

The entropy of a large number of hashed images was calculated. The results are presented in Table 4, which shows that the entropies of the hashed images are extremely close to 8. For an image with 256 gray levels, the maximum entropy is log_2(256) = 8. As a result, the suggested image hashing approach has a high resistance against entropy attacks.
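The entropy metric can be illustrated with a short sketch. It computes the Shannon entropy, that is, the negative sum of Pr(x_i) · log2 Pr(x_i) over the gray levels that occur, from a flat sequence of 8-bit pixel values. The function name and the use of a flat pixel sequence are assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy in bits per pixel of an iterable of gray values."""
    counts = Counter(pixels)
    total = sum(counts.values())
    # Pr(x_i) is estimated as the relative frequency of gray level x_i.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# An image in which all 256 gray levels are equally frequent reaches the
# maximum of log2(256) = 8 bits, while a constant image has entropy 0.
uniform_image = bytes(range(256)) * 256   # 256 x 256 pixels, flat histogram
constant_image = bytes(256 * 256)         # every pixel is 0
```

Here `shannon_entropy(uniform_image)` evaluates to 8.0 and `shannon_entropy(constant_image)` to 0, which brackets the range in which the values of Table 4 are interpreted.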

Correlation Analysis
A strong correlation between neighboring pixels is one of the most important properties of a natural image. For the design to be considered secure and effective, there must be no correlation between neighboring pixels in an encrypted image. The correlation coefficient is given in (2), where x_i and y_i are a pair of neighboring pixels that are horizontally, vertically or diagonally adjacent, and M is the total number of neighboring pixel pairs.

$$r_{xy} = \frac{\frac{1}{M}\sum_{i=1}^{M}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{M}\sum_{i=1}^{M}(x_i-\bar{x})^2}\;\sqrt{\frac{1}{M}\sum_{i=1}^{M}(y_i-\bar{y})^2}}, \qquad \bar{x}=\frac{1}{M}\sum_{i=1}^{M}x_i,\quad \bar{y}=\frac{1}{M}\sum_{i=1}^{M}y_i \qquad (2)$$

Table 5 shows the correlation coefficients in the three orientations, demonstrating that the correlation coefficients of the encrypted image are very close to 0. As a result, the suggested model is resistant to statistical attacks.
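As a sketch of how such coefficients can be computed, the code below collects adjacent pixel pairs in a chosen direction and evaluates the standard Pearson correlation coefficient over them (covariance divided by the product of the standard deviations). The image representation (a list of rows) and the function names are assumptions for illustration; the original measurement setup is not specified beyond the three orientations.

```python
import math
import random


def adjacent_pairs(image, dx, dy):
    """Pairs (pixel, neighbour) offset by dx columns and dy rows (dx, dy >= 0)."""
    height, width = len(image), len(image[0])
    return [(image[y][x], image[y + dy][x + dx])
            for y in range(height - dy) for x in range(width - dx)]


def correlation(pairs):
    """Pearson correlation coefficient of a list of (x_i, y_i) pairs.

    Raises ZeroDivisionError if either component is constant.
    """
    m = len(pairs)
    mean_x = sum(x for x, _ in pairs) / m
    mean_y = sum(y for _, y in pairs) / m
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / m
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / m
    var_y = sum((y - mean_y) ** 2 for _, y in pairs) / m
    return cov / math.sqrt(var_x * var_y)


# Direction offsets (dx, dy) for the three orientations named in the text.
DIRECTIONS = {"horizontal": (1, 0), "vertical": (0, 1), "diagonal": (1, 1)}

# A smooth ramp is strongly correlated; pseudo-random noise is not.
ramp = [[x + y for x in range(16)] for y in range(16)]
rng = random.Random(1)
noise = [[rng.randrange(256) for _ in range(64)] for _ in range(64)]
```

For the ramp, `correlation(adjacent_pairs(ramp, *DIRECTIONS["horizontal"]))` is 1 up to rounding, whereas the same call on `noise` gives a value near 0, which is the behaviour expected of a well-scrambled image.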

NPCR and UACI Metrics Analysis
We use the Number of Pixel Change Rate (NPCR) and the Unified Average Changing Intensity (UACI) to measure the effect of changing one pixel in the plain image on the hashed image [33]. The NPCR measures the percentage of pixels that differ between the two images, and the UACI measures the average intensity difference between them. The NPCR is computed using (3), where D is a binary array of the same size as the original image and the hashed image, and M × N is the size of the image.

$$\mathrm{NPCR} = \frac{\sum_{i,j} D(i,j)}{M \times N} \times 100\%, \qquad D(i,j) = \begin{cases} 0, & C_1(i,j) = C_2(i,j) \\ 1, & \text{otherwise} \end{cases} \qquad (3)$$

The UACI is calculated using (4), where C_1 denotes the original image, C_2 is the hashed image, and M × N is the size of the image in pixels.

$$\mathrm{UACI} = \frac{1}{M \times N} \sum_{i,j} \frac{|C_1(i,j) - C_2(i,j)|}{255} \times 100\% \qquad (4)$$

The NPCR and UACI results are shown in Table 6. The high NPCR and UACI values imply that the hashing is secure and resistant to differential attacks.
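The two metrics can be sketched in a few lines. The code below uses the common definitions: NPCR is the percentage of pixel positions whose values differ, and UACI is the mean absolute intensity difference normalised by the maximum gray level 255, also as a percentage. Images are passed as flat sequences of 8-bit values; this representation and the function names are illustrative assumptions.

```python
def npcr(c1, c2):
    """Percentage of positions at which the two images differ."""
    if len(c1) != len(c2):
        raise ValueError("images must have the same number of pixels")
    changed = sum(1 for a, b in zip(c1, c2) if a != b)
    return 100.0 * changed / len(c1)


def uaci(c1, c2, max_level=255):
    """Mean absolute intensity difference as a percentage of max_level."""
    if len(c1) != len(c2):
        raise ValueError("images must have the same number of pixels")
    total = sum(abs(a - b) for a, b in zip(c1, c2))
    return 100.0 * total / (max_level * len(c1))


# Tiny worked example: one of four pixels changes by 51 gray levels.
original = [0, 0, 0, 0]
modified = [0, 51, 0, 0]
example_npcr = npcr(original, modified)   # 1 of 4 pixels differs
example_uaci = uaci(original, modified)   # 51 / (255 * 4) as a percentage
```

In this example the NPCR is 25% and the UACI is 5%. For two independent, uniformly random 8-bit images, the expected values are about 99.61% for NPCR and about 33.46% for UACI, which is the usual yardstick for reading results such as those in Table 6.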

Throughput and Efficiency Metrics
The throughput (TH) is computed using (5), where Number of bits is the bitrate r, Frequency is the maximum frequency reported by the tool, and Number of clock cycles denotes the latency of the circuit. The clock cycles represent the number of iterations of the five functions θ, ρ, π, χ and ι needed to generate the hash value.

$$TH = \frac{\text{Number of bits} \times \text{Frequency}}{\text{Number of clock cycles}} \qquad (5)$$

The efficiency (EF) is computed using (6).

$$EF = \frac{TH}{\text{Area}} \qquad (6)$$
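A short worked example may help to relate these two metrics to the figures quoted for related work. The sketch below computes throughput as bits per block times frequency divided by clock cycles, and efficiency as throughput divided by area in slices. It assumes the standard SHA3-256 bitrate r = 1088 bits. The cycle count of 14 used for the design in [7] is an assumption: it is not stated in the text, but it reproduces the throughput reported for that design at 458 MHz.

```python
SHA3_256_RATE_BITS = 1088  # standard bitrate r for SHA3-256 (assumed here)


def throughput_gbps(bits_per_block, frequency_mhz, clock_cycles):
    """Throughput in Gbps: bits * frequency / clock cycles."""
    return bits_per_block * frequency_mhz / clock_cycles / 1000.0


def efficiency_mbps_per_slice(throughput_in_gbps, slices):
    """Efficiency in Mbps per slice: throughput / area."""
    return throughput_in_gbps * 1000.0 / slices


# Figures quoted for [7]: 458 MHz, 2,984 slices (14 cycles is an assumption).
th_ref7 = throughput_gbps(SHA3_256_RATE_BITS, 458, 14)
ef_ref7 = efficiency_mbps_per_slice(35.593, 2984)

# Figures quoted for [10] and [12]: throughput and area only.
ef_ref10 = efficiency_mbps_per_slice(16.236, 1446)
ef_ref12 = efficiency_mbps_per_slice(12.817, 1272)
```

With these inputs, `th_ref7` is about 35.59 Gbps and `ef_ref7` about 11.93 Mbps/Slices, in line with the values quoted for [7]; `ef_ref10` and `ef_ref12` come out at about 11.23 and 10.08 Mbps/Slices. The same two functions apply to any design once its frequency, cycle count and area are known.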
The results of our two designs for the SHA3-256 (Keccak) algorithm are shown in Table 7. The number of clock cycles of the five functions is 18 in the two-stage pipelined design and 14 in the two-stage pipelined design with the FPH-2 component.
Since the number of clock cycles is reduced and the maximum clock frequency increases, the proposed two-stage pipelined architecture with FPH-2 provides the highest efficiency and throughput. Table 8 presents the comparison with other similar architectures, taking into account their best implementation in terms of throughput and efficiency for the SHA3-256 (Keccak) algorithm. When the FPH-2 component was used to implement the proposed design, the area increased by 10.30% (slices), but the maximum clock frequency improved by 10.92% and the number of clock cycles improved by 12.85%, resulting in a 14.38% increase in throughput and a 13.95% increase in efficiency.
The works [9,11,12,13,20,21,22] report a smaller area than our implementations, but the frequency they achieve is lower than that of our designs. The work in [15] reaches a higher frequency than ours, but at the cost of a large increase in area. Finally, the works [7,14,18] report a larger area and a lower frequency than we achieved with our architectures. In our architectures, the primary aim was to avoid an excessive growth in area (Slices), so that the throughput (Gbps) and efficiency (Mbps/Slices) are not penalized.

Conclusions and Future Work
This study presents the optimal performance of hashing images with a size of 256 × 256 pixels using the SHA-3 algorithm with the Nios II/f (fast) soft-core processor on the Intel Arria 10 GX FPGA. We chose the SHA-3 algorithm with a 256-bit output length because it provides the best security and performance. Our tests with the proposed two-stage pipelined design and the custom FPH-2 component showed a 14.38% improvement in throughput and a 13.95% gain in efficiency for the SHA-3 algorithm. At the same time, the area increased by 10.30% (slices), the maximum clock frequency improved by 10.92%, and the number of clock cycles improved by 12.85%. The suggested approach combines speed, performance and security to produce the optimum solution for hashing images with a dimension of 256 × 256 pixels.
In the future, we will experiment with image hashing using tree hashing and a simpler design with fewer rounds (12 instead of the 24 in SHA-3).