Lightweight Single Image Super-Resolution by Channel Split Residual Convolution
Table 1.

| Operation | 32×C ct/ms | 64×C ct/ms | 128×C ct/ms |
|---|---|---|---|
| 3×3 conv | 0.090 | 0.271 | 1.105 |
| ReLU | 0.034 | 0.066 | 0.131 |
| Sigmoid | 0.036 | 0.070 | 0.139 |
| Bias_add/add | 0.032 | 0.067 | 0.141 |
| concat | 0.041 | 0.042 | 0.045 |
| mul | 0.049 | 0.095 | 0.188 |
“C” denotes the number of channels and “ct” denotes the computation time cost.
Before designing a compact and lightweight network, we conducted experiments on the time cost of the fundamental operations of a convolutional neural network. These experiments use the TensorFlow framework and the TensorFlow Timeline tool to measure time cost. All computation times in Table 1 are in milliseconds. From Table 1, we find that convolution is the most time-consuming operation. When the number of feature channels in the convolutional layer is halved, the computation time of the convolutional layer is reduced by about 3/4; for the other operations, the reduction is far less pronounced. Therefore, reducing the number of convolutional channels and convolution operations is the most effective way to reduce the parameters and computation of a lightweight network.
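As a reference for how such per-operation timings can be collected, the following is a minimal TensorFlow 1.x sketch using the Timeline tool; the tensor shapes, the profiled operations, and the output file name are illustrative choices rather than the exact setup behind Table 1.

```python
# Minimal TF 1.x sketch: profile per-op time cost with the Timeline tool.
# The 64-channel 3x3 convolution and ReLU below are only illustrative.
import numpy as np
import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.placeholder(tf.float32, [1, 256, 256, 64])
w = tf.Variable(tf.random_normal([3, 3, 64, 64]))
y = tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME'))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(y,
             feed_dict={x: np.random.rand(1, 256, 256, 64).astype('float32')},
             options=run_options, run_metadata=run_metadata)
    # Export a Chrome trace; per-op times (conv, ReLU, add, ...) can be read from it.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())
```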
From the experiments and analyses in Section 3.1, we find that reducing the number of convolutional channels plays an important role in making the whole network lightweight, so we propose a channel split residual learning structure to reduce the number of convolutional channels. In channel split residual learning (Fig. 1(d)), we combine the general residual learning method with the recently popular group convolution, and we retain the practice of EDSR [13] of removing the BN layer. Meanwhile, we adopt group convolution for the second convolution, where the number of groups is equal to the number of input channels. If the number of input channels is 64 and the number of output channels equals the number of input channels, the number of parameters of the residual module in EDSR [13] is 3×3×64×64×2 (Fig. 1(c)), while the number of parameters of the channel split residual structure is 3×3×64×64 + 3×3×1×64 + 1×1×64 (Fig. 1(d)), which reduces the parameters to 50.78% of the original. Assuming the input feature map has size H×W, the floating point operations (FLOPs) of the channel split residual structure are likewise about 50.82% of those of the EDSR residual module.
If the input and output feature maps are denoted by X_i and X_{i+1}, respectively, the residual structure proposed by EDSR can be expressed by formula (1):

X_{i+1} = f_conv(ReLU(f_conv(X_i))) + X_i.    (1)
Then the channel split residual structure can be described by formula (2):

X_{i+1} = f_G-conv(ReLU(f_conv(X_i))) + X_i,    (2)
where ReLU represents the activation function and f_conv is an ordinary convolution operation with a 3×3 kernel. f_G-conv denotes the group convolution operation, and G is the number of groups; in CSRNet, G is equal to the number of input channels. To maintain the data flow between the channels, we retain the ordinary element-wise addition of the residual connection. Compared with channel shuffling [22] and the identity skipping method in the Ghost module [23], our method is more straightforward and adds no extra computation, which is an advantage inherited from general residual learning.
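The structure described by formula (2) can be sketched in PyTorch roughly as follows; the class name and default channel count are illustrative, and the small 1×1×64 term in the parameter count above is interpreted here as the per-channel bias of the group convolution, which is an assumption.

```python
import torch
import torch.nn as nn

class ChannelSplitResBlock(nn.Module):
    """Sketch of a channel split residual block: conv3x3 -> ReLU -> group conv3x3
    with groups == channels, plus the ordinary identity (residual) addition."""
    def __init__(self, channels=64):
        super().__init__()
        # Ordinary 3x3 convolution: 3*3*channels*channels weights.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # Group convolution with groups == channels: 3*3*1*channels weights
        # (plus a per-channel bias, assumed to be the 1x1x64-sized term).
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                               groups=channels)

    def forward(self, x):
        # The residual addition keeps information flowing across channels.
        return x + self.conv2(self.relu(self.conv1(x)))

# Example: a 64-channel feature map passes through the block with its shape unchanged.
if __name__ == "__main__":
    block = ChannelSplitResBlock(64)
    print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```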
There are two common types of upsampling networks: subpixel convolution and deconvolution. Subpixel convolution, proposed by ESPCN [2], maps the feature map directly to the HR image, which reduces the computation and memory consumption of deconvolution. In subpixel convolution, the extraction of HR features is key to the quality of the reconstructed image, since the final values of the HR feature map correspond one-to-one to the reconstructed HR image pixels. This is different from other high-level vision tasks, which have far fewer prediction points than SISR. To reduce the pressure on feature extraction in the network, we adopt a double-upsampling network structure, which adds a simple auxiliary upsampling branch to the general upsampling network. In Fig. 3(c), the deep feature extraction network in the red box is regarded as the residual network, marked as Res_Net, and the shallow network is regarded as the main network structure, marked as Main_Net. If X_input denotes the input LR image, Y_re the reconstructed HR image, and Y_H the real HR image, then the feature map X_res_map extracted by Res_Net during HR image reconstruction can be represented as:

X_res_map = f_ex-res(X_input).    (3)
The Main_Net feature map X_main_map is represented as:

X_main_map = f_ex-main(X_input),    (4)
where f_ex-res and f_ex-main represent the Res_Net and Main_Net feature extraction networks, respectively; the Res_Net feature extraction network is composed of multiple channel split residual structures. The whole reconstruction process is represented as:

Y_re = f_main_up(X_main_map) + f_res_up(X_res_map),    (5)
where f_res_up and f_main_up represent the upsampling networks of Res_Net and Main_Net, respectively; both upsampling networks adopt the subpixel convolution method.
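The double-upsampling structure of formulas (3)–(5) might be sketched in PyTorch as follows. The module names, the first 3-to-64-channel convolution of each branch, and the interleaved ReLUs in Main_Net are assumptions of this sketch rather than the exact CSRNet implementation; ChannelSplitResBlock refers to the sketch given earlier.

```python
import torch
import torch.nn as nn

def subpixel_up(channels, scale):
    """Subpixel (ESPCN-style) upsampling: conv to 3*scale^2 channels, then PixelShuffle."""
    return nn.Sequential(
        nn.Conv2d(channels, 3 * scale * scale, kernel_size=3, padding=1),
        nn.PixelShuffle(scale),
    )

class DoubleUpsamplingSR(nn.Module):
    """Sketch of the double-upsampling structure: Res_Net (deep branch) and
    Main_Net (shallow branch), each followed by its own subpixel upsampler."""
    def __init__(self, channels=64, scale=2, num_res_blocks=10):
        super().__init__()
        # f_ex-res: deep feature extraction built from channel split residual blocks.
        self.ex_res = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, padding=1),
            *[ChannelSplitResBlock(channels) for _ in range(num_res_blocks)],
        )
        # f_ex-main: shallow feature extraction (5 plain convolutions in the paper).
        main_layers = [nn.Conv2d(3, channels, kernel_size=3, padding=1)]
        for _ in range(4):
            main_layers += [nn.ReLU(inplace=True),
                            nn.Conv2d(channels, channels, kernel_size=3, padding=1)]
        self.ex_main = nn.Sequential(*main_layers)
        self.res_up = subpixel_up(channels, scale)    # f_res_up
        self.main_up = subpixel_up(channels, scale)   # f_main_up

    def forward(self, x_input):
        x_res_map = self.ex_res(x_input)        # formula (3)
        x_main_map = self.ex_main(x_input)      # formula (4)
        y_main = self.main_up(x_main_map)       # Main_Net HR reconstruction
        y_re = y_main + self.res_up(x_res_map)  # formula (5): final HR output
        return y_re, y_main
```

Returning both Y_re and the Main_Net reconstruction makes it straightforward to compute the two loss terms described next.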
To make the network easy to train, we propose a loss function for the double-upsampling network, described as follows:

loss_all = loss_s + loss_o.    (6)
From formula (6), we can see that the loss function of the double-upsampling network consists of two components: loss_s and loss_o. loss_s represents the loss between the HR image reconstructed by Main_Net and the real HR image Y_H, and loss_o represents the loss between the final reconstructed HR image Y_re and the real HR image Y_H. To reduce the difference between the HR image reconstructed by Main_Net and the real image, the reconstruction loss of Main_Net is added to the final loss, which also reduces the learning pressure on the residual learning upsampling branch.
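A short sketch of this combined loss is given below, assuming an unweighted sum and an L1 pixel loss; the paper does not state the pixel-wise loss type, so both are assumptions.

```python
import torch.nn.functional as F

def loss_all(y_re, y_main, y_h):
    """loss_all = loss_s + loss_o as in formula (6).
    loss_s: Main_Net reconstruction vs. ground truth; loss_o: final output vs. ground truth.
    The L1 pixel loss and the unweighted sum are assumptions of this sketch."""
    loss_s = F.l1_loss(y_main, y_h)
    loss_o = F.l1_loss(y_re, y_h)
    return loss_s + loss_o
```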
The existing evaluation metrics for lightweight networks include the number of parameters and FLOPs, both of which are objective metrics. However, they reflect the actual inference speed of a neural network only inaccurately because of the GPU's internal acceleration and optimization mechanisms: the percentage reduction in parameters and computation does not translate into the same percentage increase in actual computation speed. In Table 2, we perform three sets of comparison experiments in terms of convolution structure, number of convolutional channels, and network depth. 10_res_block_64_3×3 is our baseline control group: res_block represents a residual block, 10 stands for the number of residual blocks, 64 for the number of convolutional channels, and 3×3 for the size of the convolutional kernel. 10_res_block_64_3×1+1×3 indicates that the 3×3 convolution is replaced by 3×1 and 1×3 convolutions. In Table 2, each experiment is tested three times on a TITAN X GPU using the TensorFlow framework with an image size of 256×256. From Table 2, we can find that:
1. Compared with a 3×3 convolution, the 3×1 and 1×3 convolution group requires more computation time. In compact network design, replacing a 3×3 convolution with 3×1 and 1×3 convolutions does not reduce the computation time.
2. Theoretically, if the depth of the network is 1/2 of the original, the time should be 1/2 of the original, but in the experiment the time is about 4/7 of the original.
3. Theoretically, if the number of channels is 1/2 of the original, the time should be 1/4 of the original, but in the experiment the time is about 1/3 of the original.
4. The number of channels has the greatest influence on the overall network time, while the size of the convolutional kernel has little influence.
Table 2.
| Model | test1 frame rate (FPS) | test1 T_100_frame (s) | test2 frame rate (FPS) | test2 T_100_frame (s) | test3 frame rate (FPS) | test3 T_100_frame (s) |
|---|---|---|---|---|---|---|
| 10_res_block_64_3×3 | 68.50 | 1.3800 | 69.83 | 1.4300 | 69.94 | 1.4300 |
| 10_res_block_64_3×1+1×3 | 60.23 | 1.6600 | 64.97 | 1.5300 | 59.18 | 1.6800 |
| 5_res_block_64_3×3 | 116.31 | 0.8126 | 110.56 | 0.8635 | 117.88 | 0.8030 |
| 10_res_block_128_3×3 | 29.00 | 3.3500 | 28.99 | 3.6632 | 28.58 | 3.6799 |
Therefore, we propose a new method to test speed, which requires a basic model as a baseline. As shown in Table 2, our baseline is 10_res_block_64_3×3. The time cost in the test includes only the time of the feature extraction network. To minimize statistical bias, we performed three tests for each experiment. In the tests, we computed the time cost over 105 frames rather than 100 frames, because the computation time of the first 5 frames is unstable in TensorFlow: the GPU needs some extra time to prepare at the beginning of the computation. The time of the remaining 100 frames is denoted T_100_frame, and the frame rate is computed as:

100_FPS = 100 / T_100_frame.
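A sketch of this 100-frame timing protocol (run 105 frames, discard the first 5 warm-up frames, and report 100 / T_100_frame) is shown below. It is written with PyTorch for convenience, although the paper's timing experiments used TensorFlow; the function name and the random input are illustrative.

```python
import time
import torch

@torch.no_grad()
def measure_100_fps(model, image_size=256, device="cuda"):
    """Run 105 forward passes, discard the first 5 warm-up frames, and return
    (T_100_frame in seconds, frame rate over the remaining 100 frames)."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, image_size, image_size, device=device)
    frame_times = []
    for _ in range(105):
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        frame_times.append(time.time() - start)
    t_100_frame = sum(frame_times[5:])  # drop the unstable first 5 frames
    return t_100_frame, 100.0 / t_100_frame
```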
In our experiments, we used the DIV2K dataset [24] to train the model, which contains 800 high-quality training images. We first randomly cropped 256×256 patches as the HR images and then generated the LR images from the HR images by bicubic interpolation with different downsampling factors. To increase the amount of training data, several data augmentation operations were performed on these images, including random horizontal flips and rotations. We tested on four datasets: Set5 [25], Set14 [26], BSD100 [27], and Urban100 [28]. Set5 and Set14 are commonly used test benchmarks. BSD100 is composed of natural images from the segmentation dataset proposed by Berkeley Lab. The urban images provided by Huang et al. [28] are particularly interesting because they contain many images that are challenging for existing methods. These four datasets can verify the effectiveness of the model. Two types of evaluation metrics are used: image quality metrics and lightweight network metrics. The image quality metrics are PSNR and SSIM. The lightweight network metrics are the number of parameters, Multi_Adds, and our proposed 100-frame frame rate (100_FPS).
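The HR/LR pair preparation described above (random 256×256 HR crops, bicubic downsampling, random flips and rotations) might be sketched as follows; the helper name and the use of PIL are illustrative choices, not the paper's code.

```python
import random
from PIL import Image

def make_training_pair(hr_image: Image.Image, crop_size=256, scale=2):
    """Randomly crop an HR patch, augment it, and create the bicubic LR counterpart."""
    # Random 256x256 HR crop.
    left = random.randint(0, hr_image.width - crop_size)
    top = random.randint(0, hr_image.height - crop_size)
    hr = hr_image.crop((left, top, left + crop_size, top + crop_size))
    # Augmentation: random horizontal flip and random 90-degree rotation.
    if random.random() < 0.5:
        hr = hr.transpose(Image.FLIP_LEFT_RIGHT)
    hr = hr.rotate(90 * random.randint(0, 3))
    # Bicubic downsampling to obtain the LR input.
    lr = hr.resize((crop_size // scale, crop_size // scale), Image.BICUBIC)
    return lr, hr
```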
In our network, the convolution kernels are set to 3×3. The padding of the convolution kernels is set to 1 to keep the sizes of the input and output feature maps consistent. Res_Net uses a residual structure with 10 channel split residual blocks. In Res_Net, the first convolution of each block uses 64 convolution kernels of size 3×3×64, and the second (group) convolution uses 64 kernels of size 3×3×1. In Main_Net, we use only a 5-layer general convolution as the feature extraction network. In both Main_Net and Res_Net, we use the subpixel approach to generate high-resolution images. In training, we use the Adam optimizer [29] to minimize the loss function loss_all. The initial learning rate is set to 0.001 and is decreased by a factor of 10 after 3×10^4 iterations. We implemented the proposed network in the PyTorch framework and trained it on an NVIDIA Titan X GPU. The entire model was trained in less than 1 day.
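A minimal training-loop sketch of this configuration (Adam, initial learning rate 0.001, divided by 10 after 3×10^4 iterations) is given below, assuming the scheduler is stepped once per iteration. DoubleUpsamplingSR and loss_all refer to the earlier sketches; the batch tensors, batch size, and total iteration count are placeholders, not values stated in the paper.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = DoubleUpsamplingSR(channels=64, scale=2).to(device)  # sketch from above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Learning rate divided by 10 after 3x10^4 iterations, stepped once per iteration (assumed).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30000], gamma=0.1)

for iteration in range(60000):  # total iteration count is not stated in the paper
    # Stand-in batches; in practice these come from DIV2K HR/LR patch pairs.
    lr_img = torch.randn(16, 3, 128, 128, device=device)
    hr_img = torch.randn(16, 3, 256, 256, device=device)
    y_re, y_main = model(lr_img)
    loss = loss_all(y_re, y_main, hr_img)  # loss sketch from above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```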
To verify the validity of each part of the proposed method, we conducted three groups of comparison experiments. In the ablation experiments, the baseline model is the residual learning structure proposed in EDSR. As can be seen from Table 3, 10_res_block is the baseline model, and 10_res_split is our proposed channel split residual. Comparing the first row with the second row, we find that the parameters of the channel split residual are reduced by about 50% compared with the residual structure of EDSR, and the Multi_Adds are reduced by about 75%. With this large reduction in parameters and computation, the frame rate is increased by about 90%. The large reduction in parameters and computation also introduces some quality loss in the reconstructed images. In addition to testing the performance of the channel split residual structure, we also tested the performance of the double-upsampling network. The 10_res_split and 10_res_split_double rows in the table form a comparison group: "double" denotes the double-upsampling network structure, and the unmarked one is the single-upsampling network. From these experiments, it can be found that the double-upsampling network adds only a few parameters over the single-upsampling network because Main_Net is simple, while the PSNR and SSIM are improved by 0.08 dB and 0.0097, respectively. In terms of speed, the double-upsampling network does not increase the running time: before the deep feature extraction network in Res_Net has finished, the shallow feature extraction network Main_Net has already finished running and is waiting for the results of Res_Net. We also compared the loss functions proposed for double-upsampling: 10_res_split_double_Lall indicates the use of the loss function loss_all, and 10_res_split_double indicates the use of the loss function loss_o. Comparing this group of experiments reveals that the loss function loss_all helps to improve the quality of the reconstructed images for the same network structure.
Table 3.
| | Set5 PSNR | Set5 SSIM | Parameters | Multi_Adds | 100_FPS |
|---|---|---|---|---|---|
| 10_res_block | 32.50 | 0.8973 | 2622k | 4292.5G | 62 fps |
| 10_res_split | 32.21 | 0.8577 | 1277k | 929.3G | 118 fps |
| 10_res_split_double | 32.29 | 0.8674 | 1297k | 1092.5G | 118 fps |
| 10_res_split_double_Lall | 32.34 | 0.8792 | 1297k | 1092.5G | 118 fps |
10_res_block is the feature extraction network consisting of 10 residual blocks in Fig. 1(c); "split" stands for the channel split residual structure in Fig. 1(d); "double" represents the double-upsampling network structure in Fig. 3(c); "Lall" indicates the use of our proposed loss function for the double-upsampling network.
In Table 4, the proposed CSRNet is compared with some general SISR methods, such as SRCNN [1], DRRN [5], and EDSR [13], as well as with lightweight image super-resolution methods. Table 4 shows PSNR and SSIM for all methods tested on three datasets. In contrast to the earlier lightweight network DRRN [5], which has only 22 layers, the proposed CSRNet has 40 layers; therefore, CSRNet has more parameters than DRRN. When tested on Set5, CSRNet achieves 0.46 dB and 0.005 more than DRRN in PSNR and SSIM, respectively. Meanwhile, CSRNet is slightly faster than DRRN in terms of computational speed. We also compared it with EDSR, on which the proposed channel split residual learning is based. The number of parameters and the computation of CSRNet are about 50% of those of EDSR, and the frame rate improves significantly with little loss in reconstruction quality. In Table 4, we also compare with the lightweight image super-resolution networks FALSR [9], IDN [7], and CARN [30]. These three methods are similar in parameters and computation, but IDN performs better than the other two in the quality of the reconstructed images. Compared with the other two methods, the PSNR and SSIM of our method are improved by about 0.2 dB and 0.02, respectively. Although CSRNet has more parameters and computation than CARN, and its computational speed is therefore naturally inferior, CSRNet achieves 0.42 dB higher PSNR than CARN. Compared with the recent MADNet [31], the proposed CSRNet is inferior in some evaluation metrics, which indicates that there is still room for improvement, and we will continue to work on lightweight SISR. By considering the frame rate and PSNR together (Fig. 4), it can be seen that although CSRNet is inferior to EDSR in performance, it is much faster than EDSR. Our method is not as fast as AWSR [32], FALSR, and CARN, but it is far superior to these three methods in terms of performance.
Table 4.
| Scale | Methods | Parameters | Mult_Adds | 100_FPS | Set5 (PSNR/SSIM) | Set14 (PSNR/SSIM) | BSD100 (PSNR/SSIM) |
|---|---|---|---|---|---|---|---|
| ×2 | SRCNN [1] | 57k | 52.47G | 213 fps | 36.66/0.9542 | 32.42/0.9603 | 31.36/0.8879 |
| ×2 | DRRN [5] | 297k | 6796.9G | 30 fps | 37.74/0.9591 | 33.23/0.9136 | 32.05/0.8973 |
| ×2 | EDSR [13] | 2662k | 4292.5G | 32 fps | 37.78/0.9597 | 32.28/0.9142 | 32.05/0.8973 |
| ×2 | AWSR [32] | 1397k | 320.5G | 56 fps | 38.11/0.9608 | 33.78/0.9189 | 32.49/0.9316 |
| ×2 | FALSR [9] | 326k | 74.7G | 76 fps | 37.82/0.9590 | 33.55/0.9168 | 32.10/0.8987 |
| ×2 | CARN [30] | 412k | 46.1G | 74 fps | 37.53/0.9583 | 33.26/0.9141 | 31.92/0.8960 |
| ×2 | IDN [7] | 590k | 81.87G | 64 fps | 37.83/0.9600 | 33.30/0.9148 | 32.08/0.8985 |
| ×2 | MADNet [31] | 1002k | 51.4G | 61 fps | 37.85/0.9600 | 33.38/0.9161 | 32.04/0.8979 |
| ×2 | Ours | 1297k | 1092.5G | 60 fps | 37.80/0.9600 | 32.29/0.9138 | 32.07/0.8976 |
| ×3 | SRCNN [1] | 57k | 52.47G | 213 fps | 32.75/0.9090 | 29.28/0.8209 | 28.41/0.7863 |
| ×3 | DRRN [5] | 297k | 6796.9G | 30 fps | 34.03/0.9244 | 29.96/0.8347 | 28.95/0.8004 |
| ×3 | EDSR [13] | 2662k | 4292.5G | 32 fps | 34.09/0.9248 | 30.00/0.8350 | 28.96/0.8001 |
| ×3 | AWSR [32] | 1476k | 150.6G | 56 fps | 34.52/0.9281 | 30.38/0.8426 | 29.16/0.8069 |
| ×3 | CARN [30] | 412k | 46.1G | 74 fps | 33.99/0.9236 | 30.08/0.8367 | 28.91/0.8000 |
| ×3 | IDN [7] | 590k | 81.87G | 64 fps | 34.11/0.9253 | 29.99/0.8354 | 28.95/0.8013 |
| ×3 | MADNet [31] | 1002k | 51.4G | 61 fps | 34.16/0.9253 | 30.21/0.8398 | 28.98/0.8323 |
| ×3 | Ours | 1297k | 1092.5G | 60 fps | 34.10/0.9223 | 30.00/0.8332 | 28.97/0.8007 |
| ×4 | SRCNN [1] | 57k | 52.47G | 213 fps | 30.48/0.8628 | 27.49/0.7503 | 24.52/0.7221 |
| ×4 | DRRN [5] | 297k | 6796.9G | 30 fps | 31.68/0.8888 | 28.31/0.7720 | 27.38/0.7284 |
| ×4 | EDSR [13] | 2662k | 4292.5G | 32 fps | 32.50/0.8973 | 28.72/0.7851 | 27.72/0.7418 |
| ×4 | AWSR [32] | 1587k | 91.1G | 56 fps | 32.27/0.8960 | 28.69/0.7843 | 27.63/0.7385 |
| ×4 | CARN [30] | 412k | 46.1G | 74 fps | 31.92/0.8903 | 28.42/0.7762 | 25.62/0.7694 |
| ×4 | IDN [7] | 590k | 81.87G | 64 fps | 31.82/0.8903 | 28.25/0.7730 | 27.42/0.7297 |
| ×4 | MADNet [31] | 1002k | 51.4G | 61 fps | 31.95/0.8917 | 28.42/0.7762 | 27.44/0.7327 |
| ×4 | Ours | 1297k | 1092.5G | 60 fps | 32.34/0.8674 | 28.49/0.7872 | 27.50/0.7399 |
The columns list the evaluation metrics and test datasets, and the rows list current image super-resolution methods that use lightweight concepts. The results are reported as PSNR/SSIM for the ×2, ×3, and ×4 scales on Set5, Set14, and BSD100.
Fig. 4. Comparison of the methods in terms of frame rate and PSNR.
Fig. 5. Visual comparison on three test images from Urban100.
In addition, compared with the earlier DRRN method, CSRNet is superior in both performance and speed. During testing, we selected three test images from Urban100, shown in Fig. 5. The image on the left is the original image, and the images on the right are the results cropped from the yellow box. From the result images, we find that the reconstructed images of EDSR have the best visual quality, followed by ours. However, the reconstructed images of CSRNet look slightly sharper than those of the other lightweight methods. The CSRNet method has advantages and disadvantages compared with the currently popular methods, but it is an effective and usable method.
SISR research often serves as a preprocessing step for many high-level vision tasks, which place strong requirements on the speed of HR image reconstruction. Therefore, we introduce the idea of manually designed lightweight networks into image super-resolution. To design a lightweight image super-resolution model, we propose the channel split residual learning and double-upsampling structures. Channel split residual learning mainly uses group convolution to reduce the number of convolution parameters and speed up computation. Double-upsampling uses two upsampling branches to widen the upsampling network and maintain the performance of the lightweight network without adding extra computation time. We demonstrate the effectiveness of our method on different datasets and find that it is comparable to the state of the art in terms of the quality of the reconstructed images. Making the network lighter while improving reconstruction quality will be our future work.