## Dongcheul Lee

Name | Equation
---|---
ReLU | [TeX:] $$f(x)=\left\{\begin{array}{ll} 0, & \text{for } x<0 \\ x, & \text{for } x \geq 0 \end{array}\right.$$
ReLU6 | [TeX:] $$f(x)=\left\{\begin{array}{ll} 0, & \text{for } x<0 \\ x, & \text{for } 0 \leq x<6 \\ 6, & \text{for } x \geq 6 \end{array}\right.$$
Leaky ReLU | [TeX:] $$f(x, \alpha)=\left\{\begin{array}{ll} \alpha x, & \text{for } x<0 \\ x, & \text{for } x \geq 0 \end{array}\right.$$
ELU | [TeX:] $$f(x, \alpha)=\left\{\begin{array}{ll} \alpha\left(e^{x}-1\right), & \text{for } x<0 \\ x, & \text{for } x \geq 0 \end{array}\right.$$
SELU | [TeX:] $$f(x, \alpha, \beta)=\left\{\begin{array}{ll} \alpha\beta\left(e^{x}-1\right), & \text{for } x<0 \\ \beta x, & \text{for } x \geq 0 \end{array}\right.$$
Swish | [TeX:] $$f(x)=\frac{x}{1+e^{-x}}$$
CReLU | [TeX:] $$f(x)=(\max(0, x), \max(0,-x))$$

The rectified linear unit (ReLU) solves this problem and is currently the most widely used activation function. It is half-rectified from the bottom, computationally efficient, and non-linear, which lets the network converge quickly while still supporting backpropagation. When the output of ReLU is capped at 6, it is called ReLU6. This encourages the network to learn sparse features earlier and is equivalent to imagining that each ReLU unit consists of only six replicated bias-shifted Bernoulli units rather than an infinite number [7]. However, both ReLU and ReLU6 suffer from the dying ReLU problem: for negative inputs the gradient of the function is zero, so the affected units no longer contribute to backpropagation and stop learning.
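Both variants follow directly from the piecewise definitions in the table above; a minimal NumPy sketch for reference:

```python
import numpy as np

def relu(x):
    # Pass positive inputs through unchanged and zero out negative inputs.
    return np.maximum(0.0, x)

def relu6(x):
    # Same as ReLU, but additionally capped at 6 on the positive side.
    return np.minimum(np.maximum(0.0, x), 6.0)
```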

Leaky ReLU has a small positive slope in the negative region, so it enables backpropagation even for negative input [8]. However, it does not provide consistent predictions for negative input. We used 0.2 for the value of α. The exponential linear unit (ELU) also has a small curved slope in the negative region to solve the dying ReLU problem; it saturates to a negative value for more negative inputs and thereby decreases the forward-propagated variation and information [9]. We used 1.0 for the value of α. Scaled ELU (SELU) extends ELU by inducing self-normalizing properties; it adds a second parameter that scales the positive region as well as the negative one [10]. We used 1.76 for the value of α and 1.05 for β.
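A NumPy sketch of these three variants, using the parameter values stated above (α = 0.2 for Leaky ReLU, α = 1.0 for ELU, α = 1.76 and β = 1.05 for SELU):

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    # A small positive slope alpha in the negative region keeps gradients alive.
    return np.where(x >= 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth exponential curve in the negative region, saturating at -alpha.
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.76, beta=1.05):
    # ELU scaled by beta in both regions; alpha and beta induce self-normalization.
    return np.where(x >= 0, beta * x, alpha * beta * (np.exp(x) - 1.0))
```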

Swish is a newer, self-gated activation function proposed by researchers at Google [11]. According to their paper, it performs better than ReLU at a similar level of computational efficiency: in experiments on ImageNet with identical models, replacing ReLU with Swish improved top-1 classification accuracy by 0.6%–0.9% [11].

Concatenated ReLU (CReLU) concatenates a ReLU that selects only the positive part of the activation with a ReLU that selects only the negative part. As a result, this non-linearity doubles the depth of the activations [12].
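Swish and CReLU complete the set; note that CReLU returns twice as many channels as it receives (a minimal NumPy sketch):

```python
import numpy as np

def swish(x):
    # The input gated by its own sigmoid: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def crelu(x):
    # Concatenate the ReLU of x and the ReLU of -x along the channel axis,
    # doubling the depth of the activation.
    return np.concatenate([np.maximum(0.0, x), np.maximum(0.0, -x)], axis=-1)
```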

We build an RL agent that learns a 2D racing game in the OpenAI Gym to analyze the effect of the activation functions discussed in Section 2.2. The agent uses the ACER algorithm to learn the game with a CNN and an LSTM. Fig. 2 shows the composition of the network. The implementation follows these principles: before the agent feeds input to the network, it converts the color image to grayscale and crops the border to reduce input complexity. The CNN then processes the pixels of the game screen to decide which action to take, while the LSTM remembers the previous state to support that decision. The network consists of three convolutional layers, one fully-connected layer, and one LSTM layer. The convolutional layers use 32, 64, and 64 filters with sizes of 8×8, 4×4, and 3×3 and strides of 4, 2, and 1, respectively. The LSTM uses 512 cells. The activation function under test is applied in each hidden layer.
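A sketch of this network in tf.keras (the paper used TensorFlow 1.10; a current Keras API is shown here). The layer widths, filter sizes, and strides follow the description above; the 84×84 input size, the 4-frame stack, the fully-connected width of 512, and the single-step LSTM unrolling are assumptions for illustration, and `num_actions` would in practice come from `env.action_space.n`:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_network(activation=tf.nn.relu, num_actions=9, input_shape=(84, 84, 4)):
    # Three convolutional layers (32/64/64 filters, 8x8/4x4/3x3 kernels,
    # strides 4/2/1), one fully-connected layer, and a 512-cell LSTM.
    frames = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 8, strides=4, activation=activation)(frames)
    x = layers.Conv2D(64, 4, strides=2, activation=activation)(x)
    x = layers.Conv2D(64, 3, strides=1, activation=activation)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation=activation)(x)
    # A single LSTM step is shown; during training the LSTM carries its
    # state across consecutive frames.
    x = layers.LSTM(512)(layers.Reshape((1, 512))(x))
    policy = layers.Dense(num_actions, activation="softmax", name="policy")(x)
    value = layers.Dense(1, name="value")(x)
    return tf.keras.Model(frames, [policy, value])
```

Swapping `tf.nn.relu` for `tf.nn.relu6`, `tf.nn.leaky_relu`, `tf.nn.elu`, `tf.nn.selu`, `tf.nn.swish`, or `tf.nn.crelu` switches the activation in every hidden layer; CReLU doubles the channel count, and the downstream layer shapes adjust automatically.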

We used Enduro-v4 in the OpenAI Gym as the 2D racing game. The game consists of maneuvering a race car in a long-distance endurance race, and the object is to pass 200 cars each day. The driver must avoid other cars; otherwise, the driver's car is stopped. As the driver passes another car the reward increases, and as other cars pass the driver the reward decreases.
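The environment and the grayscale-and-crop preprocessing described above can be set up as follows. This is a sketch assuming gym 0.13's reset/step API; the exact crop margins are an assumption, since the paper does not list them:

```python
import gym
import numpy as np

env = gym.make("Enduro-v4")        # the 2D racing environment used in the paper
frame = env.reset()                # RGB frame of shape (210, 160, 3)

def preprocess(frame):
    # Convert the colour frame to grayscale and crop the border to reduce
    # input complexity; the crop margins below are illustrative assumptions.
    gray = frame.mean(axis=2).astype(np.uint8)
    return gray[34:194, 8:152]     # 160x144 grayscale image

state = preprocess(frame)
```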

The agent observes pixel data of the game [TeX:] $$s_{t}$$ at time step t, chooses an action [TeX:] $$a_{t}$$ according to a policy [TeX:] $$\pi\left(a_{t} \mid s_{t}\right),$$ and observes a reward [TeX:] $$r_{t}$$ produced by the game. The goal of the agent is to maximize the discounted return [TeX:] $$R_{t}=\sum_{i \geq 0} \gamma^{i} r_{t+i}$$ at time step t, where the discount factor [TeX:] $$\gamma \in[0,1)$$ trades off the importance of immediate versus future rewards. The output layer of the network produces [TeX:] $$\pi\left(a_{t} \mid s_{t}\right)$$ and a value function [TeX:] $$V^{\pi}\left(s_{t}\right)$$. The state-action and state-only value functions are defined as:

[TeX:] $$Q^{\pi}\left(s_{t}, a_{t}\right)=\mathbb{E}\left[R_{t} \mid s_{t}, a_{t}\right], \quad V^{\pi}\left(s_{t}\right)=\mathbb{E}\left[R_{t} \mid s_{t}\right]$$
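For reference, the discounted return can be computed backwards over a finished episode; in the sketch below, γ = 0.99 is only a placeholder (the value actually used is listed in Table 2):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # R_t = sum_{i>=0} gamma^i * r_{t+i}; computed from the end of the
    # episode backwards so each step reuses the following step's return.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```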

The advantage function provides a relative measure of the value of each action and is defined as:

[TeX:] $$A^{\pi}\left(s_{t}, a_{t}\right)=Q^{\pi}\left(s_{t}, a_{t}\right)-V^{\pi}\left(s_{t}\right)$$

We defined a loss function to minimize the total error on the training data; it increases the weights of actions that yielded a positive reward and decreases the weights of actions that yielded a negative reward.
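A minimal form consistent with this description, assuming the standard advantage-weighted policy-gradient surrogate (the full ACER objective additionally uses truncated importance sampling with bias correction and a value-function term), is:

[TeX:] $$L^{\pi}\left(s_{t}\right)=-\log \pi\left(a_{t} \mid s_{t}\right) A^{\pi}\left(s_{t}, a_{t}\right)$$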

The agent runs on an Ubuntu 18.04 machine with an Intel Xeon W-2102 CPU, an Nvidia GeForce GTX 1080 Ti GPU, and 32 GB of RAM. We built the agent in Python 3.6 using TensorFlow 1.10, Keras 2.2, and OpenAI Gym 0.13. The agent trains for [TeX:] $$1 \times 10^{7}$$ timesteps for each activation function. The hyperparameters of the agent are listed in Table 2.

To evaluate the training performance of each activation function, we compared the reward, the value of [TeX:] $$A^{\pi}\left(s_{t}, a_{t}\right),$$ and the value of [TeX:] $$L^{\pi}\left(s_{t}\right)$$ during training. Fig. 3 shows these performance metrics for each activation function over the training timesteps. The plots were smoothed with a smoothing factor of 0.8, with the original data drawn in a dim color. Except for CReLU, every activation function's reward increases as training progresses. Leaky ReLU reached the highest reward and CReLU the lowest at the end of training. In the case of CReLU, the agent appears to have been stuck in local optima at the 4.2M and 9.4M timesteps; reducing the learning rate or incorporating a method of maintaining stochasticity could resolve this. There is little difference in [TeX:] $$A^{\pi}\left(s_{t}, a_{t}\right) \text { and } L^{\pi}\left(s_{t}\right)$$ across activation functions.
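The smoothing referred to above is, under the assumption of TensorBoard-style exponential moving averaging, equivalent to:

```python
import numpy as np

def smooth(values, factor=0.8):
    # Exponential moving average with smoothing factor 0.8: each plotted
    # point blends the previous smoothed value with the new raw value.
    smoothed, last = [], values[0]
    for v in values:
        last = factor * last + (1.0 - factor) * v
        smoothed.append(last)
    return np.array(smoothed)
```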

Fig. 4 shows violin plots of the performance metrics for each activation function during training. Leaky ReLU has high probability density at high rewards, whereas CReLU has high density at low rewards. Most values of [TeX:] $$A^{\pi}\left(s_{t}, a_{t}\right)$$ are densely distributed just below zero, which means most of the actions taken remained stably distributed around the optimal actions. The values of [TeX:] $$L^{\pi}\left(s_{t}\right)$$ are distributed around 20 after the 1.0M timestep, meaning the model still needs improvement to reduce the loss. The [TeX:] $$L^{\pi}\left(s_{t}\right)$$ plot also has greater variance than the [TeX:] $$A^{\pi}\left(s_{t}, a_{t}\right)$$ plot.

After [TeX:] $$1 \times 10^{7}$$ timesteps of training, 100 episodes of the game were played for testing. Table 3 shows the mean and maximum reward for each activation function. ReLU6 achieved the highest mean reward while CReLU achieved the lowest; ReLU6's mean reward is 35.4% higher than CReLU's. ReLU6 also achieved the highest maximum reward while Swish had the lowest; ReLU6's maximum reward is 42.8% higher than Swish's. Even though Leaky ReLU obtained the highest reward in training, ReLU6, which obtained the second-highest reward in training, scored highest in testing due to the stochastic nature of the algorithm.
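The testing procedure amounts to replaying the trained policy for 100 episodes and recording the episode rewards. A sketch, where `agent.act` is a hypothetical interface to the trained network and gym 0.13's step/reset API is assumed:

```python
import numpy as np

def evaluate(env, agent, episodes=100):
    # Run the test episodes and report the mean and maximum total reward,
    # mirroring the values reported in Table 3.
    totals = []
    for _ in range(episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(agent.act(obs))  # agent.act: hypothetical
            total += reward
        totals.append(total)
    return float(np.mean(totals)), float(np.max(totals))
```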

In this paper, we compared the performance of activation functions in an RL agent learning a 2D racing game. We built the agent using the ACER algorithm and a CNN+LSTM model, and tested the activation functions by applying the same function to every hidden layer and switching it between runs. We measured the reward, the output of the advantage function, and the output of the loss function during training and testing. The performance evaluation showed that ReLU6 performs best and CReLU worst in terms of total reward, with a difference of 35.4% between them. This result shows that choosing an activation function matters: the reward can degrade substantially even when the same RL scheme is used in the same environment. Most RL papers use ReLU without explaining the choice; our benchmark shows that results can vary depending on which activation function is used.

As future work, we will evaluate the effect of factors other than the activation function on the RL agent. This will help identify which factors improve performance depending on the type of game or the learning algorithm. We hope to apply these results to real-world applications such as self-driving cars.

He has been an associate professor in the Department of Multimedia at Hannam University, Korea, since 2012. He received his B.S. and M.S. degrees in Computer Science and Engineering from POSTECH, Korea, in 2002 and 2004, respectively, and his Ph.D. degree in Electronics and Computer Engineering from Hanyang University, Korea, in 2012. He was a senior researcher at the KT Mobile Network Laboratory, Korea, from 2004 to 2012. His main research interests are communication in mobile applications, rich communication services, software frameworks for mobile games, and machine learning in games.

1. A. Jeerige, D. Bein, and A. Verma, "Comparison of deep reinforcement learning approaches for intelligent game playing," in *Proceedings of 2019 IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC)*, Las Vegas, NV, 2019, pp. 366-371.
2. M. N. Moghadasi, A. T. Haghighat, and S. S. Ghidary, "Evaluating Markov decision process as a model for decision making under uncertainty environment," in *Proceedings of 2007 International Conference on Machine Learning and Cybernetics*, Hong Kong, China, 2007, pp. 2446-2450.
3. D. Lee and B. Park, "Comparison of deep learning activation functions for performance improvement of a 2D shooting game learning agent," *The Journal of the Institute of Internet, Broadcasting and Communication*, vol. 19, no. 2, pp. 135-141, 2019.
4. R. Yamashita, M. Nishio, R. K. G. Do, and K. Togashi, "Convolutional neural networks: an overview and application in radiology," *Insights into Imaging*, vol. 9, no. 4, pp. 611-629, 2018.
5. D. W. Lu, 2017 (Online). Available: https://arxiv.org/abs/1707.07338
6. G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, 2016 (Online). Available: https://arxiv.org/abs/1606.01540
7. L. Lu, Y. Shin, Y. Su, and G. E. Karniadakis, 2019 (Online). Available: https://arxiv.org/abs/1903.06733
8. X. Zhang, Y. Zou, and W. Shi, "Dilated convolution neural network with LeakyReLU for environmental sound classification," in *Proceedings of 2017 22nd International Conference on Digital Signal Processing (DSP)*, London, UK, 2017, pp. 1-5.
9. A. Shah, E. Kadam, H. Shah, S. Shinde, and S. Shingade, "Deep residual networks with exponential linear unit," in *Proceedings of the 3rd International Symposium on Computer Vision and the Internet*, Jaipur, India, 2016, pp. 59-65.
10. Z. Huang, T. Ng, L. Liu, H. Mason, X. Zhuang, and D. Liu, 2019 (Online). Available: https://arxiv.org/abs/1910.01992
11. G. C. Tripathi, M. Rawat, and K. Rawat, "Swish activation based deep neural network predistorter for RF-PA," in *Proceedings of 2019 IEEE Region 10 Conference (TENCON)*, Kochi, India, 2019, pp. 1239-1242.
12. Z. Wang and X. Xu, "Efficient deep convolutional neural networks using CReLU for ATR with limited SAR images," *The Journal of Engineering*, vol. 2019, no. 21, pp. 7615-7618, 2019.