Activation Functions with Derivatives and Python Code: Sigmoid vs Tanh vs ReLU

A AKSHAY
5 min read · Feb 4, 2020

Hi friends,

Here I want to discuss activation functions in neural networks. There are already many articles on activation functions, but in this one I want to cover everything about them in one place: their derivatives, Python code, and when to use each one.

This article will cover:

a) Why activation?

b) Important activation functions

c) Function equations and their derivatives

d) Python code

e) When to use each one, and which one to use

Why Activation Function?

In a neural network, every neuron performs two computations:

a) Linear summation of inputs: consider a neuron with two inputs x1 and x2, weights w1 and w2, and a bias b. The neuron first computes

sum = (w1*x1 + w2*x2) + b

b) Activation computation. This computation decides whether the neuron should be activated or not, based on the weighted sum plus the bias. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
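As a minimal sketch (the input values, weights, and bias below are made up purely for illustration), a single neuron's two computations look like this in Python:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative values for the inputs, weights, and bias
x1, x2 = 0.5, -1.2
w1, w2 = 0.8, 0.3
b = 0.1

# a) Linear summation of inputs
z = (w1 * x1 + w2 * x2) + b

# b) Activation computation (sigmoid chosen here as an example)
a = sigmoid(z)
print(z, a)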

Why do we need non-linear activation functions?
A neural network without an activation function is essentially just a linear regression model. The activation function applies a non-linear transformation to its input, making the network capable of learning and performing more complex tasks.
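To see why, here is a small sketch (with random, made-up weight matrices and biases omitted for brevity) showing that stacking two layers without an activation collapses into a single linear layer:

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # weights of "layer 1" (illustrative)
W2 = rng.normal(size=(2, 4))   # weights of "layer 2" (illustrative)
x = rng.normal(size=3)

# Two layers with no activation in between...
y = W2 @ (W1 @ x)

# ...are the same as one linear layer with weights W2 @ W1
print(np.allclose(y, (W2 @ W1) @ x))   # True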

Types of Activation function:

  1. Sigmoid
  2. Tanh (hyperbolic tangent)
  3. ReLU (Rectified Linear Unit)

Now we will look at each of these.

1) Sigmoid:

It is also called the logistic activation function.

f(x) = 1/(1 + exp(-x)); the function's range is (0, 1).

Derivative of sigmoid:

Apply the quotient rule, d(u/v) = (v*du - u*dv)/v², with u = 1 and v = 1 + exp(-x):

df(x) = [(1 + exp(-x))*d(1) - 1*d(1 + exp(-x))]/(1 + exp(-x))²

d(1) = 0,

d(1 + exp(-x)) = d(1) + d(exp(-x)) = -exp(-x), so

df(x) = exp(-x)/(1 + exp(-x))²

df(x) = [1/(1 + exp(-x))]*[1 - 1/(1 + exp(-x))]

df(x) = f(x)*(1 - f(x))

Python Code:

import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    s = 1/(1 + np.exp(-x))
    ds = s*(1 - s)
    return s, ds

x = np.arange(-6, 6, 0.01)

# Set up centered axes
fig, ax = plt.subplots(figsize=(9, 5))
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')

# Create and show the plot
ax.plot(x, sigmoid(x)[0], color="#307EC7", linewidth=3, label="sigmoid")
ax.plot(x, sigmoid(x)[1], color="#9621E2", linewidth=3, label="derivative")
ax.legend(loc="upper right", frameon=False)
plt.show()
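As a quick sanity check (a minimal sketch, not part of the listing above), we can compare the analytic derivative f(x)*(1 - f(x)) with a numerical finite-difference approximation:

import numpy as np

def sigmoid(x):
    s = 1/(1 + np.exp(-x))
    return s, s*(1 - s)

x = np.arange(-6, 6, 0.01)
h = 1e-5

# Central finite difference: (f(x+h) - f(x-h)) / (2h)
numeric = (sigmoid(x + h)[0] - sigmoid(x - h)[0]) / (2*h)
analytic = sigmoid(x)[1]

print(np.max(np.abs(numeric - analytic)))  # prints a very small number, close to 0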

Observations:

(i) The sigmoid function's output values lie between 0 and 1.

(ii) The output is not zero-centered.

(iii) Sigmoids saturate and kill gradients.

(iv) At the top and bottom of the sigmoid curve the function changes very slowly; if you calculate the slope (gradient) there, it is nearly zero, as shown in the derivative curve above.

Problem with sigmoid:

Because of this, when x is very small or very large the slope is nearly zero → the gradient vanishes → there is no learning.
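For example (a small illustrative check), the sigmoid derivative shrinks rapidly as |x| grows:

import numpy as np

def sigmoid_derivative(x):
    s = 1/(1 + np.exp(-x))
    return s*(1 - s)

for x in [0, 2, 5, 10]:
    print(x, sigmoid_derivative(x))
# At x = 0 the gradient is 0.25 (its maximum);
# by x = 10 it is about 4.5e-05, so earlier layers barely learn.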

When to use sigmoid:

(i) If you want an output value between 0 and 1, use sigmoid at the output-layer neuron only.

(ii) When you are solving a binary classification problem, use sigmoid at the output.

Otherwise, sigmoid is not preferred. A small example of the binary-classification case follows below.
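Here is a minimal sketch (with a made-up score) of using sigmoid at the output of a binary classifier:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

logit = 1.7          # raw score from the last layer (illustrative value)
p = sigmoid(logit)   # probability of the positive class
label = int(p >= 0.5)
print(p, label)      # ~0.85, predicted class 1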

2) Tanh (hyperbolic tangent):

The tanh function is just another possible function that can be used as a nonlinear activation between layers of a neural network. It shares a few things in common with the sigmoid activation function, and the two curves look very similar. But while the sigmoid function maps input values to the range (0, 1), tanh maps them to the range (-1, 1).

You will also notice that tanh is a lot steeper around zero.

Like the sigmoid function, one of the interesting properties of the tanh function is that its derivative can be expressed in terms of the function itself. Below is the formula for the tanh function along with the formula for calculating its derivative.

Derivative of tanh(z):

a = (e^z - e^(-z))/(e^z + e^(-z))

Apply the same quotient rule:

da = [(e^z + e^(-z))*d(e^z - e^(-z)) - (e^z - e^(-z))*d(e^z + e^(-z))]/(e^z + e^(-z))²

da = [(e^z + e^(-z))*(e^z + e^(-z)) - (e^z - e^(-z))*(e^z - e^(-z))]/(e^z + e^(-z))²

da = [(e^z + e^(-z))² - (e^z - e^(-z))²]/(e^z + e^(-z))²

da = 1 - [(e^z - e^(-z))/(e^z + e^(-z))]²

da = 1 - a²

Python code:

import matplotlib.pyplot as plt
import numpy as np

def tanh(x):
    t = (np.exp(x) - np.exp(-x))/(np.exp(x) + np.exp(-x))
    dt = 1 - t**2
    return t, dt

z = np.arange(-4, 4, 0.01)

# Set up centered axes
fig, ax = plt.subplots(figsize=(9, 5))
ax.spines['left'].set_position('center')
ax.spines['bottom'].set_position('center')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')

# Create and show the plot
ax.plot(z, tanh(z)[0], color="#307EC7", linewidth=3, label="tanh")
ax.plot(z, tanh(z)[1], color="#9621E2", linewidth=3, label="derivative")
ax.legend(loc="upper right", frameon=False)
plt.show()
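To back up the "steeper" comparison above, here is a quick illustrative check of the two derivatives at x = 0:

import numpy as np

def sigmoid(x):
    s = 1/(1 + np.exp(-x))
    return s, s*(1 - s)

def tanh(x):
    t = np.tanh(x)
    return t, 1 - t**2

print(sigmoid(0.0)[1])  # 0.25 -> maximum slope of sigmoid
print(tanh(0.0)[1])     # 1.0  -> maximum slope of tanh, four times steeper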

Observations:

(i) Its output is zero-centered, because its range is (-1, 1), i.e. -1 < output < 1.

(ii) Hence optimization is easier, and in practice tanh is generally preferred over the sigmoid function for hidden layers.

But it still suffers from the vanishing gradient problem.

When to use tanh:

Tanh is usually used in the hidden layers of a neural network because its values lie between -1 and 1, so the mean of the hidden-layer activations comes out to be 0 or very close to it. This helps center the data by bringing the mean close to 0, which makes learning for the next layer much easier.
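As a rough illustration (random synthetic data, not a real network), tanh activations on zero-mean inputs stay roughly zero-mean, while sigmoid activations do not:

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=100_000)   # zero-mean pre-activations (synthetic)

sig = 1/(1 + np.exp(-z))
tan = np.tanh(z)

print(sig.mean())  # ~0.5 -> not zero-centered
print(tan.mean())  # ~0.0 -> roughly zero-centered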

3) ReLU:

Equation: A(x) = max(0, x). It outputs x if x is positive and 0 otherwise.

Value range: [0, inf)

Nature: non-linear, which means we can easily backpropagate the errors and have multiple layers of neurons activated by the ReLU function.

Uses: ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At any given time only some of the neurons are activated, which makes the network sparse and therefore efficient and easy to compute.

It avoids and rectifies the vanishing gradient problem (for positive inputs). Almost all deep learning models use ReLU nowadays.
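Python code (a minimal sketch in the same style as the plots above; the derivative at x = 0 is taken as 0 by convention):

import matplotlib.pyplot as plt
import numpy as np

def relu(x):
    r = np.maximum(0, x)
    dr = (x > 0).astype(float)   # derivative: 1 for x > 0, else 0
    return r, dr

x = np.arange(-4, 4, 0.01)

# Set up centered axes
fig, ax = plt.subplots(figsize=(9, 5))
ax.spines['left'].set_position('center')
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')

# Create and show the plot
ax.plot(x, relu(x)[0], color="#307EC7", linewidth=3, label="ReLU")
ax.plot(x, relu(x)[1], color="#9621E2", linewidth=3, label="derivative")
ax.legend(loc="upper right", frameon=False)
plt.show()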

But its limitation is that it should only be used within the hidden layers of a neural network model.

Another problem with ReLU is that some gradients can be fragile during training and can die: a weight update can push a neuron into a region where it never activates on any data point again. Put simply, ReLU can result in dead neurons.

To fix the problem of dying neurons, another modification called Leaky ReLU was introduced. It adds a small slope for negative inputs to keep the updates alive, as the small sketch below shows.
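A minimal sketch of Leaky ReLU and its derivative (the slope 0.01 for negative inputs is a common choice, used here just for illustration):

import numpy as np

def leaky_relu(x, alpha=0.01):
    lr = np.where(x > 0, x, alpha * x)   # small slope alpha for x <= 0
    dlr = np.where(x > 0, 1.0, alpha)    # gradient never becomes exactly 0
    return lr, dlr

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(leaky_relu(x)[0])  # [-0.03  -0.005  0.     0.5    3.   ]
print(leaky_relu(x)[1])  # [ 0.01   0.01   0.01   1.     1.  ]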

We then have another variant, built from both ReLU and Leaky ReLU, called the Maxout function.

So that's it. I hope you can now reason about the most-used activation functions mathematically, graphically, and through code.

Feel free to comment…


Thank you all…!
