Image Classification with EfficientNet: Better performance with computational efficiency

Anand Borad
8 min read · Dec 13, 2019

In May 2019, two researchers from the Google Brain team, Mingxing Tan and Quoc V. Le, published a paper called “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks”. The core idea of the publication was strategically scaling deep neural networks, but it also introduced a new family of neural nets: EfficientNets.

EfficientNets, as the name suggests, are very efficient computationally and also achieved a state-of-the-art result on the ImageNet dataset: 84.4% top-1 accuracy.

So, in this article, we will discuss EfficientNets in detail but first, we will talk about the core idea introduced in the paper, model scaling.

Model scaling is about scaling an existing model in terms of model depth, model width, and (less commonly) input image resolution to improve its performance. Depth-wise scaling is the most popular of the three, e.g. ResNet can be scaled from ResNet18 to ResNet200. Here, ResNet18 has 18 layers and can be scaled in depth up to the 200 layers of ResNet200.

ResNet200 delivers better performance than ResNet18, so manual scaling works pretty well. But there is one problem with the traditional manual scaling method: after a certain point, scaling no longer improves performance and starts to degrade it.

The scaling method introduced in the paper is called compound scaling. It suggests that instead of scaling only one model attribute out of depth, width, and resolution, strategically scaling all three of them together delivers better results.

Compound scaling

The compound scaling method uses a compound coefficient φ to scale width, depth, and resolution together. Below is the formula for the scaled attributes:
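Written out (this is Equation 3 from the paper, rendered in LaTeX rather than as the original image), with d, w, and r denoting the depth, width, and resolution multipliers:

$$\text{depth: } d = \alpha^{\phi}, \qquad \text{width: } w = \beta^{\phi}, \qquad \text{resolution: } r = \gamma^{\phi}$$

$$\text{s.t. } \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1$$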

Here, alpha, beta, and gamma are the scaling multipliers for depth, width, and resolution respectively, and they are obtained using a grid search. Let’s say we get alpha = 1.2 after solving the above equation; then, for φ = 1, new depth = 1.2 * old depth.

φ is a user-specified coefficient, a real number that controls how many resources are available for scaling; the total resources grow roughly as 2^φ. So if we have double the resources available compared to what the model currently uses, we can find φ from 2^φ = 2, and hence φ = 1 in that case.
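To make this concrete, here is a small sketch of the arithmetic in Python. The multiplier values α = 1.2, β = 1.1, γ = 1.15 are the ones the paper reports for the EfficientNet baseline; everything else is just illustration:

# Illustrative sketch of compound scaling.
# alpha, beta, gamma are fixed by a grid search on the baseline network;
# the values below are the ones reported in the EfficientNet paper.
alpha, beta, gamma = 1.2, 1.1, 1.15

phi = 1  # user-chosen coefficient; resources grow roughly as 2**phi

depth_multiplier = alpha ** phi        # scales the number of layers
width_multiplier = beta ** phi         # scales the number of channels
resolution_multiplier = gamma ** phi   # scales the input image size

# The constraint alpha * beta**2 * gamma**2 ~= 2 keeps the total FLOPS
# growing roughly as 2**phi.
print(alpha * beta ** 2 * gamma ** 2)  # ~1.92, close to 2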

Model Scaling. (a) is a baseline network example; (b)-(d) are conventional scaling that only increases one dimension of network width, depth, or resolution. (e) is our proposed compound scaling method that uniformly scales all three dimensions with a fixed ratio.

Using this compound scaling method, they achieved impressive improvements on the MobileNet and ResNet architectures, as shown in the image below.

Scaling Up MobileNets and ResNet

The authors observed that compound scaling can be applied to any CNN architecture and works well, but the overall performance depends very much on the baseline architecture. With that observation in mind, they came up with a brand-new baseline architecture and named it EfficientNet-B0.

The base model of EfficientNet family, EfficientNet-B0

The EfficientNet-B0 architecture wasn’t hand-designed by engineers; it was found by a neural network search itself. The authors developed this model using a multi-objective neural architecture search that optimizes both accuracy and floating-point operations (FLOPS).

Taking B0 as a baseline model, the authors developed a full family of EfficientNets, from B1 to B7, which achieved state-of-the-art accuracy on ImageNet while being far more efficient than its competitors.

Below is a table showing the performance of the EfficientNet family on the ImageNet dataset.

EfficientNet Performance Results on ImageNet (Russakovsky et al., 2015). All EfficientNet models are scaled from our baseline EfficientNet-B0 using different compound coefficient φ in Equation 3. ConvNets with similar top-1/top-5 accuracy are grouped for efficiency comparison. Our scaled EfficientNet models consistently reduce parameters and FLOPS by an order of magnitude (up to 8.4x parameter reduction and up to 16x FLOPS reduction) than existing ConvNets.

Here, we will deep dive into the EfficientNet-B0 architecture. B0 is a mobile-sized architecture with about 5.3M trainable parameters.

Before moving ahead, let’s see what this new architecture looks like:

One can see that the architecture uses 7 kinds of inverted residual blocks, each with different settings. These blocks also use a squeeze-and-excitation block along with the Swish activation. We will discuss all three in detail in this article. Let’s start with Swish.

Swish Activation

ReLU works pretty well, but it has a problem: it nullifies negative values, so its derivative is zero for all negative inputs. There are many known alternatives that try to tackle this problem, like Leaky ReLU, ELU, SELU, etc., but none of them has proven consistently better.

The Google Brain team suggested a newer activation, Swish, that tends to work better than ReLU for deeper networks. They showed that if we replace ReLU with Swish in Inception-ResNet-v2, we can achieve 0.6% more accuracy on the ImageNet dataset.

Swish is a multiplication of a linear and a sigmoid activation.

Swish(x) = x * sigmoid(x)

from keras import backend as K

def swish_activation(x):
    return x * K.sigmoid(x)

Swish looks as shown in the image below:

Swish activation

Its gradient looks as shown in the image below:

Derivatives of swish
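For reference, the curve plotted above is just the product rule applied to x · σ(x), where σ is the sigmoid:

$$\frac{d}{dx}\,\mathrm{swish}(x) = \sigma(x) + x\,\sigma(x)\bigl(1 - \sigma(x)\bigr) = \mathrm{swish}(x) + \sigma(x)\bigl(1 - \mathrm{swish}(x)\bigr)$$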

Inverted Residual Block

The idea of the inverted residual block was introduced in the MobileNetV2 architecture. MobileNetV2 uses a depthwise separable convolution inside the block, i.e. a depthwise convolution followed by a pointwise (1x1) convolution. This approach reduces the number of trainable parameters considerably.

In the original residual block (introduced in ResNet), skip connections are used to connect wide layers (i.e. layers with a large number of channels), while the layers inside the block are narrow (have fewer channels).

Residual block

The inverted residual block does the opposite: skip connections connect narrow layers, while the wider layers sit between the skip connections.

Inverted residual block

The code for the inverted residual block is as below:

from keras.layers import Conv2D, DepthwiseConv2D, Add

def inverted_residual_block(x, expand=64, squeeze=16):
    # Expand with a 1x1 convolution, filter with a 3x3 depthwise convolution,
    # then project back down to `squeeze` channels with another 1x1 convolution.
    block = Conv2D(expand, (1, 1), activation='relu')(x)
    block = DepthwiseConv2D((3, 3), padding='same', activation='relu')(block)
    block = Conv2D(squeeze, (1, 1), activation='relu')(block)
    # The skip connection requires x to already have `squeeze` channels.
    return Add()([block, x])
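A quick, hypothetical usage example to show the shape constraint: because of the final Add, the input tensor must already have `squeeze` channels.

from keras.layers import Input

# Hypothetical example: a (32, 32, 16) feature map, so the skip connection
# (which needs `squeeze` = 16 channels) lines up with the block output.
inputs = Input(shape=(32, 32, 16))
outputs = inverted_residual_block(inputs, expand=64, squeeze=16)
# `outputs` has the same shape as `inputs`: (None, 32, 32, 16)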

Learn more about inverted residual blocks here:

Squeeze and Excitation Block

When a CNN produces an output feature map from a convolutional layer, it gives equal weightage to each of the channels. The squeeze-and-excitation (SE) block is a method to give a weightage to each channel instead of treating them all equally.

The SE block produces an output of shape (1 x 1 x channels) which specifies the weightage for each channel, and the great thing is that the network can learn these weightages by itself, just like its other parameters.

Below is the code:

from keras.layers import GlobalAveragePooling2D, Reshape, Conv2D, Multiply

def se_block(x, filters, squeeze_ratio=0.25):
    # Squeeze: global average pooling turns (h, w, filters) into (filters,)
    x_ = GlobalAveragePooling2D()(x)
    x_ = Reshape((1, 1, filters))(x_)
    # Excitation: two 1x1 convolutions produce a per-channel weight in (0, 1)
    squeezed_filters = max(1, int(filters * squeeze_ratio))
    x_ = Conv2D(squeezed_filters, (1, 1), activation='relu')(x_)
    x_ = Conv2D(filters, (1, 1), activation='sigmoid')(x_)
    # Reweight each channel of x by its learned weight
    return Multiply()([x, x_])
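And a small, hypothetical usage example, reweighting a 64-channel feature map:

from keras.layers import Input

# Hypothetical example: the `filters` argument must match the channel count of x.
inputs = Input(shape=(32, 32, 64))
reweighted = se_block(inputs, filters=64, squeeze_ratio=0.25)
# `reweighted` has the same shape as `inputs`: (None, 32, 32, 64)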

Learn more here about SE networks.

EfficientNet’s MBConv Block

Now that we have had a brief introduction to all three building blocks used in EfficientNets, let’s see what an MBConv block looks like.

Below is the code, inspired by this brilliant GitHub repository on EfficientNet.

The MBConv block takes two inputs: the first is the data and the other is the block arguments. The data is the output from the previous layer. The block arguments are a collection of attributes to be used inside an MBConv block, such as input filters, output filters, expansion ratio, squeeze ratio, etc. The block arguments for the B0 model are as below:

argument_block = [
    BlockArgs(kernel_size=3, num_repeat=1, input_filters=32, output_filters=16, expand_ratio=1, id_skip=True, strides=[1, 1], se_ratio=0.25),
    BlockArgs(kernel_size=3, num_repeat=2, input_filters=16, output_filters=24, expand_ratio=6, id_skip=True, strides=[2, 2], se_ratio=0.25),
    BlockArgs(kernel_size=5, num_repeat=2, input_filters=24, output_filters=40, expand_ratio=6, id_skip=True, strides=[2, 2], se_ratio=0.25),
    BlockArgs(kernel_size=3, num_repeat=3, input_filters=40, output_filters=80, expand_ratio=6, id_skip=True, strides=[2, 2], se_ratio=0.25),
    BlockArgs(kernel_size=5, num_repeat=3, input_filters=80, output_filters=112, expand_ratio=6, id_skip=True, strides=[1, 1], se_ratio=0.25),
    BlockArgs(kernel_size=5, num_repeat=4, input_filters=112, output_filters=192, expand_ratio=6, id_skip=True, strides=[2, 2], se_ratio=0.25),
    BlockArgs(kernel_size=3, num_repeat=1, input_filters=192, output_filters=320, expand_ratio=6, id_skip=True, strides=[1, 1], se_ratio=0.25)
]

EfficientNet-B0 uses 7 MBConv block configurations, and above are the specifications (block arguments) for each of those blocks respectively; a minimal sketch of BlockArgs itself follows the attribute list below.

  • kernel_size is the kernel size for the convolution, e.g. 3 x 3
  • num_repeat specifies how many times a particular block needs to be repeated; it must be greater than zero
  • input_filters and output_filters are the numbers of input and output filters
  • expand_ratio is the expansion ratio for the input filters
  • id_skip indicates whether to use a skip connection or not
  • se_ratio provides the squeezing ratio for the squeeze and excitation block
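For completeness, BlockArgs itself can be thought of as a simple named tuple holding these attributes. The sketch below is an assumption about its shape (the referenced repository defines something similar); only the field names come from the list above:

import collections

# A minimal, hypothetical BlockArgs: just a named tuple of the attributes above.
BlockArgs = collections.namedtuple('BlockArgs', [
    'kernel_size', 'num_repeat', 'input_filters', 'output_filters',
    'expand_ratio', 'id_skip', 'strides', 'se_ratio'])

With that in place, the MBConv block starts by unpacking its arguments: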
def mbConv_block(input_data, block_arg):
    """Mobile inverted residual block along with squeeze
    and excitation block."""
    kernel_size = block_arg.kernel_size
    num_repeat = block_arg.num_repeat
    input_filters = block_arg.input_filters
    output_filters = block_arg.output_filters
    expand_ratio = block_arg.expand_ratio
    id_skip = block_arg.id_skip
    strides = block_arg.strides
    se_ratio = block_arg.se_ratio
    # continue...

Expansion phase: we expand the layer and make it wider, as described for the inverted residual block (the connected layers are narrow and the inner layers are wider; here we make the layer wider simply by increasing the number of channels).

    # continue...
    expanded_filters = input_filters * expand_ratio
    x = Conv2D(expanded_filters, 1, padding='same', use_bias=False)(input_data)
    x = BatchNormalization()(x)
    x = Activation(swish_activation)(x)
    # continue...

Depthwise convolution phase: after expansion, we perform a depthwise convolution with the kernel size mentioned in the block arguments.

    # continue...
    x = DepthwiseConv2D(kernel_size, strides, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = Activation(swish_activation)(x)
    # continue...

Squeeze and excitation phase: now we extract global features with global average pooling and squeeze the number of channels using se_ratio.

    # continue...
    se = GlobalAveragePooling2D()(x)
    se = Reshape((1, 1, expanded_filters))(se)
    squeezed_filters = max(1, int(input_filters * se_ratio))
    se = Conv2D(squeezed_filters, 1, activation=swish_activation, padding='same')(se)
    se = Conv2D(expanded_filters, 1, activation='sigmoid', padding='same')(se)
    x = multiply([x, se])
    # continue...

Here, the shape of the se branch is (1, 1, expanded_filters) and the shape of x is (h, w, expanded_filters). Thus, the output of the se branch can be considered a weightage for each channel in x. To apply the weightage, we simply multiply the se output with x.

Output phase: after the SE block, we apply a 1x1 convolution that produces the number of output filters mentioned in the block arguments.

    # continue...
    x = Conv2D(output_filters, 1, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    return x

Below is the complete code, putting it all together.

from keras.layers import (Conv2D, DepthwiseConv2D, BatchNormalization,
                          Activation, GlobalAveragePooling2D, Reshape, multiply)

def mbConv_block(input_data, block_arg):
    """Mobile inverted residual block along with squeeze
    and excitation block."""
    kernel_size = block_arg.kernel_size
    num_repeat = block_arg.num_repeat
    input_filters = block_arg.input_filters
    output_filters = block_arg.output_filters
    expand_ratio = block_arg.expand_ratio
    id_skip = block_arg.id_skip
    strides = block_arg.strides
    se_ratio = block_arg.se_ratio

    # Expansion phase
    expanded_filters = input_filters * expand_ratio
    x = Conv2D(expanded_filters, 1, padding='same', use_bias=False)(input_data)
    x = BatchNormalization()(x)
    x = Activation(swish_activation)(x)

    # Depthwise convolution phase
    x = DepthwiseConv2D(kernel_size, strides, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    x = Activation(swish_activation)(x)

    # Squeeze and excitation phase
    se = GlobalAveragePooling2D()(x)
    se = Reshape((1, 1, expanded_filters))(se)
    squeezed_filters = max(1, int(input_filters * se_ratio))
    se = Conv2D(squeezed_filters, 1, activation=swish_activation, padding='same')(se)
    se = Conv2D(expanded_filters, 1, activation='sigmoid', padding='same')(se)
    x = multiply([x, se])

    # Output phase
    x = Conv2D(output_filters, 1, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    return x

Note that the above is a very simplistic representation of the MBConv block; an actual block is a bit more complex and considers a few more constraints (for example, the identity skip connection controlled by id_skip and the block repetition given by num_repeat are not handled here). Learn more here.
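As a hedged sketch of one of those missing pieces, here is how the seven block arguments could be expanded into the full stack of layers. It assumes BlockArgs is a namedtuple (as in the sketch earlier) and, as common implementations do, that after the first block of a stage the remaining num_repeat - 1 repetitions use stride 1 and input_filters equal to output_filters:

def build_mbConv_stages(x, argument_block):
    # Hypothetical helper, not from the article: stack the MBConv blocks stage
    # by stage according to num_repeat.
    for block_arg in argument_block:
        x = mbConv_block(x, block_arg)
        # Assumption: repeated blocks within a stage keep the spatial size and
        # channel count of the first block's output (namedtuple._replace).
        repeat_arg = block_arg._replace(input_filters=block_arg.output_filters,
                                        strides=[1, 1])
        for _ in range(block_arg.num_repeat - 1):
            x = mbConv_block(x, repeat_arg)
    return x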

EfficientNet can also take smaller images as input, but it would be overkill for a dataset like MNIST; EfficientNets are advisable for more complex datasets. We will use EfficientNet-B0 on the CIFAR10 data and train the model for 10 epochs. I have put the code for EfficientNetB0 on CIFAR10 in this Google Colab notebook so that you can play with it there.
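The notebook itself isn’t reproduced here, but a minimal sketch of the idea looks roughly like the following. It assumes a TensorFlow version (2.3 or later) that ships EfficientNetB0 in tf.keras.applications, and it trains from scratch (weights=None) for simplicity; the hyperparameters are illustrative, not tuned:

import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal sketch: EfficientNet-B0 backbone with a 10-way classifier on CIFAR10.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

base = tf.keras.applications.EfficientNetB0(
    include_top=False, weights=None, input_shape=(32, 32, 3), pooling='avg')
outputs = layers.Dense(10, activation='softmax')(base.output)
model = models.Model(base.input, outputs)

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10,
          validation_data=(x_test, y_test), batch_size=64)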
