Object Detection and Scene Recognition

Jun 2021 | Research Papers


In the past decade, artificial intelligence has been used in several fields of science to increase speed, accuracy, and efficiency, as well as lessen human error in traditional scientific processes (Malik & Baharudin, 2013). Artificial intelligence coupled with image recognition (also known as computer vision) has been widely employed in the medical field, where image feature extraction focuses mostly on color, texture, shapes, and other low-level features (Malik & Baharudin, 2013).

Image recognition refers to the process of automatically describing the content, events, and relationships between objects in an image (Miller & Evert, 2018). Two major subsets of image recognition commonly used in industrial applications are object detection and scene recognition.

Object detection is a process that pinpoints different components of an image and assigns labels to them according to preset object categories and their locations (coordinates) within the image. These components are enclosed in bounding boxes in order to determine their exact locations in the image. Creating object detection algorithms is one of the biggest challenges in computer vision (Shaifee, Chywl, Li & Wong, 2017). Current algorithms are becoming broader and more robust with region-based convolutional neural networks (R-CNNs) (Singh, Girish & Ralescu, 2017) and deep neural networks (Shaifee et al., 2017).

In the context of computer vision, scene recognition requires that a scene is detected, recognized, and understood (Aarthi & Chitrakala, 2017). Results are in the form of image-level “tags” for place categories and scene attributes (e.g., indoor, outdoor, no person). Basic visual features are first detected (e.g., edges, corners), and other higher-level visual information (e.g., colors, luminance, and textures) is used to characterize scenes. A scene recognition system must be robust enough to address the variations of the current scene category and effectively communicate learnings and predictions to users and other systems.


Related Literature

Object Detection

Object detection algorithms that use deep neural networks have reached state-of-the-art accuracy and outperformed earlier approaches. Convolutional neural networks (CNNs), particularly region-based CNN (R-CNN) approaches, make detection more efficient by first segmenting the image into candidate object regions, so that the classification algorithm runs only on those regions rather than as a sliding window across the whole image. Although each object category must still be trained individually, this method decreases runtime by removing many of the redundant iterations of the sliding-window method (Shaifee et al., 2017).
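To make the runtime argument concrete, the following minimal sketch counts how many classifier calls a plain single-scale sliding window would need on one image. The image size, window size, and stride are illustrative assumptions and are not taken from the cited papers.

```python
def sliding_window_positions(image_w, image_h, win_w, win_h, stride):
    """Yield the top-left corner of every window that fits inside the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield x, y


def count_windows(image_w, image_h, win_w, win_h, stride):
    """Number of classifier calls needed for one window size (one scale)."""
    positions = sliding_window_positions(image_w, image_h, win_w, win_h, stride)
    return sum(1 for _ in positions)


# Illustrative values only (assumptions, not from the cited papers).
windows = count_windows(image_w=640, image_h=480, win_w=64, win_h=64, stride=8)
print(windows)  # 3869 windows for a single 64x64 scale
```

The count grows further when several window sizes and aspect ratios are scanned, which is the cost that region-based approaches avoid by classifying only the segmented regions.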

One approach, popularly called You Only Look Once (YOLO), was proposed recently to further optimize object detection. In this approach, segmentation (bounding boxes) and classification (computation of category probabilities) are computed simultaneously, resulting in significantly shorter runtimes compared to R-CNN (Shaifee et al., 2017).
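The idea of producing boxes and category probabilities in one pass can be sketched as follows. This is a simplified illustration, assuming each grid cell predicts a single box with a confidence value and a set of class probabilities; the `CellPrediction` type, the class names, and the numbers are hypothetical and do not come from a trained model.

```python
from dataclasses import dataclass


@dataclass
class CellPrediction:
    """Hypothetical output of one grid cell from a single forward pass."""
    box: tuple          # (x_center, y_center, width, height), relative units
    confidence: float   # how sure the model is that the box holds an object
    class_probs: dict   # category -> probability, given that an object exists


def best_label(cell):
    """Combine box confidence and class probability into one score per class."""
    scores = {name: cell.confidence * p for name, p in cell.class_probs.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]


# Made-up numbers for illustration only.
cell = CellPrediction(
    box=(0.5, 0.5, 0.2, 0.4),
    confidence=0.8,
    class_probs={"person": 0.9, "dog": 0.1},
)
label, score = best_label(cell)
print(label, round(score, 2))
```

Because the box and the class probabilities come from the same prediction, no separate classification pass over candidate regions is required.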

Scene Recognition


The Scene UNderstanding (SUN) dataset was built through the use of WordNet. Specifically, 70,000 words that describe general places were selected. By clustering synonyms and separating homonyms, 900 scene categories were produced (Zhou, Lapedriza, Khosla, Oliva, & Torralba, 2018).

The latest version of Places365 provides an extensive list of place categories. Places365-Standard has around 1.8 million training images divided into 365 scene categories, with at most 5,000 images per category (Zhou et al., 2018). There are 50 and 900 images per category in the validation and test sets, respectively. PCA-based duplicate removal was done within each scene category in the Places365 and SUN datasets; hence, neither dataset contains duplicate images within a category (Zhou et al., 2018).
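The dataset figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above (365 categories, at most 5,000 training images and 50 validation images per category); the variable names are illustrative.

```python
NUM_CATEGORIES = 365
MAX_TRAIN_PER_CATEGORY = 5000
VAL_PER_CATEGORY = 50

# Upper bound on the training set if every category reached the cap.
max_train_images = NUM_CATEGORIES * MAX_TRAIN_PER_CATEGORY

# Exact size of the validation set.
val_images = NUM_CATEGORIES * VAL_PER_CATEGORY

print(max_train_images)  # 1825000
print(val_images)        # 18250
```

The upper bound of 1,825,000 is consistent with the reported size of around 1.8 million images, which suggests that most categories are at or near the per-category cap.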

Pre-trained CNNs

Four CNN architectures were tested by training them on Places365-Standard: AlexNet, GoogLeNet, VGG, and ResNet. Their accuracies are summarized in Table 1.

Industry and Academic Benchmarks

Object Detection

The most notable benchmark in object detection, spanning hundreds of object categories and millions of images, is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The challenge started in 2010 with more than fifty participating organizations. The publicly available ILSVRC dataset allows open-source development of algorithms and continuous tracking and improvement of models through yearly contest entries. The dataset consists of manually labeled training images and unlabeled test images, the latter used to determine the accuracy of the model entries (Russakovsky et al., 2015).

There are two levels of image annotations as defined by the ILSVRC: (1) image-level annotation, wherein the absence or presence of an object is determined; and (2) object-level annotation, wherein the location of the bounding box enclosing a specific object is determined within the image. Tasks in the ILSVRC fall into three categories:

  • Image classification – A list of object categories found in the image is produced
  • Single-object localization – Bounding boxes are produced that indicate a single appearance of each object within a category
  • Object detection – Bounding boxes are produced that indicate all appearances of objects in all possible categories
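The difference between the three tasks is easiest to see in the shape of their outputs. The sketch below derives all three from one set of object-level annotations; the `Annotation` type, the coordinates, and the choice of keeping the first instance per category are assumptions made for illustration, not the ILSVRC's own data format.

```python
from collections import namedtuple

# Hypothetical object-level annotation: a category plus its bounding box.
Annotation = namedtuple("Annotation", ["category", "x_min", "y_min", "x_max", "y_max"])

annotations = [
    Annotation("dog", 10, 20, 110, 140),
    Annotation("dog", 200, 30, 300, 150),
    Annotation("ball", 120, 100, 150, 130),
]


def image_classification(anns):
    """Image-level output: which categories are present, without locations."""
    return sorted({a.category for a in anns})


def single_object_localization(anns):
    """One box per category (here, simply the first instance encountered)."""
    first_seen = {}
    for a in anns:
        first_seen.setdefault(a.category, a)
    return list(first_seen.values())


def object_detection(anns):
    """Every instance of every category, each with its own box."""
    return list(anns)


print(image_classification(annotations))             # ['ball', 'dog']
print(len(single_object_localization(annotations)))  # 2
print(len(object_detection(annotations)))            # 3
```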

Scene Recognition

The scene recognition tool can be used for customer segmentation. For instance, survey companies can use images processed by this tool when conducting research. The following companies have employed image recognition schemes to address various problems.

  • Google’s Cloud Vision API. Much like Clarifai, Cloud Vision offers pretrained models (Vision API) and allows users to train custom models (AutoML Vision) for object recognition. Google aims to equip less experienced developers with high-level machine-learning-driven image recognition tools for various use cases. User access is via a REST API. (Website: https://cloud.google.com/vision/)
  • Amazon Rekognition API. Amazon’s Rekognition offers the following main features: object detection, facial recognition, facial analysis, object tracking, unsafe content detection, and text detection in images. Access is also via API, and video libraries are searchable via “index” inputs. Amazon suggests various use cases involving image and video inputs such as searching for missing persons via social media videos, video content screening, and facial verification. (Website: https://aws.amazon.com/rekognition/)
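As an illustration of what access via a REST API involves, the sketch below builds the JSON body for a label-detection request to Cloud Vision without sending it. The endpoint and field names reflect the public v1 REST interface as understood at the time of writing and should be checked against the current documentation; the helper name `build_label_request` is hypothetical, and authentication is omitted.

```python
import base64
import json

# Assumed v1 endpoint; verify against the current Cloud Vision documentation.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_label_request(image_bytes, max_results=10):
    """Build the request body asking for image-level labels (tags)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [
            {
                "image": {"content": encoded},
                "features": [
                    {"type": "LABEL_DETECTION", "maxResults": max_results}
                ],
            }
        ]
    }


# Placeholder bytes stand in for a real image file.
payload = build_label_request(b"placeholder image bytes", max_results=5)
body = json.dumps(payload)
print(len(payload["requests"]))  # 1
```

The resulting body would be sent as an authenticated HTTP POST to the endpoint; the response is a JSON document containing the detected labels and their scores.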


This project aims to build object detection and scene recognition systems for future products. Tags, which include object and scene attributes, will be produced when an image is uploaded for processing.


  1.   Design the AI approach for object detection and scene recognition. YOLO and Places365 databases were constructed to develop the object detection and scene recognition systems, respectively.
  2.   Preprocess the data. Data from YOLO and Places365 were stored in a common database for easy data ingestion.
  3.   Implement the models.

○   Train and test both the object detection and scene recognition models. Both models will undergo an iterative learning process of scene attributes from the training images. The scene recognition algorithm will then be able to identify scenes.

○   Classify images. The trained models are capable of (1) detecting objects; and (2) recognizing scenes and classifying them into categories.

  4.   Assess results. The performance of the trained models for the object detection and scene recognition systems will be evaluated using accuracy, specificity, and sensitivity.
  5.   Package the algorithms into two APIs: (1) object detection API and (2) scene recognition API.
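The three evaluation metrics named in the assessment step can be computed from a confusion matrix. The following minimal sketch treats one tag as the positive class; the function names and the sample labels are illustrative assumptions, not project results.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive):
    """Return accuracy, sensitivity (recall), and specificity."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Made-up labels for illustration only.
y_true = ["indoor", "indoor", "outdoor", "outdoor", "indoor"]
y_pred = ["indoor", "outdoor", "outdoor", "indoor", "indoor"]
metrics = evaluate(y_true, y_pred, positive="indoor")
print(metrics)
```

For a multi-class problem such as scene categories, the same computation can be repeated with each category in turn as the positive class.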