Combined Query-Based Image Retrieval Using Structure Elements Descriptor

Gunho Lee and Minjoong Jeong

Abstract

Similar image retrieval involves identifying and ranking images from a database based on visual attributes such as color, texture, and shape, with the goal of finding those most closely matching a given query image. This task requires precise analysis of image content to achieve accurate results. In this study, we propose an approach that incorporates structural information derived from an image segmentation model. This structural information highlights image characteristics, such as object shapes and their backgrounds, which are not fully captured by traditional dense global descriptors. By combining this structural information with global descriptors, our method captures both detailed shapes and broader image features in a user-controllable manner. Experimental results demonstrate the effectiveness of this integration approach in improving the performance of similarity search tasks.

Keywords: Global Descriptor, Image Retrieval, Structural Element Descriptor

1. Introduction

Image retrieval is a foundational task in computer vision that centers on computing the similarity between a query image and a set of images in a database. Early approaches to this task were dominated by content-based image retrieval (CBIR) [1], which focused on analyzing overall image features such as color and texture, treating the image as a unified entity. While CBIR is effective for identifying images with similar overall appearances, it faces challenges when the retrieval task involves specific objects within complex backgrounds. In response, object-based image retrieval (OBIR) [2] was introduced to target specific objects within an image while disregarding the surrounding background. Although OBIR excels in identifying particular objects, its reliance on precise object detection poses challenges, especially in cluttered or densely populated scenes.

With the advancements in deep learning, extracting dense global descriptors and detecting various objects in images have significantly improved similar image retrieval tasks. Convolutional neural networks (CNNs), for example, have proven highly effective in generating robust global features for image classification. In addition, object detection models such as YOLO [3] and Mask R-CNN [4] exhibit exceptional performance in identifying objects within images. Despite these advancements, challenges still remain when objects are unclear or when detection models fail to accurately identify them.

In this study, we propose a novel approach that integrates structural information with global descriptors to effectively capture both the local details and global attributes of images in a user-directed manner. Specifically, we enhance image similarity evaluations by merging rich feature representations derived from the VGG [5] neural network with structural insights obtained from the segment anything model (SAM) [6]. In this paper, we use the term “feature representation” to refer broadly to CNN-derived feature maps or vectors, while “global descriptor” specifically denotes the final, fixed-length vector summarizing an entire image, used for similarity comparison. Unlike conventional segmentation models restricted to predefined classes, SAM can segment arbitrary objects and backgrounds, making it particularly versatile. Our method uniquely fuses the comprehensive feature analysis provided by the global descriptors of the VGG model with the precise structural details identified by SAM. This user-conditioned framework allows for customizable focus in image retrieval tasks, enabling a flexible emphasis on either global features, structural elements, or a balanced integration of both. The user-centric design ensures optimized retrieval outcomes tailored to specific search intents, enhancing both accuracy and relevance. The main contributions of this study are as follows:

1) Integration of global and structural features: Development of a framework that synergistically blends VGG global descriptors with SAM-derived structural information.

2) User-directed flexibility: Introduction of a customizable approach allowing users to control the emphasis between global and structural image features.

The remainder of this paper is organized as follows: Section 2 reviews related work, and Section 3 introduces the proposed method for similarity computation. Section 4 describes the datasets and presents the experimental results of our studies. Finally, Section 5 concludes the paper with suggestions for future research and discusses the limitations of our work.

2. Related Work

Prior to the advent of deep learning, feature extraction in image retrieval primarily relied on hand-crafted image representations, such as the Scale-Invariant Feature Transform (SIFT) [7] and bag of visual words (BoVW) [8]. These methods were effective in capturing local features but lacked the adaptability and robustness required for more complex tasks. Over the past decade, the emergence of artificial intelligence has significantly shifted the paradigm, positioning data-driven approaches as dominant alternatives to traditional hand-crafted features. The introduction of various network architectures, particularly CNNs, has catalyzed advancements in the image retrieval domain. CNNs have proven to be particularly adept at learning feature representations directly from data, supplanting the role of local descriptors like SIFT. This transition has established CNNs as fundamental tools in a range of classical computer vision tasks, including semantic segmentation [9] and image retrieval [10], driving significant improvements in accuracy and efficiency.

A comprehensive survey [1] has documented the evolution of deep learning-based image retrieval methods over the past decade, offering an in-depth classification of state-of-the-art approaches based on key factors such as supervision type, network architecture, descriptor type, and retrieval techniques. This survey highlights the paradigm shift from traditional hand-crafted features to advanced deep learning models, providing valuable insights into the progress and ongoing challenges in the field. Among the notable advancements is a hybrid approach for CBIR that integrates deep learning with traditional machine learning techniques. This method employs pre-trained models such as ResNet-50 and VGG-16 for robust feature extraction, coupled with the K-nearest neighbors (KNN) algorithm to calculate image similarity. Demonstrating significant improvements in retrieval accuracy, this hybrid system has been effectively applied in diverse domains, including digital libraries and crime prevention, proving its versatility and potential impact.

In addition to global features, shape information extracted through edge detection methods serves as a crucial visual feature in CBIR. The quality of edge detection plays a pivotal role in determining retrieval performance. For instance, the Canny edge detector [11] is widely recognized for its effectiveness in reducing error rates, accurately localizing edge points, and providing a single response per edge. Despite its advantages, the method often faces challenges with noise suppression and preserving low-intensity edges, which can impact the overall retrieval accuracy. While edge detection remains a critical component of CBIR, it is insufficient to assume that detected edges alone can semantically represent objects and backgrounds within images. To address these limitations, integrating segmentation models has emerged as a promising approach. Segmentation models enable the capture of structural information by delineating objects and backgrounds, thereby offering a more comprehensive feature set for effective retrieval. This integration underscores the potential of utilizing segmentation models as a foundational method for extracting structural information from images.

In OBIR, the retrieval process relies on local features extracted by object detection models, such as Faster R-CNN. This approach represents images based on object-specific information while excluding background elements, making it particularly suitable for applications that require attention to specific labeled objects. However, these methods typically depend on predefined object categories and are optimized for single-object datasets. As a result, their performance degrades when objects are partially occluded, unlabeled, or not included in the training data. Mask R-CNN, as an extension of Faster R-CNN, enhances object detection by introducing a mask prediction branch for each region of interest (RoI), allowing it to produce precise instance-level segmentation. Nevertheless, it still shares the core limitation of relying on class-specific supervision. This constraint reduces its ability to generalize to unseen categories or background elements, which limits its applicability in open-domain retrieval settings.

To address these limitations, we adopt the SAM, a recently proposed segmentation framework that generates class-agnostic segmentation masks. Unlike conventional segmentation models that require annotated labels, SAM is trained to generalize to arbitrary objects and regions, regardless of whether they were part of the training distribution. This capability makes SAM particularly suitable for our structure-oriented image retrieval framework, which focuses on shape and layout similarity rather than categorical identity. By using SAM, we can capture structural cues from both foreground and background regions, enabling the retrieval of semantically or visually similar images without being restricted to known object classes.

Region weighting strategies have been explored in various content-based image retrieval systems. One of the earliest examples is the integrated region matching (IRM) model [12], which assigns importance to regions based on their area, under the assumption that larger regions carry more semantic content. Similarly, a variable region-weight assignment approach was proposed to filter out less relevant regions based on their estimated significance [13]. More recent methods incorporate saliency cues [14], merge regions and compute significance indices [15], or apply query-dependent attention weighting [16]. However, these methods often rely on pre-trained saliency or detection models, handcrafted merging heuristics, or class-specific labels.

Recent studies have further explored integrating structural segmentation with global feature representations. For instance, a global saliency-weighted descriptor was proposed that emphasizes visually prominent regions [14]. A query-sensitive co-attention mechanism was introduced to adaptively align features between queries and candidates. More recently, a dual-branch fusion network was employed to separately encode and merge local and global representations for fine-grained image retrieval [17]. While these approaches demonstrate promising performance, they often rely on supervised attention training or task-specific fusion designs, which may limit their generalizability and interpretability across domains.

3. Proposed Method

This section outlines the proposed method for extracting structural features and leveraging them for image retrieval. As depicted in Fig. 1, the process begins by obtaining segmentation masks through the SAM using a query image. These masks enable segmentation to identify objects and background regions within the image. While these masks provide valuable insights into individual image components, it is challenging to determine which masks best represent the structural information.

To address this, we propose a method for overlapping and merging masks. Each mask is represented as a matrix corresponding to a segmented object. However, a straightforward merging of mask matrices fails to distinguish between significant and less significant objects. To solve this issue, we assign varying weight levels to each mask based on the size of the segmented region. As illustrated in Fig. 1, the weighted mask matrices are arranged in descending order, with more weight allocated to larger masks. Overlapping these weighted mask matrices then generates a structural feature termed the structural element descriptor (SED).
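The mask-overlay step described above can be sketched in a few lines of NumPy. The linear rank-based weight assignment and the choice to paint smaller masks last (so they remain visible on top of larger ones) are illustrative assumptions; the text specifies only that larger masks receive higher weights.

```python
import numpy as np

def build_sed(masks):
    """Overlay SAM-style binary masks into a single gray-scale SED.

    Masks are sorted by area so that larger regions receive higher
    weights; smaller masks are painted last and so stay visible on top.
    The linear rank-based weighting is an illustrative assumption.
    """
    ordered = sorted(masks, key=lambda m: m.sum(), reverse=True)
    n = len(ordered)
    sed = np.zeros(ordered[0].shape, dtype=np.float32)
    for rank, mask in enumerate(ordered):
        weight = (n - rank) / n      # largest mask -> weight 1.0
        sed[mask] = weight           # later (smaller) masks overwrite
    return sed

# toy example: a small object mask nested inside a larger background mask
big = np.zeros((8, 8), dtype=bool)
big[1:7, 1:7] = True                 # 36 pixels
small = np.zeros((8, 8), dtype=bool)
small[3:5, 3:5] = True               # 4 pixels
sed = build_sed([small, big])
```

Because the masks are re-sorted internally, the order in which SAM emits them does not matter; only their areas determine the gray levels in the resulting SED.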

Fig. 1.
Structure elements descriptor generation.

In our framework, we adopt a region size-based weighting strategy inspired by earlier work such as the IRM model [12], which assumes that larger regions convey more semantic importance. Unlike more recent alternatives that incorporate pre-trained saliency models [14], semantic significance estimation [15], or attention mechanisms requiring class-specific supervision [16], our approach remains simple, interpretable, and fully class-agnostic. By weighting segmentation masks solely based on their area, our method naturally aligns with the multi-mask outputs of SAM and avoids reliance on external models or handcrafted heuristics, making it especially suitable for generic, structure-focused image retrieval.

While the SED effectively captures structural information, it lacks color and texture details, which are essential for image understanding. To complement the SED, we integrate it with the query image, as depicted in Fig. 2, enabling a comprehensive representation of both structural and visual aspects of the image. This blending strategy proves particularly effective when handling diverse images, whether they focus on objects or backgrounds. To enhance flexibility, we introduce a user-controllable similarity search method that allows users to adjust the relative importance of structural and global features based on their specific needs. When integrating the SED and query image features to form a composite query, a mixed ratio is specified to assign weights to each feature type. The ratio parameter [TeX:] $$w_1$$ represents the proportion of SED features, while [TeX:] $$w_2=1.0-w_1$$ denotes the weight of the original query image features. This weighted strategy provides a versatile framework for tailoring the similarity search process to accommodate the diversity and complexity of image datasets.

To implement feature vector extraction for the combined query, we utilize a VGG model pre-trained on the ImageNet dataset. This model is well-suited for extracting common global features from images due to its robust feature learning capabilities. By removing the classification head of the VGG-16 model, the remaining layers function as a feature extractor, providing global descriptors of the visual characteristics of the image. Using this VGG-based feature extraction, the dataset images are processed to generate feature vectors for each image. For the retrieval process, the feature vector of the combined query is computed and compared with the pre-computed VGG-based feature vectors of the database images, using cosine similarity as the metric. Based on the similarity scores, the database images are ranked, with the highest-scoring images at the top, providing a prioritized list of results that align closely with the characteristics of the query image.
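Given descriptors that have already been extracted with the VGG backbone, the cosine-similarity ranking step reduces to a minimal NumPy sketch (the function and variable names here are ours, not the paper's):

```python
import numpy as np

def rank_by_cosine(query_vec, db_vecs):
    """Rank database images by cosine similarity to the query descriptor.

    db_vecs is an (N, D) matrix of pre-computed feature vectors.
    Returns (indices sorted from most to least similar, similarity scores).
    """
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q                    # cosine similarity per database image
    return np.argsort(-sims), sims

# tiny demo with 2-D "descriptors"
db = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
order, sims = rank_by_cosine(np.array([1.0, 0.0]), db)
```

Normalizing both sides up front means the dot product directly yields the cosine score, which is the standard trick for large descriptor databases.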

Fig. 2.
Combined query-based image retrieval through structure elements descriptor.
3.1 Blending and Weighted Strategy

Algorithm 1 describes the procedure for generating an integrated feature vector from a query image and an optional SED image using a pre-trained VGG model. The method is designed to flexibly incorporate both global visual features and structural details, allowing for customizable image retrieval tasks.

The parameter img_path refers to the file path of the input query image, which serves as the source for extracting global features. The optional parameter SED_image represents a 2D gray-scale SED that provides additional structural information. When included, this input is blended with the global features of the query image to enhance the feature representation. The parameter SED_ratio is a float value that determines the relative contribution of the query image and SED features in the final combined representation. By default, this parameter is set to 0.5, assigning equal weight to both inputs. It can be adjusted to emphasize either the global visual features or the structural details, depending on the specific requirements of the retrieval task.

The output of Algorithm 1 is a feature vector generated from the mixed image. This vector encapsulates both the visual attributes of the query image and the structural information derived from the SED image, resulting in a richer and more adaptable representation. This flexible approach ensures that the generated feature vector is suitable for a variety of image retrieval scenarios, optimizing the search process for diverse datasets and user preferences. By integrating global and structural features, our method enhances the capability to handle images with varying levels of complexity and detail.
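Under assumed names, the procedure of Algorithm 1 can be sketched as follows. The `extract` callable stands in for the VGG-16 backbone with its classification head removed; injecting it keeps the sketch framework-agnostic and is our simplification, not the paper's code.

```python
import numpy as np

def algorithm1_feature(query_img, extract, sed_image=None, sed_ratio=0.5):
    """Sketch of Algorithm 1: blend the query image with an optional
    SED image at the given ratio, then extract a feature vector.

    query_img : normalized gray-scale array in [0, 1]
    extract   : stand-in for the headless VGG-16 feature extractor
    sed_image : optional 2D gray-scale SED of the same shape
    sed_ratio : weight w1 given to the SED; the query gets w2 = 1 - w1
    """
    if sed_image is None:
        mixed = query_img                       # fall back to global features only
    else:
        mixed = sed_ratio * sed_image + (1.0 - sed_ratio) * query_img
    return extract(mixed)
```

In practice `extract` would resize the blended image to the network's input resolution and return the flattened output of the last convolutional block; any backbone with a fixed-length output works the same way.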

4. Experiments

In this section, we present experiments designed to evaluate the effectiveness of the proposed method, which integrates dense global descriptors with structural information. The experiments were conducted on three diverse image datasets, each representing different scales and types of objects within their images. The first dataset, the “Paris” dataset, consists of 4,895 landmark images across 11 classes, captured under varying conditions such as lighting and viewpoint. The second dataset, the “Caltech256” dataset, includes 30,607 object-centric images distributed across 256 classes, representing a broad range of object categories. Finally, the third dataset, the “AID” dataset, contains 10,000 aerial scene images organized into 30 classes, offering a distinct perspective from aerial photography. To evaluate the performance of the proposed image retrieval method, we employed weighted accuracy as the primary evaluation metric. This metric reflects the class consistency of retrieved results while taking their ranking into account. It is defined as follows:

(1)
[TeX:] $$\text { Weighted Accuracy }=\frac{\sum_{i=1}^N\left(\lambda_i \cdot \delta_i\right)}{\sum_{i=1}^N \lambda_i} \times 100 \%$$

where N denotes the number of retrieved images. The parameter [TeX:] $$\lambda_i$$ represents the weight assigned to the i-th retrieved image, with higher-ranked images typically receiving higher weights. The term [TeX:] $$\delta_i$$ is an indicator function that equals 1 if the class label of the i-th retrieved image matches that of the query image and 0 otherwise. In addition to this quantitative evaluation, a qualitative evaluation was performed by human reviewers, who assessed the structural similarity between the query image and the retrieved results. Together, the quantitative and qualitative evaluations provide a comprehensive measure of retrieval performance, accounting for both class consistency and perceptual similarity.
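Eq. (1) is straightforward to compute. In the sketch below, the linearly decaying default weights (N, N-1, ..., 1) are an assumption for illustration; the text only requires that higher-ranked images receive higher weights.

```python
import numpy as np

def weighted_accuracy(match_flags, weights=None):
    """Weighted accuracy from Eq. (1).

    match_flags[i] is 1 if the i-th retrieved image shares the query's
    class (the delta_i indicator), else 0. weights are the lambda_i;
    by default we assume linearly decaying rank weights N, N-1, ..., 1.
    """
    delta = np.asarray(match_flags, dtype=float)
    n = len(delta)
    if weights is None:
        lam = np.arange(n, 0, -1, dtype=float)   # higher rank -> higher weight
    else:
        lam = np.asarray(weights, dtype=float)
    return float((lam * delta).sum() / lam.sum() * 100.0)
```

With four retrieved images of which the top two match the query's class, the default weights give (4 + 3) / (4 + 3 + 2 + 1) = 70%, whereas unweighted accuracy would report 50%, illustrating how the metric rewards correct results at higher ranks.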

In our experiments, we compared the proposed similarity computation method with a baseline approach using a simple VGG-based computation, referred to as VGG. Unlike the baseline, our method allows for the flexible integration of two distinct feature types: structural information and global descriptors. To achieve this, we introduced two adjustable weights, [TeX:] $$w_1 \text { and } w_2=1.0-w_1,$$ which control the relative contribution of SED and global descriptors in the integrated similarity computation. To investigate the impact of these adjustable weights, we conducted experiments on the Paris dataset with different SED ratios: 0.2 (favoring global descriptors), 0.5 (neutral), and 0.8 (favoring structural features). As shown in Fig. 3, when the SED ratio was set to 0.2, the retrieval performance closely resembled that of the VGG-only baseline. This is likely because the influence of the structural features was minimal, and the strong global characteristics of the original descriptor dominated the similarity evaluation. Conversely, at a ratio of 0.8, the performance dropped significantly. We observed that retrieval results failed to match the query image's visual structure effectively, likely due to the insufficient contribution of the original global descriptor, which plays a crucial role in encoding color and texture. Based on these observations, we set the SED ratio to a neutral value of 0.5 in our main experiments to ensure a balanced contribution of both global and structural features. To quantitatively support these observations, we report the weighted accuracy for each blending ratio on the Paris dataset: 75.5% for a ratio of 0.2, 82.9% for 0.5, and 55.1% for 0.8, confirming that the neutral configuration (0.5) yields the most effective balance between global and structural cues.

Fig. 3.
Retrieval results with varying SED blending ratios (0.2, 0.5, 0.8) compared to the VGG-only baseline.
4.1 Landmark Image Retrieval

As shown in Fig. 4, the image retrieval results for the Sacré-Coeur query image reveal notable differences between the VGG-only method and the proposed SED-combined approach. The VGG-only method achieved an accuracy of 50% for retrieving images of the same class. It predominantly retrieved nighttime photos with bright buildings and dark backgrounds, indicating a reliance on color similarity. Additionally, the VGG-only method ranked images of other cathedrals and even the Eiffel Tower, which differed structurally from the query image, highlighting its inability to effectively capture structural features. In contrast, the SED-combined method achieved 100% accuracy for the Sacré-Coeur query image, retrieving a balanced mix of daytime and nighttime photos while ranking images with structural features closely resembling those of the query image. These results highlight the advantage of the SED-combined approach in enabling the retrieval of not only visually similar images but also those with diverse and distinct characteristics. Fig. 5 presents the same retrieval results using masked images, which highlight the structural information more clearly.

Fig. 4.
The results of image retrieval for Sacré-Coeur image.
Fig. 5.
The results of masked image retrieval for Sacré-Coeur image.
4.2 Object Image Retrieval

As shown in Fig. 6, the VGG-only method, when given an octopus image as a query, predominantly ranked starfish images, reflecting its reliance on general visual similarity rather than specific structural features. In contrast, the SED-combined method successfully ranked octopus images for the majority of the results, demonstrating its ability to capture and prioritize the characteristics of the query image. This highlights the superiority of the SED-combined approach in aligning retrieval outcomes more closely with the defining features of the query. For both the Paris and Caltech256 datasets, structural information played a crucial role in improving the performance of similar image search.

Fig. 6.
The results of image retrieval for octopus image.
4.3 Aerial Scene Image Retrieval

As shown in Fig. 7, the VGG-only method, when given an airport image as a query, predominantly ranked images with similar color and texture, such as runways and buildings, rather than focusing on structural elements like airplanes. In contrast, the SED-combined method struggled to retrieve relevant airport images. This outcome was influenced by the use of an ascending order in the weighting strategy, which assigns higher weights to smaller objects. While this approach aimed to emphasize airplanes, which are relatively small objects in the query image, the actual segmentation masks in Fig. 8 revealed numerous objects even smaller than the airplane. As a result, the weights were concentrated on these smaller objects, making it difficult to rank similar airport images effectively.

Fig. 7.
The results of image retrieval for airport image.
Fig. 8.
The results of mask image retrieval for airport image.

Additionally, limitations in the ability of the segmentation model to detect small objects were evident. For instance, while the airplane was successfully identified in the mask of the query image, none of the retrieved images included airplanes in their segmentation masks, even when airplanes were clearly present in the source images. This discrepancy highlights a critical limitation: when segmentation models fail to reliably identify small but significant objects, the effectiveness of structural information in enhancing image retrieval is significantly diminished.

For the three datasets with distinct characteristics, we conducted experiments to evaluate the effectiveness of combining structural information with global features (Table 1). In the Paris and Caltech256 datasets, qualitative evaluations confirmed that integrating structural information enhances the performance of similar image searches. The addition of the SED proved effective in capturing the most defining objects and background regions within the images, improving retrieval relevance and accuracy. However, for the AID dataset, the structural information did not provide significant benefits.

Table 1.
Comparison of image retrieval weighted accuracy using VGG and SED combined

In the quantitative evaluation, five images per class were used as query images, and a database of 50 images per class was employed for the image retrieval task. For the Paris and Caltech256 datasets, the addition of SED combined with global features demonstrated a clear improvement in retrieval performance over the VGG-only method. This suggests that structural features, particularly in landmark and object-centric datasets, provide valuable complementary information that enhances the ability of models to correctly identify and rank similar images (Table 2).
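Precision@k, the metric reported in Table 2, reduces to a few lines; averaging it over the five queries per class reproduces the evaluation setup described above. The function name is ours, used for illustration.

```python
def precision_at_k(retrieved_labels, query_label, k):
    """Precision@k in percent: the fraction of the top-k retrieved
    images whose class label matches the query's label."""
    top = retrieved_labels[:k]
    return 100.0 * sum(1 for lbl in top if lbl == query_label) / k

def mean_precision_at_k(per_query_results, k):
    """Average Precision@k over multiple (retrieved_labels, query_label)
    pairs, e.g. the five query images used for each class."""
    scores = [precision_at_k(labels, q, k) for labels, q in per_query_results]
    return sum(scores) / len(scores)
```

Unlike the weighted accuracy of Eq. (1), Precision@k treats every position in the top-k equally, so the two metrics together separate "how many results are correct" from "how early they appear."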

However, the AID dataset, which contains complex aerial images, presented a different challenge. The structural information, particularly that derived from the SED method, was less effective in improving retrieval performance compared to the global VGG features. This can be attributed to the difficulty in detecting smaller or obscured objects in aerial imagery, which limited the potential benefits of incorporating structural information. As a result, the SED-combined method performed worse on the AID dataset than the VGG-only method, highlighting the importance of dataset characteristics in determining the effectiveness of structural feature integration.

Table 2.
Performance comparison using Precision@5 and Precision@10 (%)
4.4 Ablation Studies

To further justify our use of SAM over traditional segmentation models, we present a visual comparison with Mask R-CNN using the same query image in Fig. 9. As illustrated, SAM generates dense, class-agnostic masks capturing both foreground and background structures, whereas Mask R-CNN only highlights predefined objects such as “dog” and “person.” This comparison shows the limitations of category-dependent models in structure-oriented retrieval tasks.

To evaluate the effectiveness of the proposed region-size-based mask weighting strategy, we conducted an ablation study comparing it with a baseline saliency-based approach. In our framework, each mask produced by the SAM corresponds to a distinct object or region. We assign a weight to each mask proportional to its area, based on the assumption that larger regions are more likely to represent semantically and structurally prominent parts of the image. This strategy is simple, interpretable, and fully compatible with the class-agnostic nature of SAM.

In contrast, saliency-based approaches typically generate a single heatmap that highlights visually dominant areas. While such methods are effective in visual classification or class-specific localization tasks, they are not designed to assign independent weights to multiple segmented masks. To adapt saliency weighting to our setting, we applied a vanilla saliency method to the input image and calculated the average saliency value within each mask region as its weight. However, this approach revealed several limitations:

· Lack of instance-level discrimination: The saliency map provides global attention cues but cannot distinguish between individual segmented regions.

· Semantic bias: Saliency estimation is influenced by class-related visual features, which conflicts with our unsupervised, structure-oriented retrieval goal.

· Unstable weighting: Small but visually intense regions may receive disproportionately high weights, while large structural components may be underweighted.
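The contrast between the two weighting schemes can be made concrete with a toy example. The mask layouts and saliency values below are fabricated purely for illustration, and the per-mask mean-saliency rule matches the adaptation of the vanilla saliency method described above.

```python
import numpy as np

def area_weights(masks):
    """Region-size weighting: each mask's weight is proportional to its area."""
    areas = np.array([m.sum() for m in masks], dtype=float)
    return areas / areas.sum()

def saliency_weights(masks, saliency_map):
    """Ablation baseline: each mask is weighted by the average saliency
    inside its region, then normalized across masks."""
    means = np.array([saliency_map[m].mean() for m in masks])
    return means / means.sum()

# toy scene: one large structural region and one tiny but very salient one
large = np.zeros((8, 8), dtype=bool)
large[0:5, 0:6] = True               # 30 pixels, low saliency
tiny = np.zeros((8, 8), dtype=bool)
tiny[6:8, 6:8] = True                # 4 pixels, high saliency
sal = np.full((8, 8), 0.1)
sal[tiny] = 0.9

aw = area_weights([large, tiny])     # dominated by the large region
sw = saliency_weights([large, tiny], sal)  # dominated by the tiny region
```

The toy numbers reproduce the "unstable weighting" failure mode: the saliency rule hands most of the weight to the 4-pixel region, while the area rule keeps it on the large structural component.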

As shown in Fig. 10, our region-size-based strategy yields a more balanced and structurally relevant distribution of weights across masks, which aligns more effectively with the objectives of structure-aware, class-agnostic image retrieval. These findings validate our design choice and clarify why direct comparison with saliency-based methods is not only impractical but also misaligned with the core philosophy of our framework.

Fig. 9.
Comparison between SAM and Mask R-CNN segmentation outputs.
Fig. 10.
Comparison between saliency map and region-wise mask weighting.

5. Conclusion

This study introduced a novel image retrieval approach that integrates global descriptors with SED. The incorporation of structural information improves the ability to analyze aspects of images that traditional global descriptors often fail to capture, particularly in defining object shapes and their surrounding background. By combining the VGG-based model with SED, our approach enhances the ability to distinguish images with similar shapes across diverse classes and contributes to the optimization of image retrieval systems. However, a key limitation of the proposed method arises when handling images containing small or partially obscured objects, particularly in complex or cluttered backgrounds, where the segmentation model may struggle to identify relevant features.

Future work will focus on refining methods for controlling the integration of structural information with global features. Specifically, we plan to conduct a more detailed analysis of how the combined features affect image retrieval across various image types, including object-centric, background-centric, and hybrid images that contain both object and background elements. By further optimizing the integration of these features, we aim to enhance the accuracy and flexibility of image retrieval systems in real-world applications. Moreover, we aim to extend our framework by incorporating saliency- or attention-guided weighting strategies alongside object detection models. This would allow the system to selectively emphasize semantically meaningful regions while maintaining the flexibility of structural retrieval, potentially improving performance in object-centric tasks and complex scenes.

Conflict of Interest

The authors declare that they have no competing interests.

Funding

This research was supported by the National Research Council of Science & Technology (NST) grant by the Korea government (MSIT) (No. GTL24031-000) and the research program of KISTI (Korea Institute of Science and Technology Information).

Biography

Gunho Lee
https://orcid.org/0000-0003-3400-1901

He received the bachelor’s degree in statistics from Iowa State University, Iowa, USA, in 2017. He is currently pursuing the Ph.D. degree in data and HPC with University of Science and Technology. He is also a student researcher with Korea Institute of Science and Technology Information. His research interests include image retrieval and anomaly detection using machine learning/deep learning algorithms.

Biography

Minjoong Jeong
https://orcid.org/0000-0003-4683-4345

He received the Ph.D. degree in frontier sciences (mechanical design) from the University of Tokyo, Tokyo, Japan, in 2004. He is currently a leader of the Department of Supercomputing Acceleration Research, Korea Institute of Science and Technology Information (KISTI). He is also a professor with the University of Science and Technology (UST), South Korea. His research interests include multiobjective/ multicriterion optimization, data clustering/pattern recognition using machine learning, and evolutionary algorithms.

References

  • 1 S. R. Dubey, "A decade survey of content-based image retrieval using deep learning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 5, pp. 2687-2704, 2022. https://doi.org/10.1109/TCSVT.2021.3080920
  • 2 V. Mezaris, I. Kompatsiaris, and M. G. Strintzis, "An ontology approach to object-based image retrieval," in Proceedings of the 2003 International Conference on Image Processing, Barcelona, Spain, 2003. https://doi.org/10.1109/ICIP.2003.1246729
  • 3 P. Jiang, D. Ergu, F. Liu, Y. Cai, and B. Ma, "A review of Yolo algorithm developments," Procedia Computer Science, vol. 199, pp. 1066-1073, 2022. https://doi.org/10.1016/j.procs.2022.01.135
  • 4 K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 2961-2969. https://doi.org/10.1109/ICCV.2017.322
  • 5 H. Qassim, A. Verma, and D. Feinzimer, "Compressed residual-VGG16 CNN model for big data places image recognition," in Proceedings of 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 2018, pp. 169-175. https://doi.org/10.1109/CCWC.2018.8301729
  • 6 A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, et al., "Segment anything," in Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2023, pp. 3992-4003. https://doi.org/10.1109/ICCV51070.2023.00371
  • 7 D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91-110, 2004. https://doi.org/10.1023/B:VISI.0000029664.99615.94
  • 8 G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray, "Visual categorization with bags of keypoints," 2004 (Online). Available: https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/csurka-eccv-04.pdf
  • 9 S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, "Image segmentation using deep learning: a survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 7, pp. 3523-3542, 2022. https://doi.org/10.1109/TPAMI.2021.3059968
  • 10 J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, and J. Li, "Deep learning for content-based image retrieval: a comprehensive study," in Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA, 2014, pp. 157-166. https://doi.org/10.1145/2647868.2654948
  • 11 Y. Dong, M. Li, and J. Li, "Image retrieval based on improved Canny edge detection algorithm," in Proceedings of the 2013 International Conference on Mechatronic Sciences, Electric Engineering and Computer (MEC), Shenyang, China, 2013, pp. 1453-1457. https://doi.org/10.1109/MEC.2013.6885296
  • 12 J. Z. Wang, J. Li, and G. Wiederhold, "SIMPLIcity: semantics-sensitive integrated matching for picture libraries," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 9, pp. 947-963, 2001. https://doi.org/10.1109/34.955109
  • 13 G. Raghuwanshi and V. Tyagi, "A novel technique for content based image retrieval based on region-weight assignment," Multimedia Tools and Applications, vol. 78, no. 2, pp. 1889-1911, 2019. https://doi.org/10.1007/s11042-018-6333-6
  • 14 H. Zhao, J. Wu, D. Zhang, and P. Liu, "Toward improving image retrieval via global saliency weighted feature," ISPRS International Journal of Geo-Information, vol. 10, no. 4, article no. 249, 2021. https://doi.org/10.3390/ijgi10040249
  • 15 F. Meng, D. Shan, R. Shi, Y. Song, B. Guo, and W. Cai, "Merged region based image retrieval," Journal of Visual Communication and Image Representation, vol. 55, pp. 572-585, 2018. https://doi.org/10.1016/j.jvcir.2018.07.003
  • 16 Z. Hu and A. G. Bors, "Co-attention enabled content-based image retrieval," Neural Networks, vol. 164, pp. 245-263, 2023. https://doi.org/10.1016/j.neunet.2023.04.009
  • 17 J. Zhang, D. Feng, Z. Wu, and H. Liu, "Dual-branch feature fusion vision transformer for garbage image classification," in Proceedings of 2023 8th International Conference on Computer and Communication Systems (ICCCS), Guangzhou, China, 2023, pp. 950-955. https://doi.org/10.1109/ICCCS57501.2023.10150982

Table 1.

Comparison of image retrieval weighted accuracy (%) using VGG alone and VGG combined with SED

Dataset      VGG    SED combined
Paris        74.6   82.9
Caltech256   80.8   84.2
AID          44.5   42.7

Table 2.

Performance comparison using Precision@5 and Precision@10 (%)

             Precision@5           Precision@10
Dataset      VGG    SED combined   VGG    SED combined
Paris        68.0   76.8           70.1   79.4
Caltech256   73.6   78.2           77.2   81.9
AID          51.2   39.2           52.5   40.3
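The Precision@k metric reported in Table 2 is the fraction of the top-k retrieved images that are relevant to the query. A minimal sketch follows; the ranking and relevance labels used in the example are hypothetical, not taken from the experiments.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    relevant = set(relevant_ids)
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / k

# Hypothetical ranking: items 2, 4, 1, 6, 5 share the query's class
ranked = [7, 2, 9, 4, 1, 8, 3, 0, 6, 5]
relevant = {2, 4, 1, 6, 5}
print(precision_at_k(ranked, relevant, 5))   # 3 of top-5 relevant -> 0.6
print(precision_at_k(ranked, relevant, 10))  # 5 of top-10 relevant -> 0.5
```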
Structure elements descriptor generation.
Combined query-based image retrieval through the structure elements descriptor.
Blending and weighting strategy.
Retrieval results with varying SED blending ratios (0.2, 0.5, 0.8) compared to the VGG-only baseline.
The results of image retrieval for the Sacré-Coeur image.
The results of masked image retrieval for the Sacré-Coeur image.
The results of image retrieval for the octopus image.
The results of image retrieval for the airport image.
The results of masked image retrieval for the airport image.
Comparison between SAM and Mask R-CNN segmentation outputs.
Comparison between saliency map and region-wise mask weighting.