1. Introduction
Knowledge graphs (KGs), such as NELL [1], Wikidata [2], and Freebase [3], represent factual information as structured triples of the form (head entity, relation, tail entity). These KGs have demonstrated significant success in numerous downstream applications [4], including recommendation systems [5], information retrieval [6], and question answering [7,8]. Although existing KGs are vast, they often suffer from incompleteness, meaning many potential facts are missing. Consequently, a popular approach is knowledge graph completion (KGC), which endeavors to infer unobserved facts from the established ones.
Knowledge embedding methods have become a dominant technique for KGC [9, 10]. While successful, these methods typically require substantial training examples for each relation. However, real-world KGs commonly exhibit a pronounced long-tail relation distribution, in which the majority of relations have only a handful of observed triples [11, 12]. This data scarcity severely degrades the performance of traditional KGC methods on tail relations. Thus, effectively completing KGs under such limited-data conditions remains a crucial yet challenging research problem.
To tackle this issue, few-shot knowledge graph completion (FKGC) has emerged as a prominent research direction attracting significant scholarly interest. FKGC endeavors to predict absent entities (primarily tail entities) in query triples by leveraging limited available examples, commonly referred to as reference triples. Contemporary FKGC frameworks predominantly employ metric-based approaches or meta-learning paradigms. A substantial limitation shared across these methodologies is their insufficient capability to handle noisy reference triples during the process of relation embedding learning. Given the extremely small number of reference triples available, any noisy examples can disproportionately impact the learned representations.
To address this issue, our attention-based meta-relational (ATMR) framework utilizes a relational learner to generate a unique feature representation for each reference triple. Then, an attention-based weighting framework is deployed to ascertain the relative significance of individual reference triples within the learning process, assigning higher weights to informative examples while mitigating the impact of potential noise. This process yields robust final few-shot relation representations. Ultimately, we leverage TransE [13], an established knowledge graph representation framework, to evaluate candidate triples and optimize the integrated model architecture. Exhaustive performance assessment confirms that ATMR consistently surpasses current state-of-the-art methodologies across key evaluation metrics.
The main contributions of this article are as follows:
1) We design an attention allocation mechanism that successfully diminishes the interference from inconsistent reference triples while strategically enhancing the influence of dependable exemplars.
2) We propose an attention-based relational learning model for FKGC that integrates relational learning with attention mechanisms, enabling more robust few-shot relation representations.
3) Comprehensive experiments confirm the effectiveness of our proposed framework relative to state-of-the-art methods. Notably, on the NELL-One dataset for 5-shot KGC, our model outperforms the strongest baselines by 3.6%, 8.5%, 6.4%, and 0.9% in mean reciprocal rank (MRR), Hits@10, Hits@5, and Hits@1, respectively.
2. Preliminaries
Conventional KGC methods generally embed entities and relations into low-dimensional continuous vector spaces to capture their semantic characteristics. The plausibility of triples is then evaluated using scoring functions based on these embeddings. Contemporary knowledge graph embedding (KGE) techniques largely fall into two frameworks: translational distance models and semantic matching models. Early translational models (e.g., TransE) conceptualize relations as translation operators in the embedding space. While effective for frequent relations, models of this form degrade sharply when a relation has only a few observed triples. Subsequent refinements aimed to address TransE's limitations: TransH [14] introduced relation-specific hyperplane projections, while TransR [15] and TransD [16] further advanced this by mapping entities into distinct relation-specific spaces. These enhancements allow entities to exhibit different characteristics depending on the relation being modeled. Semantic matching techniques, alternatively, evaluate triples according to the latent semantic correspondences encoded in their embeddings. Notable examples include RESCAL [17], which employs tensor factorization; DistMult [18], which improves computational efficiency by restricting relation matrices to diagonal form; and ComplEx [19], which introduces complex-valued representations to better model asymmetric relations. Nevertheless, these KGE frameworks share a fundamental limitation: they depend on abundant training instances per relation, which substantially compromises their effectiveness in the sparse-data conditions of few-shot learning.
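To make the two families concrete, the following NumPy sketch shows the scoring functions of the three models named above (TransE, DistMult, ComplEx). Embedding dimensions and variable names are illustrative, not taken from any particular implementation.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE: plausibility as negative translation distance ||h + r - t||."""
    return -np.linalg.norm(h + r - t)

def distmult_score(h, r, t):
    """DistMult: bilinear score with a diagonal relation matrix."""
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    """ComplEx: Re(<h, r, conj(t)>) over complex-valued embeddings,
    which lets the score be asymmetric in h and t."""
    return float(np.real(np.sum(h * r * np.conj(t))))
```

For real-valued embeddings, ComplEx reduces to DistMult; the asymmetry only appears once the imaginary parts are non-zero.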
Recently, FKGC research has primarily explored two main paradigms: metric learning and meta-learning. Distance-based learning methodologies typically implement comparative network architectures to evaluate the alignment between exemplar (reference) collections and interrogation instances. Gmatching [12] pioneered this for one-shot KGC, computing similarity via neighbor encoding and multi-step matching. Building on this, FSRL [20] incorporated attentive neighbors and LSTM aggregation for few-shot scenarios, while FAAN [21] further enhanced entity representations using relation-specific adaptive neighbors and transformer-based encoders to better capture relational patterns. Meta-learning strategies, in contrast, focus on rapid adaptation, aiming to learn transferable knowledge that can be quickly applied to new, few-shot relations. Key examples include MetaR [11], which employs a meta-learner for relation-specific information alongside gradient-based meta-updates; MetaP [22], introducing a meta-pattern learning framework; and Meta-KGR [23], proposing meta-optimized multi-hop reasoning within a reinforcement learning context. Despite their different mechanisms, a common limitation across many existing methods is that they often overlook the potential impact of noise within the reference triples. This oversight can significantly degrade the quality of the learned representations, especially given the limited number of examples in few-shot settings.
2.1 Problem Formulation
A KG formally represents factual information through a set of triples, as shown in formula (1), with [TeX:] $$\mathcal{E}$$ representing entities and [TeX:] $$\mathcal{R}$$ denoting relations. FKGC aims to predict missing triples involving a few-shot relation r, utilizing only a restricted set of support triples associated with relation r.
The FKGC challenge can be formally articulated as follows: when presented with a few-shot relation r and its corresponding K support triples, as shown in formula (2), the objective is to predict unknown triples in the query set, as shown in formula (3), by learning from the support set. This is referred to as the K-shot KGC.
Following the standard setting in FKGC, we first identify a set of few-shot relations [TeX:] $$\mathcal{R} \text{ from } \mathcal{G}.$$ These relations are then divided into disjoint [TeX:] $$\mathcal{R}_{train }, \mathcal{R}_{valid } \text {, and } \mathcal{R}_{test } .$$ During training, we sample a task relation r from [TeX:] $$\mathcal{R}_{train },$$ then construct its support set [TeX:] $$\mathcal{S}_r$$ and query set [TeX:] $$\mathcal{Q}_r.$$ The model is trained on tasks sampled from [TeX:] $$\mathcal{R}_{train }=\left\{\mathcal{S}_i, \mathcal{Q}_i\right\} .$$ After training, the model is evaluated on [TeX:] $$\mathcal{R}_{test }=\left\{\mathcal{S}_j, \mathcal{Q}_j\right\}.$$ Specifically, given a few-shot relation r in [TeX:] $$\mathcal{R}_{test }$$ and its corresponding support set [TeX:] $$\mathcal{S}_j,$$ the model aims to predict missing facts in the query set [TeX:] $$\mathcal{Q}_j.$$
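The episode construction described above can be sketched as follows; the function and dictionary names are illustrative, and in practice negative query triples are also sampled, which is omitted here.

```python
import random

def sample_task(triples_by_relation, relation, k):
    """Split one few-shot relation's triples into a K-shot support set
    and a disjoint query set (one training task / episode)."""
    triples = list(triples_by_relation[relation])
    random.shuffle(triples)
    support, query = triples[:k], triples[k:]
    return support, query
```

Repeating this sampling over the relations in the training split yields the stream of tasks on which the meta-learner is optimized.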
3. Methodology
The architecture of our proposed ATMR learning framework is illustrated in Fig. 1. The framework is systematically arranged into two fundamental operational phases:
1) Few-shot relation representation: This process utilizes both a relational learner and an attention mechanism to derive a generalized representation for few-shot relations. Specifically, the relational learner extracts relation specific meta-information, while the attention mechanism identifies and highlights more informative reference triples.
2) Triple scoring: This process uses a KGE model to encode triples and score their plausibility. The model is trained using few-shot training instances and subsequently evaluated on the test set to assess its few-shot prediction capability.
Fig. 1. Illustration of ATMR model architecture.
3.1 Generalized Representations for Few-Shot Relations
As visually detailed in the first half of Fig. 1, our approach to learning a generalized representation for a few-shot relation involves two key stages: a “Relational Learner” and an “Attention” module. First, for a given few-shot relation r and its K reference triples, the Relational Learner processes each triple’s head and tail entity pair to extract an initial relation-specific representation (meta-information). Subsequently, the Attention module assesses the informativeness of these individual representations, assigning a weight to each one. This allows the model to prioritize more reliable triples and mitigate the impact of noisy ones when producing the final, aggregated relation representation. This two-stage process ensures a robust representation even with very few examples.
Given a specific few-shot relation r and its corresponding collection of entity pairs [TeX:] $$\left\{\left(h_i, t_i\right)\right\}_{i=1}^K,$$ our first step is to concatenate the embedding vectors of each head entity [TeX:] $$h_i$$ and its respective tail entity [TeX:] $$t_i,$$ as shown in formula (4):
Then, we input the connected entity pair representations, denoted as [TeX:] $$X_0,$$ into our designed relational learner, which then learns the meta-information of relation r, as shown in formula (5) and (6):
Here, W and b denote the weight and bias parameters, respectively, and L is the total number of network layers, with [TeX:] $$l \in\{0, \ldots, L-1\};$$ [TeX:] $$r_i^{\prime}$$ denotes the meta-information of relation r learned from the i-th reference triple. tanh is the hyperbolic tangent activation function, and LayerNorm refers to layer normalization, which stabilizes learning by normalizing the inputs across the feature dimension.
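A minimal NumPy sketch of this relational learner follows. Formulas (5)-(6) are not reproduced in the text, so the exact composition (tanh applied before LayerNorm at every layer, and the layer widths) is an assumption made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance across features."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def relational_learner(h, t, weights, biases):
    """Map one (head, tail) embedding pair to relation meta-information
    r'_i: concatenation (formula (4)) followed by L layers of
    tanh + LayerNorm (formulas (5)-(6), composition assumed)."""
    x = np.concatenate([h, t])           # formula (4): concatenated pair
    for W, b in zip(weights, biases):    # L stacked layers
        x = layer_norm(np.tanh(W @ x + b))
    return x
```

Applying this learner independently to each of the K reference pairs yields the K meta-information vectors that the attention module then weighs.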
Subsequently, we implement the relational attention mechanism that assigns differential weights to meta-information derived from each reference triple. This mechanism helps in selecting more significant information while reducing the impact of noise triples. The attention weight for each [TeX:] $$r_i$$ is computed using the softmax function (7):
where K denotes the total number of reference triples, and [TeX:] $$\alpha_i$$ is the normalized attention weight of the meta-relational information derived from the i-th triple.
Finally, we aggregate the meta-information from all reference triples using a weighted sum, as shown in formula (8):
where [TeX:] $$\mathbf{R}_{T_r}$$ represents the composite representation of the few-shot relation r synthesized from K reference triples. The process of relation learning corresponds to lines 2–5 of Algorithm 1.
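The weighting and aggregation of formulas (7)-(8) can be sketched as below. The text does not specify how the unnormalized attention score for each [TeX:] $$r_i^{\prime}$$ is computed, so scoring each vector by its dot product with the mean representation is an illustrative assumption, not the paper's exact mechanism.

```python
import numpy as np

def aggregate_meta(meta_infos):
    """Softmax-weighted aggregation of per-triple meta-information.
    Relevance scoring (dot product with the mean) is an assumed stand-in
    for the learned attention score; formulas (7)-(8) otherwise."""
    R = np.stack(meta_infos)                 # shape (K, d)
    center = R.mean(axis=0)
    scores = R @ center                      # assumed relevance score per triple
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()              # attention weights (formula (7))
    return (alpha[:, None] * R).sum(axis=0)  # weighted sum (formula (8))
```

Outlier meta-information vectors receive lower scores under this scheme, which is the intended noise-suppression behavior of the attention module.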
3.2 Triple Modeling
After obtaining the generalized representations for few-shot relations, our next step is to construct matching models for triples to evaluate their plausibility. We utilize TransE as the KGE model. Its effectiveness stems from its elegant approach of representing relations as translational vectors between head and tail entity embeddings to determine triple plausibility.
Specifically, our first task is to leverage the TransE model in order to calculate matching scores for triples. This is done as follows:
where [TeX:] $$h_i \text{ and } t_i$$ denote the initialized entity embeddings, and [TeX:] $$\|x\|_2$$ denotes the [TeX:] $$L_2$$ normalization of vector x. The term [TeX:] $$\mathbf{R}_{T_r}$$ encapsulates the integrated representation of the few-shot relation under consideration. We formulate the corresponding objective function as follows:
where γ is a hyperparameter separating positive and negative samples, [TeX:] $$E\left(h_i, t_i\right)$$ is the matching score for positive entity pair [TeX:] $$\left(h_i, t_i\right), \text { and } E\left(h_i, t_i^{\prime}\right)$$ corresponds to the negative entity pair [TeX:] $$h_i, t_i^{\prime},$$ with [TeX:] $$\left(h_i, t_i^{\prime}\right) \notin \mathcal{G} .$$ Thus, we obtain the generalized representation [TeX:] $$\mathbf{R}_{T_r}$$ and the optimization objective for the reference set [TeX:] $$S_r .$$ The process of triple modeling procedure is detailed in Algorithm 1 (lines 6–8).
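The TransE matching score and the margin-based objective above can be sketched as follows; entity vectors are assumed to be pre-normalized, and the summation over the reference set is shown as a plain loop for clarity.

```python
import numpy as np

def transe_energy(h, R_T, t):
    """Matching score E(h, t) = ||h + R_T - t||_2 (lower is more plausible)."""
    return np.linalg.norm(h + R_T - t)

def margin_loss(pos_pairs, neg_pairs, R_T, gamma=5.0):
    """Hinge loss separating each positive pair from its corrupted
    counterpart by at least the margin gamma."""
    loss = 0.0
    for (h, t), (hn, tn) in zip(pos_pairs, neg_pairs):
        loss += max(0.0, gamma + transe_energy(h, R_T, t)
                         - transe_energy(hn, R_T, tn))
    return loss
```

The loss is zero only when every corrupted triple scores at least gamma worse than its positive counterpart, which matches the margin of 5.0 used in the experiments.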
3.3 Optimization and Testing
We adopt the model-agnostic meta-learning (MAML) strategy to optimize the model parameters using the loss of each individual training task [TeX:] $$T_r.$$ The support-set loss [TeX:] $$\mathcal{L}\left(S_r\right)$$ in Eq. (12) is used to update the representation of the few-shot relation r as follows:
where [TeX:] $$l_r$$ represents the learning rate for updating the few-shot relation r.
Having obtained the updated generalized representation of the few-shot relation, we transfer it to each entity pair [TeX:] $$\left(h_j, t_j\right)$$ in the query set [TeX:] $$Q_r$$ and compute its score and loss as follows:
where [TeX:] $$\left(h_j, t_j\right)$$ is the positive entity pair in the query set [TeX:] $$Q_r,$$ while [TeX:] $$\left(h_j, t^{\prime}_j\right)$$ denotes its corresponding negative pair, with [TeX:] $$\left(h_j, t_j^{\prime}\right) \notin \mathcal{G}.$$ [TeX:] $$\mathcal{L}\left(Q_r\right)$$ is the training objective of the whole model. The parameter optimization procedure corresponds to lines 9-12 of Algorithm 1.
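The inner-loop adaptation, one gradient step on the support loss before scoring the query set, can be sketched as below. For a single support pair the loss reduces to the TransE energy; the gradient is estimated by finite differences purely for illustration (the actual model backpropagates analytically).

```python
import numpy as np

def support_energy(R_T, h, t):
    """Support loss for one pair: the TransE energy ||h + R_T - t||."""
    return np.linalg.norm(h + R_T - t)

def adapt_relation(R_T, h, t, lr=0.1, eps=1e-6):
    """MAML-style inner update: one gradient step on the support loss
    yields the task-adapted relation representation used on queries."""
    grad = np.zeros_like(R_T)
    for i in range(R_T.size):           # central finite differences
        d = np.zeros_like(R_T); d[i] = eps
        grad[i] = (support_energy(R_T + d, h, t)
                   - support_energy(R_T - d, h, t)) / (2 * eps)
    return R_T - lr * grad
```

A single step suffices because the relation representation, not the full parameter set, is adapted per task; the outer loop then updates the shared parameters from the query loss.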
Regarding computational complexity, ATMR's profile is comparable to other meta-learning-based FKGC methods like MetaR. The primary cost is driven by the forward passes through the relational learner for each of the K support triples and the meta-optimization inherent to the MAML framework. Our novel attention mechanism introduces only a negligible overhead of O(K · d) for weight computation and aggregation, where K is the number of shots and d is the embedding dimension. The design choice to process each triple individually enables fine-grained noise filtering. Given the small value of K in few-shot settings (e.g., 1 or 5), the linear scaling with K does not pose a practical bottleneck, ensuring ATMR remains efficient for real-world deployment while delivering superior performance.
4. Experiments
4.1 Datasets
We assess our ATMR architecture against well-established methodologies through experimentation on two public knowledge repositories: NELL-One and Wiki-One. Our choice of these datasets is driven by their status as standard benchmarks for the FKGC task, as established by prior leading works including GMatching [12], MetaR [11], and FAAN [21]. This selection ensures that our results are directly and fairly comparable to the state-of-the-art. These datasets are specifically curated for the meta-learning setting of FKGC, providing a necessary partition of relations into training, validation, and test sets.
Following conventional experimental protocols, we treat relations with between 50 and 500 triples in each dataset as few-shot relations. The remaining relations and their triples form the background knowledge. Table 1 summarizes the statistics of the datasets. Following MetaR's partitioning, we organize the few-shot relations into two experimental configurations: Pre-Train (NELL-One: 51/5/11, Wiki-One: 133/16/34 for training/validation/testing) and In-Train, which incorporates the background KG (NELL-One: 321/5/11, Wiki-One: 589/16/34 for training/validation/testing).
Model effectiveness is quantified using two established evaluation metrics: MRR, which calculates the average inverse position of correct triples in ranked lists, and Hits@n (where n = 1, 5, 10), measuring the percentage of correct triples appearing within the top n positions. For both measurement criteria, elevated values indicate enhanced model performance.
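Given the rank assigned to each correct entity, the two metrics reduce to the following computation (a straightforward restatement of their definitions):

```python
def mrr_and_hits(ranks, n=10):
    """MRR: mean of 1/rank of the correct entity over all queries;
    Hits@n: fraction of queries whose correct entity ranks in the top n."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= n) / len(ranks)
    return mrr, hits
```

Both metrics lie in (0, 1], and higher values indicate better ranking performance, as noted above.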
Table 1. Statistical summary of the NELL-One and Wiki-One datasets
4.2 Baselines
We compare ATMR with two categories of baselines. The first comprises traditional KGC methods, including TransE, TransH, DistMult, and ComplEx, all implemented using publicly available source code. The second comprises FKGC methods, specifically GMatching, MetaR, FSRL, and FAAN. We report FAAN results from its original publication. For GMatching, we use its best configuration with ComplEx pre-trained embeddings. FSRL results are taken from [21], following implementation protocols consistent with the other methods. For MetaR, we report results under both the In-Train and Pre-Train configurations.
4.3 Implementation Details
Embedding initialization follows a hybrid protocol: entity vectors utilize Glorot normal distribution with relation-aware scaling, while relation embeddings employ TransE-derived projections enhanced with hyperbolic tangent transformations. Following previous works, we configured the model with embedding dimensions of 100 (NELL-One) and 50 (Wiki-One), a margin of 5.0, a batch size of 1024, and the Adam optimizer [24] initiated with a learning rate of 0.001.
4.4 Results
Table 2 compares the performance of all models on NELL-One and Wiki-One. The results show that our proposed ATMR model consistently outperforms the baseline methods.
Notably, our ATMR (In-Train) model achieves a Hits@10 score of 0.522 on the 5-shot task on NELL-One, a significant leap from the 0.437 of the strong MetaR (In-Train) baseline. This improvement of 8.5 percentage points underscores ATMR's capability in predicting challenging long-tail entities. Our model likewise shows clear advantages on the other metrics, with a Hits@5 of 0.428 (vs. 0.350 for MetaR) and a Hits@1 of 0.209 (vs. 0.168 for MetaR).
On Wiki-One, the corresponding improvements are 1.1%, 1.3%, 1.5%, and 2.5% on the same metrics. For the more challenging 1-shot KGC task, ATMR exhibits even more substantial advantages: gains on NELL-One reach 5.8% (MRR), 7.5% (Hits@10), 6.7% (Hits@5), and 4.0% (Hits@1), while on Wiki-One the model delivers consistent gains of 1.7%, 1.7%, 1.2%, and 1.7%. In particular, the 6.4% Hits@5 improvement supports our hypothesis that the attention mechanism suppresses the interference of noisy reference triples, thereby enhancing FKGC performance.
Table 2. Performance comparison of different models on NELL-One and Wiki-One
4.5 Ablation Study
We performed comprehensive comparative analyses across multiple model variants to evaluate the contribution of individual components within our proposed ATMR architecture. Table 3 presents a detailed examination of these architectural variations and their corresponding performance metrics.
1) MRL: When we substitute our relational learner with the one from MetaR (denoted as -MRL), we observe a notable drop in overall performance, particularly in precision-oriented metrics such as MRR, Hits@5, and Hits@1. Although Hits@10 increases slightly, the significant decline in the other key metrics demonstrates that our learner, designed to work in synergy with the attention mechanism, is more effective at identifying the most plausible candidates. This validates the superiority of our integrated design.
2) MRL-ATT: We substitute our designed meta-relation learner module with MetaR’s relational learner (-MRL) and remove the attention module (-ATT). With these modifications, the entire model becomes equivalent to MetaR. Performance metrics, sourced from the original MetaR publication, demonstrate the substantial contribution of our attention-based triple selection mechanism.
3) MAML: When the MAML module is removed, there is a notable decrease in model performance. This shows that the meta-learning training strategy of MAML is essential for learning to obtain a generalized few-shot relational representation. This conclusion is consistent with the ablation experiment results of the MetaR model.
Table 3. Performance comparisons of the ablation study for 5-shot KGC on Wiki-One
4.6 Factors That Affect ATMR’s Performance
The performance of ATMR is influenced by several critical factors beyond its architectural components. Entity sparsity emerges as a primary determinant: our experimental analysis reveals distinct optimal configurations for NELL-One and Wiki-One owing to their divergent entity sparsity. Specifically, the proportion of entities appearing in only a single training triple differs markedly: 37.1% for NELL-One versus 82.8% for Wiki-One. This disparity introduces embedding biases, which are particularly pronounced when an entity is seen in only one triple. On the sparser Wiki-One dataset, using pre-trained KG entity embeddings in the Pre-Train configuration effectively mitigates these biases, yielding superior performance compared to the In-Train setting.
The scale of training tasks represents another crucial factor affecting model performance. In the 5-shot KGC evaluation on NELL-One, training without background data (51 tasks) yields a Hits@10 score of 0.317, while incorporating the background data (321 tasks) significantly improves performance to 0.522. This suggests that expanding the training task pool not only enhances model performance but also helps address entity sparsity. These findings lead to two conclusions: (i) model performance correlates strongly with training task volume, and (ii) pre-trained entity embeddings provide substantial benefits, particularly under extreme data sparsity.
4.7 Case Study
To provide a fine-grained performance analysis, we conduct a case study comparing our ATMR framework with the strong MetaR baseline. Table 4 reports 5-shot KGC results on 11 distinct relations from NELL-One.
The results show that ATMR generally outperforms MetaR, showcasing its robust capability. ATMR’s strength is particularly evident in relations with consistent and well-defined patterns, such as 'producedBy' (company → product) and ‘teamCoach’ (team → coach). For these relations, our attention mechanism can effectively identify the most informative reference triples, filter out noise, and synthesize a highly accurate relation representation, leading to superior performance.
Conversely, MetaR shows a slight advantage on relations that are inherently noisy or represent broad one-to-many mappings. For example, in ‘athleteInjuredHisBodypart,’ reference triples can be highly diverse (e.g., (player A, r, knee), (player B, r, ankle)), making it difficult to identify a single representative pattern. Similarly, ‘sportSchoolIncountry’ is a one-to-many relation with a large number of potential candidates. In these cases, MetaR’s approach of learning a more generalized “average” relation representation may be more robust than ATMR’s strategy of prioritizing a few select examples. Despite these specific cases, the overall superiority of ATMR across a wide range of relations highlights its effectiveness and the value of its attention-based learning approach.
Table 4. Results of ATMR and MetaR for 11 relations on NELL-One
5. Conclusion
This paper presents an attention-based relational learning model for FKGC. We first design a relational learner dedicated to learning the general representation of reference entity pairs. We then introduce an attention module that selects the more relevant of these general representations. Finally, we apply TransE to score triples and update the entire model using a meta-learning training strategy. Extensive experiments on two public benchmarks establish the framework's performance advantages over contemporary methods, and ablation studies and targeted case analyses verify the contribution of each component of the ATMR architecture. Despite its demonstrated effectiveness, ATMR remains sensitive to high-noise data and struggles to capture relations with highly variable patterns. Future work will explore adaptive denoising filters and a meta-learning-based dynamic embedding framework to model such variable relations.
Conflict of Interest
The authors declare that they have no competing interests.
Funding
This work was supported in part by the 2012 School Level Special Project of Yulin Normal University (Grant No. 2012YJZX04, Research on big data storage technology of online shops in the traditional Chinese medicine industry), the 2025 Guangxi University Middle-aged and Young Teachers’ Fund Basic Capacity Enhancement Project (Grant No. 2025KY0676, Research on microbial drug-disease association prediction based on HGNN), and the Guangxi Science and Technology Project for Disease Prevention and Control in 2025 (Grant No. GXJKKJ2025ZC003, Research on cross-domain dynamic correlation prediction and application of microorganisms, drugs and diseases based on multimodal artificial intelligence).