
Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction

FEBRUARY 25, 2025

Ximei Xu, ximeixu79@gmail.com

🐣

In this blog, I will introduce the paper "Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction". This paper presents Extract-Define-Canonicalize (EDC), an LLM-based three-phase framework that tackles knowledge graph construction via open information extraction followed by schema definition and post-hoc canonicalization.

Introduction & Background

This paper proposes an approach to constructing knowledge graphs that leverages large language models (LLMs) in a structured manner.

Knowledge Graph Construction

Open Information Extraction and Canonicalization

Standard (closed) information extraction: It requires the output triplets to conform to a pre-defined schema, e.g., a fixed list of relation or entity types that may be extracted.

Open information extraction (OIE): It imposes no such constraint; entities and relations are extracted freely from the text.

Recent studies have found LLMs to exhibit excellent performance on OIE tasks.

  • Problem: The relational triplets extracted from OIE systems are not canonicalized, causing redundancy and ambiguity in the induced open knowledge graph.
  • Current solutions: An extra canonicalization step is required to standardize the triplets.
    • In case a target schema is present: “Alignment”.
    • In case no target schema is available: Clustering.
      • However, clustering-based methods are prone to over-generalization; for example, they may put “is brother of”, “is son of”, “is main villain of”, and “was professor of” into the same relation cluster.

Extract-Define-Canonicalize (EDC): It is more general than existing canonicalization methods, covering both the case where a target schema is given and the case where one must be created on the fly.

Method

The proposed LLM-based three-phase framework is called Extract-Define-Canonicalize (EDC): open information extraction followed by schema definition and post-hoc canonicalization. Given input text, the goal is to extract relational triplets in a canonical form such that the resulting KGs have minimal ambiguity and redundancy. When a predefined target schema exists, all generated triplets should conform to it; when none is available, the system should dynamically create one and canonicalize the triplets with respect to it. Figure 1 shows an overview of the framework.

Figure 1: A high-level illustration of Extract-Define-Canonicalize (EDC) for Knowledge Graph Construction.

In this section, I will first introduce the EDC framework followed by a description of refinement (EDC+R).

EDC: Extract-Define-Canonicalize

At a high level, EDC decomposes KGC into three connected subtasks: extraction, definition, and canonicalization.
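Before walking through each phase, the sketch below shows the overall control flow as I understand it. The phase functions are passed in as placeholders for the LLM-prompted components described in the rest of this section; none of these names come from the authors' code.

```python
# A minimal, hypothetical sketch of EDC's three-phase control flow.
# extract / define / canonicalize stand in for the LLM-prompted modules.
def edc(text, extract, define, canonicalize, target_schema=None):
    open_triplets = extract(text)               # Phase 1: open information extraction
    definitions = define(text, open_triplets)   # Phase 2: define each schema component
    # Phase 3: align to the target schema if given, else grow one from scratch.
    schema = target_schema if target_schema is not None else {}
    return canonicalize(open_triplets, definitions, schema)
```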

👨🏼‍🚀

In the following discussion, a specific input text example will be used for better illustration: “Alan Shepard was born on Nov 18, 1923 and selected by NASA in 1959. He was a member of the Apollo 14 crew.”

Phase 1: Open Information Extraction

First, large language models (LLMs) are leveraged for open information extraction. Through few-shot prompting, LLMs identify and extract relational triplets ([Subject, Relation, Object]) from input texts, independent of any specific schema. Using the example above, the prompt is:

Figure 2: An example of OIE Prompt.

The resultant triplets form an open KG, which is forwarded to subsequent phases.

👨🏼‍🚀

In this case, the resultant triplets are [‘Alan Shepard’, ‘bornOn’, ‘Nov 18, 1923’], [‘Alan Shepard’, ‘participatedIn’, ‘Apollo 14’].
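To make this step concrete, here is a minimal sketch of how such a few-shot OIE prompt could be built and its output parsed. The prompt wording and the demonstration are my own, not the paper's exact template from Figure 2.

```python
import ast

# Hypothetical few-shot OIE prompt in the spirit of Figure 2.
OIE_PROMPT = """Given a piece of text, extract relational triplets in the
form [Subject, Relation, Object] from it.

Example:
Text: "Marie Curie won the Nobel Prize in Physics in 1903."
Triplets: [['Marie Curie', 'won', 'Nobel Prize in Physics'], ['Marie Curie', 'wonIn', '1903']]

Text: "{text}"
Triplets:"""

def parse_triplets(llm_output: str) -> list[list[str]]:
    """Parse the LLM's bracketed list-of-lists output into Python lists."""
    return ast.literal_eval(llm_output.strip())

prompt = OIE_PROMPT.format(text="Alan Shepard was born on Nov 18, 1923 and "
                                "selected by NASA in 1959. He was a member "
                                "of the Apollo 14 crew.")
# For the running example, the LLM might return:
raw = "[['Alan Shepard', 'bornOn', 'Nov 18, 1923'], ['Alan Shepard', 'participatedIn', 'Apollo 14']]"
print(parse_triplets(raw))
```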

Phase 2: Schema Definition

Next, the LLMs are prompted to provide a natural language definition for each component of the schema induced by the open KG:

Figure 3: An example of Schema Definition Prompt.

The resultant definitions for the schema components are then passed to the next stage as side information used for canonicalization.

👨🏼‍🚀

This example prompt results in the definitions for (bornOn: The subject entity was born on the date specified by the object entity.) and (participatedIn: The subject entity took part in the event or mission specified by the object entity.).
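A hedged sketch of this definition step: for each relation in the open KG, the LLM is asked for a one-sentence definition grounded in the source text. The template is illustrative, not the exact prompt shown in Figure 3.

```python
# Illustrative schema-definition prompt; the wording is an assumption.
DEFINE_PROMPT = """Given a text and a relational triplet extracted from it,
write a one-sentence definition of the relation '{relation}' in terms of
the subject and object entities.

Text: "{text}"
Triplet: {triplet}
Definition:"""

def build_definition_prompts(text, triplets):
    # One prompt per distinct relation in the open KG.
    return {
        rel: DEFINE_PROMPT.format(relation=rel, text=text, triplet=[s, rel, o])
        for s, rel, o in triplets
    }

prompts = build_definition_prompts(
    "Alan Shepard was born on Nov 18, 1923 and selected by NASA in 1959. "
    "He was a member of the Apollo 14 crew.",
    [["Alan Shepard", "bornOn", "Nov 18, 1923"],
     ["Alan Shepard", "participatedIn", "Apollo 14"]],
)
print(prompts["bornOn"])
```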

Phase 3: Schema Canonicalization

The third phase refines the open KG into a canonical form, eliminating redundancies and ambiguities. It starts by vectorizing the definition of each schema component with a sentence transformer to create embeddings. Canonicalization then proceeds in one of two ways, depending on the availability of a target schema:

  • Target alignment: When a target schema is provided, vector similarity search over the definition embeddings retrieves the most related target components as candidates, and the LLM is prompted to decide whether the open component can be transformed into one of them.
  • Self canonicalization: When no target schema is available, EDC starts from an empty canonical schema and consults it in the same way; if the LLM judges no candidate to be a viable transformation, the open component and its definition are added to the schema.

The prompt for the example is:

Figure 4: An example of Schema Canonicalization Prompt.

Note that the choices above are obtained via vector similarity search. After the LLM makes its choice, the relations are transformed to yield new triplets, which form the canonicalized KG.

👨🏼‍🚀

In the example, the new triplets are [‘Alan Shepard’, ‘birthDate’, ‘Nov 18, 1923’], [‘Alan Shepard’, ‘mission’, ‘Apollo 14’].
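The candidate retrieval behind the canonicalization prompt can be sketched as follows. The paper embeds definitions with E5-Mistral-7b; a small off-the-shelf sentence transformer stands in here, and the target-schema definitions are made up for the example.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # lightweight stand-in

# Definition of the open relation 'participatedIn' (from Phase 2).
open_def = ("The subject entity took part in the event or mission "
            "specified by the object entity.")

# Hypothetical definitions for a few target-schema relations.
target_defs = {
    "mission": "The subject entity served on the space mission specified by the object entity.",
    "birthDate": "The subject entity was born on the date specified by the object entity.",
    "occupation": "The subject entity holds the job specified by the object entity.",
}

query = model.encode(open_def, convert_to_tensor=True)
corpus = model.encode(list(target_defs.values()), convert_to_tensor=True)
scores = util.cos_sim(query, corpus)[0]

# The top-ranked target relations become the "choices" in the
# canonicalization prompt (Figure 4), from which the LLM picks one.
ranked = sorted(zip(target_defs, scores.tolist()), key=lambda p: -p[1])
print(ranked)  # 'mission' should rank first for 'participatedIn'
```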

EDC+R: Iteratively Refining EDC with a Schema Retriever

The refinement process leverages the data generated by EDC to enhance the quality of the extracted triplets. The authors construct a “hint” for the extraction phase, which comprises two main elements: candidate entities taken from the previously extracted triplets, and candidate relations drawn from the previous round together with relations retrieved from the schema by a trained schema retriever (described below):

Figure 6: An example of Refined OIE Prompt.

This hint provides a richer pool of candidates for the LLM, mitigating cases where missed entities or relations impair the LLM’s effectiveness; it also aids the OIE by bootstrapping from the previous round.
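A minimal sketch of how such a hint could be assembled from the previous round's canonicalized triplets and the retrieved relations; the field names and layout are assumptions, not the authors' exact template from Figure 6.

```python
# Illustrative hint construction for the refined OIE prompt.
def build_hint(prev_triplets, retrieved_relations):
    entities = sorted({e for s, _, o in prev_triplets for e in (s, o)})
    relations = sorted({r for _, r, _ in prev_triplets} | set(retrieved_relations))
    return (f"Candidate entities: {entities}\n"
            f"Candidate relations: {relations}")

hint = build_hint(
    [["Alan Shepard", "birthDate", "Nov 18, 1923"],
     ["Alan Shepard", "mission", "Apollo 14"]],
    ["selectedByNasa", "occupation"],  # hypothetical retriever output
)
print(hint)
```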

Schema Retriever: To scale EDC to large schemas, a trained schema retriever is employed to search schemas efficiently. It works similarly to vector-space information retrieval: it projects the schema components and the input text into a vector space such that cosine similarity captures the relevance between the two, i.e., how likely a schema component is to be present in the input text.
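In spirit, the retrieval step looks like the sketch below. The actual retriever is a fine-tuned E5-mistral-7b-instruct; a small general-purpose model stands in here, and the schema relations are a made-up subset.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the fine-tuned retriever

text = ("Alan Shepard was born on Nov 18, 1923 and selected by NASA in 1959. "
        "He was a member of the Apollo 14 crew.")
schema_relations = ["selectedByNasa", "birthDate", "mission", "almaMater", "spouse"]

text_emb = model.encode(text, convert_to_tensor=True)
rel_embs = model.encode(schema_relations, convert_to_tensor=True)
scores = util.cos_sim(text_emb, rel_embs)[0]

# The highest-scoring relations are passed into the refinement hint.
top = sorted(zip(schema_relations, scores.tolist()), key=lambda p: -p[1])[:3]
print(top)
```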

👨🏼‍🚀

Back to the example, refinement with the schema retriever adds the following relation to the previous set: [‘Alan Shepard’, ‘selectedByNasa’, ‘1959’]. The relation ‘selectedByNasa’ is rather obscure but was specified in the target schema.

Experiments and Findings

Experimental Setup

  1. Datasets

    This study evaluates EDC using three KGC datasets:

    • WebNLG: It contains 1165 pairs of text and triplets. The schema derived from these reference triplets encompasses 159 unique relation types.
    • REBEL: The authors select a random sample of 1000 text-triplet pairs from it. This subset induces a schema with 200 distinct relation types.
    • Wiki-NRE: The authors sample 1000 text-triplet pairs from it, resulting in a schema with 45 unique relation types.

    These datasets were chosen for their rich variety of relation types. The experiments focus on relations, as they are the only schema component available across all datasets.

  2. EDC Models

    EDC contains multiple modules that are powered by LLMs.

    • OIE module: Since it is the key upstream module that determines the semantic content captured in the KG, the authors tested LLMs of different sizes: GPT-4, GPT-3.5-turbo, and Mistral-7b.
    • Remaining components that require prompting: GPT-3.5-turbo was used.
    • Canonicalization phase: The E5-Mistral-7b model was utilized for vector similarity searches without modifications.
    • Schema Retriever: It is a fine-tuned variant of the sentence embedding model E5-mistral-7b-instruct.
  3. Evaluation Criteria and Baselines
    • Target Alignment

      The authors compare EDC and EDC+R against specialized models trained for each of the datasets:

      • REGEN is the state-of-the-art model for WebNLG.
      • GenIE is the state-of-the-art model for REBEL and Wiki-NRE.

      This paper uses the WebNLG evaluation script, which computes Precision, Recall, and F1 scores for the output triplets against the ground truth in a token-based manner. Following the Named Entity Evaluation methodology, the scores are computed under three matching criteria:

      • Exact: Requires a complete match between the candidate and reference triplet, disregarding the element type (subject, relation, object).
      • Partial: Allows for at least a partial match between the candidate and reference triplet, disregarding the element type.
      • Strict: Demands an exact match between the candidate and reference triplet, including the element types.
    • Self Canonicalization

      For evaluating self-canonicalization performance, comparisons are made with:

      • Baseline Open KG, which is the initial open KG output from the OIE phase.
      • CESI, recognized as a leading clustering-based approach for open KG canonicalization. By applying CESI to the open KG, the authors aim to contrast its performance against canonicalization by EDC.

      The canonicalization output was evaluated manually, focusing on three key aspects that reflect the intrinsic quality of an extracted KG:

      • Precision: Whether the canonicalized triplets remain correct and meaningful with respect to the text, compared to the original OIE triplets.
      • Conciseness: The schema’s brevity, measured by the number of relation types.
      • Redundancy: A redundancy score, defined as the average cosine similarity between each canonicalized relation and its nearest counterpart; low scores indicate that the schema’s relations are semantically distinct (see the sketch after this list).
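For concreteness, here is a toy implementation of the redundancy score as described above, run on placeholder embeddings (the paper computes it over the canonicalized relations' definition embeddings):

```python
import numpy as np

def redundancy_score(embs: np.ndarray) -> float:
    """Average cosine similarity between each relation and its nearest other relation."""
    normed = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)        # exclude self-similarity
    return float(sims.max(axis=1).mean())  # nearest-neighbour average

rng = np.random.default_rng(0)
print(redundancy_score(rng.normal(size=(5, 8))))  # lower = more distinct schema
```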

Results

Figure 7: Performance of EDC and EDC+R on WebNLG, REBEL, and Wiki-NRE datasets against baselines in the Target Alignment setting (F1 scores with ‘Partial’ criteria). EDC+R only performs one iteration of refinement due to diminishing marginal improvement.

Target Alignment

The bar charts in Figure 7 summarize the Partial F1 scores obtained by EDC and EDC+R on all three datasets with different LLMs for OIE compared against the respective baselines. EDC demonstrates performance that is superior to or on par with the state-of-the-art baselines for all evaluated datasets. Comparing the LLMs, GPT-4 emerges as the top performer, with Mistral-7b and GPT-3.5-turbo exhibiting comparable results.

To evaluate the impact of the relations provided by the schema retriever during refinement, the authors conducted an ablation study with GPT-3.5-turbo by removing these relations. The results in Table 1 show that ablating the Schema Retriever leads to a decline in performance. Qualitatively, they find that the schema retriever helps to find relevant relations that are challenging for the LLMs to identify during the OIE stage.

🎓

For example, given the text ‘The University of Burgundy in Dijon has 16,800 undergraduate students’, the LLMs extract [‘University of Burgundy’, ‘location’, ‘Dijon’] during OIE. Although semantically correct, this relation overlooks the more specific relation present in the target schema, namely ‘campus’, for denoting a university’s location. The schema retriever successfully identifies this finer relation, enabling the LLMs to adjust their extraction to [‘University of Burgundy’, ‘campus’, ‘Dijon’].

Self Canonicalization

The authors evaluate EDC’s self-canonicalization performance using GPT-3.5-turbo for OIE. Refinement is omitted in this setting, since in subsequent iterations the self-constructed canonicalized schema would become the target schema. Precision was assessed through a targeted human evaluation: two independent annotators judged the reasonableness of triplet extractions from the given text without prior knowledge of the system’s details, with a high inter-annotator agreement score of 0.94.

The evaluation results and schema metrics are summarized in Table 2.

Conclusion

In this paper, the authors presented EDC, an LLM-based three-phase framework that addresses KGC via open information extraction followed by schema definition and post-hoc canonicalization. Experiments show that EDC and EDC+R extract better KGs than specialized trained models when a target schema is available, and dynamically create a schema when none is provided. The scalability and versatility of EDC open up many opportunities for applications: it allows us to automatically extract high-quality KGs from general text using large schemas like Wikidata, and even to enrich these schemas with newly discovered relations.

Personal Thoughts

References

Paper: Zhang, Bowen, and Harold Soh. "Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction." arXiv preprint arXiv:2404.03868 (2024).

Code: https://github.com/clear-nus/edc