This site uses AI for smell information extraction and image generation. Please note that it may contain errors or inaccuracies.

Smell Extraction Guide

This document explains how to extract smell-related information from The Tale of Genji XML texts.

Overview

The scripts/extract-smells.js script automatically extracts smell descriptions from TEI-XML files of The Tale of Genji and generates structured JSON data.

Requirements

Prerequisites

  • Node.js (v18 or higher recommended)
  • Azure OpenAI API access

Environment Variables

Set the following environment variables in your .env file:

AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4
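
A missing variable is easier to diagnose before the first API call than after it. The following is a minimal sketch of such a check, not code taken from the script: the helper name findMissingEnvVars is hypothetical, and it assumes the .env file has already been loaded into process.env by whatever mechanism the script uses.

```javascript
// Variable names taken from the .env example above.
const REQUIRED_VARS = [
  "AZURE_OPENAI_ENDPOINT",
  "AZURE_OPENAI_API_KEY",
  "AZURE_OPENAI_DEPLOYMENT",
];

// Hypothetical helper: returns the names of variables that are unset or blank.
function findMissingEnvVars(env = process.env) {
  return REQUIRED_VARS.filter((name) => !env[name] || env[name].trim() === "");
}

// Report problems up front instead of failing on the first request.
const missing = findMissingEnvVars();
if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(", ")}`);
}
```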

Usage

Basic Execution

node scripts/extract-smells.js [XML_FILE_PATH] [OUTPUT_JSON_PATH]

Examples

# Default (processes 01.xml)
node scripts/extract-smells.js

# Custom paths
node scripts/extract-smells.js /path/to/source.xml /path/to/output.json
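
Two optional positional arguments with fallbacks can be handled roughly as sketched below. This is an illustration only: resolvePaths is a hypothetical name, and the default file names are assumptions based on the file names mentioned in this guide (the actual default directories are not documented here).

```javascript
const path = require("node:path");

// Assumed defaults: this guide mentions 01.xml and smells-output.json,
// but does not state where the script looks for them.
const DEFAULT_XML = "01.xml";
const DEFAULT_OUTPUT = "smells-output.json";

// argv is expected in process.argv form: [node, script, ...args].
function resolvePaths(argv) {
  const [xmlArg, outArg] = argv.slice(2);
  return {
    xmlPath: path.resolve(xmlArg ?? DEFAULT_XML),
    outputPath: path.resolve(outArg ?? DEFAULT_OUTPUT),
  };
}

// Usage: const { xmlPath, outputPath } = resolvePaths(process.argv);
```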

Extracted Information

The script extracts the following information from each smell description:

Field                    Description                  Example
Book                     Book name                    "demo0"
Smell_Word               Word representing the smell  "aroma"
Smell_Source             Source of the smell          "of freshly baked bread"
Quality                  Quality/nature of smell      "warm", "sweet"
Odour_Carrier            Medium of smell              "through the air"
Evoked_Odorant           Evoked odorant               "cinnamon"
Location                 Location                     "the bakery"
Perceiver                Perceiver                    "I"
Time                     Time                         "this morning"
Circumstances            Circumstances                "as I walked past..."
Effect                   Effect/impact                "making my mouth water"
SentenceBefore           Previous sentence            Contextual preceding part
Sentence                 Target sentence              Sentence with smell description
SentenceAfter            Following sentence           Contextual following part
Sentence_ModernJapanese  Modern Japanese translation  AI-generated translation
Sentence_English         English translation          AI-generated translation
pb                       Page number                  "6"
vol                      Volume number                "01"
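
Because the fields are produced by a language model, it can help to confirm that each record actually carries them. The sketch below is one possible check, not part of the script: missingFields is a hypothetical helper, and it deliberately leaves out pb and vol on the assumption that they are stored outside the individual smell record, as the pb value is in the sample under Output Format.

```javascript
// Per-smell field names from the table above (pb and vol excluded, see lead-in).
const SMELL_FIELDS = [
  "Book", "Smell_Word", "Smell_Source", "Quality", "Odour_Carrier",
  "Evoked_Odorant", "Location", "Perceiver", "Time", "Circumstances",
  "Effect", "SentenceBefore", "Sentence", "SentenceAfter",
  "Sentence_ModernJapanese", "Sentence_English",
];

// Hypothetical helper: lists the fields that are absent or not strings.
// Empty strings are accepted, since the sample output uses them for unknown values.
function missingFields(record) {
  return SMELL_FIELDS.filter((field) => typeof record[field] !== "string");
}
```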

Processing Flow

1. XML Parsing

  • Reads TEI-XML format files
  • Splits pages by <pb> (page break) tags
  • Extracts text content from each page
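
The page-splitting step can be sketched as follows. This is a simplified illustration rather than the script's actual parser: it uses regular expressions instead of a real XML parser, assumes the page number is carried in the n attribute of each <pb> tag (the usual TEI convention), and discards any text that precedes the first <pb>.

```javascript
// Hypothetical helper: splits a TEI-XML string into pages at each <pb> tag.
function splitPages(xml) {
  const breaks = [...xml.matchAll(/<pb\b[^>]*>/g)];
  return breaks.map((match, i) => {
    const start = match.index + match[0].length;
    const end = i + 1 < breaks.length ? breaks[i + 1].index : xml.length;
    const n = /\bn="([^"]*)"/.exec(match[0]); // page number, if present
    const text = xml
      .slice(start, end)
      .replace(/<[^>]+>/g, "") // drop remaining markup
      .replace(/\s+/g, " ")    // collapse whitespace
      .trim();
    return { pb: n ? n[1] : null, text };
  });
}
```

A regex approach like this ignores comments, CDATA sections, and entities, so a proper XML parser is the safer choice for production use.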

2. AI-Based Smell Information Extraction

Uses GPT-4 to perform the following tasks:

  1. Smell Expression Detection: Identifies smell-related descriptions in text
  2. Metadata Extraction: Extracts structured information listed above
  3. Translation Generation: Generates modern Japanese and English translations
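
For orientation, a request to an Azure OpenAI chat deployment might be assembled as in the sketch below. Whether the script calls the REST endpoint directly or goes through an SDK is not stated in this guide, so treat this as an assumption; buildChatRequest is a hypothetical helper, and the api-version value is a placeholder to be replaced with one your resource supports.

```javascript
// Hypothetical helper: builds the URL and fetch options for one page.
function buildChatRequest(config, systemPrompt, pageText) {
  const { endpoint, apiKey, deployment, apiVersion = "2024-06-01" } = config;
  const base = endpoint.replace(/\/+$/, ""); // tolerate a trailing slash
  const url =
    `${base}/openai/deployments/${encodeURIComponent(deployment)}` +
    `/chat/completions?api-version=${apiVersion}`;
  return {
    url,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json", "api-key": apiKey },
      body: JSON.stringify({
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: pageText },
        ],
        temperature: 0,
        response_format: { type: "json_object" },
      }),
    },
  };
}

// Sending it with the fetch built into Node.js 18+:
//   const { url, options } = buildChatRequest(config, prompt, text);
//   const res = await fetch(url, options);
//   const content = (await res.json()).choices[0].message.content;
```

When JSON object mode is used, the prompt itself needs to mention JSON, otherwise the API rejects the request.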

3. Prompt Template

The script uses a dedicated prompt template:

  • Provides English examples that demonstrate the extraction method to the model
  • Shows output examples in table format
  • Requests structured JSON output
  • Supplies context that helps the model interpret classical Japanese
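
A prompt with these properties could be assembled along the lines below. The script's real template is not reproduced in this guide, so everything here is illustrative: buildPrompt, the EXAMPLES array, the reduced set of columns, and the instruction wording are all assumptions.

```javascript
// Hypothetical few-shot example, built from the sample values in the field table.
const EXAMPLES = [
  {
    sentence: "The warm aroma of freshly baked bread drifted through the air.",
    Smell_Word: "aroma",
    Smell_Source: "of freshly baked bread",
    Quality: "warm",
  },
];

// Hypothetical helper: renders the examples as a table and asks for JSON output.
function buildPrompt(examples) {
  const rows = examples.map(
    (e) => `| ${e.sentence} | ${e.Smell_Word} | ${e.Smell_Source} | ${e.Quality} |`
  );
  return [
    "You extract smell descriptions from classical Japanese text (The Tale of Genji).",
    "Examples in English:",
    "| Sentence | Smell_Word | Smell_Source | Quality |",
    "| --- | --- | --- | --- |",
    ...rows,
    'Return a JSON object of the form {"smells": [...]}; use an empty array if the page has no smell description.',
  ].join("\n");
}
```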

4. Result Storage

  • Outputs structured data in JSON format
  • Saves results per page as an array
  • Records error information in an error field for pages that fail
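
The storage step might look like the sketch below, which keeps one entry per page and records a failure instead of aborting the run. It is an assumption-laden illustration: processPages, saveResults, and the extractPage callback are hypothetical names, and placing the error message in a top-level error key of the page entry is a guess at the layout.

```javascript
const fs = require("node:fs");

// extractPage is a hypothetical async function that returns { smells: [...] } for one page.
async function processPages(pages, extractPage) {
  const results = [];
  for (const page of pages) {
    try {
      results.push({ pb: page.pb, data: await extractPage(page) });
    } catch (err) {
      // Keep going; the failure is recorded alongside the page number.
      results.push({ pb: page.pb, error: err.message });
    }
  }
  return results;
}

// Writes the per-page array as pretty-printed UTF-8 JSON.
function saveResults(results, outputPath) {
  fs.writeFileSync(outputPath, JSON.stringify(results, null, 2), "utf8");
}
```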

Output Format

[
  {
    "pb": "6",
    "data": {
      "smells": [
        {
          "Book": "demo0",
          "Smell_Word": "御にほひ",
          "Smell_Source": "この御にほひ",
          "Quality": "",
          "Odour_Carrier": "",
          "Evoked_Odorant": "",
          "Location": "",
          "Perceiver": "この君",
          "Time": "",
          "Circumstances": "世にもてかしつきゝこゆれと...",
          "Effect": "おほかたのやむことなき...",
          "SentenceBefore": "一のみこは右大臣の...",
          "Sentence": "この御にほひにはならひ給へくもあらさりけれは...",
          "SentenceAfter": "はしめよりをしなへてのうへ...",
          "Sentence_ModernJapanese": "This lord was not accustomed to this fragrance...",
          "Sentence_English": "This lord was not accustomed to this fragrance..."
        }
      ]
    }
  }
]
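
Since smells are nested under each page, consumers often want a flat list. A minimal sketch, assuming the structure shown above (flattenSmells is a hypothetical helper, and pages without a data.smells array are simply skipped):

```javascript
// Hypothetical helper: one flat record per smell, tagged with its page number.
function flattenSmells(pages) {
  return pages.flatMap((page) =>
    (page.data?.smells ?? []).map((smell) => ({ pb: page.pb, ...smell }))
  );
}

// Usage (file name as referenced elsewhere in this guide):
//   const pages = JSON.parse(require("node:fs").readFileSync("smells-output.json", "utf8"));
//   console.log(flattenSmells(pages).length);
```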

Performance and Cost

API Usage

  • Rate Limiting: 1-second wait after each page processing
  • Model: GPT-4 (gpt-4.1 recommended)
  • Temperature: 0 (favors consistent, repeatable output)
  • Output Format: JSON object (the model is asked for structured JSON; malformed output can still occur, see Troubleshooting)
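
The rate-limiting behaviour can be sketched as a sequential loop with a pause after each item. The names sleep and forEachWithDelay are hypothetical, and the script's own loop may be structured differently.

```javascript
// Resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Processes items one at a time, waiting delayMs after each one.
async function forEachWithDelay(items, handler, delayMs = 1000) {
  const results = [];
  for (const item of items) {
    results.push(await handler(item));
    await sleep(delayMs); // 1-second wait after each page by default
  }
  return results;
}
```

Processing pages sequentially keeps the request rate predictable; running requests in parallel would be faster but makes it easier to hit the deployment's rate limit.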

Estimated Cost

For 1 volume (approximately 50 pages):

  • ~50 API requests
  • Processing time: ~1-2 minutes
  • Cost: Depends on token usage (follows GPT-4 pricing)
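
As a rough aid for planning, the arithmetic behind these figures can be written down as below. Both functions are hypothetical, and every input (latency, token counts, prices) must be supplied by you; no real pricing is assumed. Note that with a 1-second wait per page, 50 pages take at least 50 seconds before any API latency is counted, so the 1-2 minute figure corresponds to roughly 0.2 to 1.4 seconds per request.

```javascript
// Total run time: each page costs one request plus the fixed wait.
function estimateRunSeconds(pages, avgRequestSeconds, waitSeconds = 1) {
  return pages * (avgRequestSeconds + waitSeconds);
}

// Total cost from per-page token counts and caller-supplied prices per 1,000 tokens.
function estimateCost({ pages, inputTokensPerPage, outputTokensPerPage, inputPricePer1k, outputPricePer1k }) {
  const perPage =
    (inputTokensPerPage / 1000) * inputPricePer1k +
    (outputTokensPerPage / 1000) * outputPricePer1k;
  return pages * perPage;
}
```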

Troubleshooting

Common Errors

  1. XML file not found

    Error: XML file not found: [path]

    → Verify that the XML file path is correct.

  2. API authentication error

    Error: Authentication failed

    → Check the API key in your .env file.

  3. JSON parse error

    JSON parse failed

    → Inspect the AI output format and adjust the prompt if needed.
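
One common cause of a JSON parse error is a model response wrapped in a Markdown code fence. The sketch below shows a tolerant parser that strips such a fence and returns the raw text on failure so the offending output can be inspected. It is an illustration, not the script's own handling; parseModelJson is a hypothetical helper.

```javascript
// Build the fence marker programmatically to keep this example readable.
const FENCE = "`".repeat(3);
const FENCE_START = new RegExp("^" + FENCE + "(?:json)?\\s*", "i");
const FENCE_END = new RegExp("\\s*" + FENCE + "$");

// Hypothetical helper: parses model output, tolerating a surrounding code fence.
function parseModelJson(raw) {
  const cleaned = raw.trim().replace(FENCE_START, "").replace(FENCE_END, "");
  try {
    return { ok: true, value: JSON.parse(cleaned) };
  } catch (err) {
    // Keep the raw text so the prompt can be adjusted based on what came back.
    return { ok: false, error: `JSON parse failed: ${err.message}`, raw };
  }
}
```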

Debugging Methods

  • The page number being processed is displayed in the console
  • Note the page number at which the error occurred
  • Review the error field in smells-output.json
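
A quick summary of a finished run can make the failing pages easier to spot. The sketch below assumes each failed page entry carries a top-level error key; the exact location of the error field is not documented here, so adjust the check to the actual layout. summarizeRun is a hypothetical helper.

```javascript
// Hypothetical helper: counts pages and smells, and lists the pages that failed.
function summarizeRun(pages) {
  const failed = pages
    .filter((page) => page.error !== undefined)
    .map((page) => ({ pb: page.pb, error: page.error }));
  const smellCount = pages.reduce(
    (count, page) => count + (page.data?.smells?.length ?? 0),
    0
  );
  return { totalPages: pages.length, smellCount, failed };
}

// Usage:
//   const pages = JSON.parse(require("node:fs").readFileSync("smells-output.json", "utf8"));
//   console.log(summarizeRun(pages));
```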

Post-Processing Data

After extraction, index the data with:

# Consolidate data into smells-index.json
npm run build:smell-index
