Smell Extraction Guide

This document explains how to extract smell-related information from The Tale of Genji XML texts.

Overview

The scripts/extract-smells.js script automatically extracts smell descriptions from TEI-XML files of The Tale of Genji and generates structured JSON data.

Requirements

Prerequisites

Node.js (v18 or higher recommended)
Azure OpenAI API access

Environment Variables

Set the following environment variables in your .env file:

AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4
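
A missing variable usually surfaces later as an authentication error, so it can help to check the three values up front. The following is a minimal sketch, assuming the .env values have already been loaded into process.env (for example with the dotenv package); findMissingVars is a hypothetical helper, not part of the script.

```javascript
// Names of the variables listed above.
const REQUIRED_VARS = [
  "AZURE_OPENAI_ENDPOINT",
  "AZURE_OPENAI_API_KEY",
  "AZURE_OPENAI_DEPLOYMENT",
];

// Hypothetical helper: returns the names that are unset or blank in `env`.
function findMissingVars(env) {
  return REQUIRED_VARS.filter((name) => !env[name] || env[name].trim() === "");
}

const missing = findMissingVars(process.env);
if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(", ")}`);
}
```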

Usage

Basic Execution

node scripts/extract-smells.js [XML_FILE_PATH] [OUTPUT_JSON_PATH]

Examples

# Default (processes 01.xml)
node scripts/extract-smells.js

# Custom paths
node scripts/extract-smells.js /path/to/source.xml /path/to/output.json
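
Both arguments are optional, which suggests the script falls back to defaults when they are omitted. One way this could be handled is sketched below. The default values are assumptions: the guide only states that 01.xml is processed by default, and smells-output.json is borrowed from the debugging notes further down.

```javascript
// Hypothetical defaults; the real script may use different paths.
const DEFAULT_XML = "01.xml";
const DEFAULT_OUTPUT = "smells-output.json";

// argv[0] is the node binary and argv[1] the script path,
// so user-supplied arguments start at index 2.
function resolvePaths(argv) {
  const [xmlPath = DEFAULT_XML, outputPath = DEFAULT_OUTPUT] = argv.slice(2);
  return { xmlPath, outputPath };
}

console.log(resolvePaths(process.argv));
```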

Extracted Information

The script extracts the following information from each smell description:

Field	Description	Example
`Book`	Book name	"demo0"
`Smell_Word`	Word representing the smell	"aroma"
`Smell_Source`	Source of the smell	"of freshly baked bread"
`Quality`	Quality/nature of smell	"warm", "sweet"
`Odour_Carrier`	Medium that carries the smell	"through the air"
`Evoked_Odorant`	Odorant that the smell evokes	"cinnamon"
`Location`	Place where the smell occurs	"the bakery"
`Perceiver`	Who perceives the smell	"I"
`Time`	When the smell is perceived	"this morning"
`Circumstances`	Circumstances in which the smell is perceived	"as I walked past..."
`Effect`	Effect or impact of the smell	"making my mouth water"
`SentenceBefore`	Previous sentence	Contextual preceding part
`Sentence`	Target sentence	Sentence with smell description
`SentenceAfter`	Following sentence	Contextual following part
`Sentence_ModernJapanese`	Modern Japanese translation	AI-generated translation
`Sentence_English`	English translation	AI-generated translation
`pb`	Page number	"6"
`vol`	Volume number	"01"

Processing Flow

1. XML Parsing

Reads TEI-XML format files
Splits pages by <pb> (page break) tags
Extracts text content from each page
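
The three steps above can be sketched as follows. This is an illustration only: it assumes page breaks are written as <pb n="6"/> and uses regular expressions to stay dependency-free, whereas the actual script may well use an XML parser. splitPages and cleanText are hypothetical names.

```javascript
// Splits TEI-XML text into pages at each <pb/> milestone.
// Text before the first <pb/> is ignored.
function splitPages(xml) {
  const pbPattern = /<pb\b[^>]*?\bn="([^"]*)"[^>]*>/g;
  const pages = [];
  let current = null; // page currently being accumulated
  let lastIndex = 0;
  let match;
  while ((match = pbPattern.exec(xml)) !== null) {
    if (current) {
      current.text = cleanText(xml.slice(lastIndex, match.index));
      pages.push(current);
    }
    current = { pb: match[1], text: "" };
    lastIndex = pbPattern.lastIndex;
  }
  if (current) {
    current.text = cleanText(xml.slice(lastIndex));
    pages.push(current);
  }
  return pages;
}

// Removes remaining markup and collapses whitespace, leaving only text content.
function cleanText(fragment) {
  return fragment.replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
}
```

Each returned item has the form { pb, text }, so the page number stays attached to the text that is later sent for extraction.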

2. AI-Based Smell Information Extraction

Uses GPT-4 to perform the following tasks:

Smell Expression Detection: Identifies smell-related descriptions in text
Metadata Extraction: Extracts structured information listed above
Translation Generation: Generates modern Japanese and English translations

3. Prompt Template

The script uses a dedicated prompt template:

Provides English examples that demonstrate the extraction method
Shows output examples in table format
Requests structured JSON output
Facilitates understanding of classical Japanese context
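
To make the template's shape concrete, here is a hypothetical prompt builder. The wording is illustrative and is not the script's actual template; the example row reuses the English values from the field table in this guide, and buildMessages is an assumed name.

```javascript
// One English example annotation, taken from the field table in this guide.
const EXAMPLE = {
  Smell_Word: "aroma",
  Smell_Source: "of freshly baked bread",
  Quality: "warm",
  Odour_Carrier: "through the air",
  Evoked_Odorant: "cinnamon",
  Location: "the bakery",
  Perceiver: "I",
  Time: "this morning",
  Effect: "making my mouth water",
};

// Builds chat messages: a system message with a table-style example and
// the JSON instruction, plus a user message carrying the page text.
function buildMessages(pageText, book) {
  const system = [
    "You annotate smell descriptions in classical Japanese text from The Tale of Genji.",
    "Example annotation (English):",
    Object.keys(EXAMPLE).join(" | "),
    Object.values(EXAMPLE).join(" | "),
    'Return a JSON object of the form {"smells": [...]} using the same field names.',
    "Return an empty array if the page contains no smell description.",
  ].join("\n");
  return [
    { role: "system", content: system },
    { role: "user", content: `Book: ${book}\nText:\n${pageText}` },
  ];
}
```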

4. Result Storage

Outputs structured data in JSON format
Saves results per page as an array
Records error information
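
A minimal sketch of this storage step follows. It assumes that a failed page is kept in the array with a top-level error field (the exact shape is not documented here); toPageResult and saveResults are hypothetical names.

```javascript
const fs = require("node:fs");

// Wraps one page's outcome in the per-page shape of the output array.
// On failure the page is kept, with an empty smells list and the error message.
function toPageResult(pb, outcome) {
  if (outcome instanceof Error) {
    return { pb, data: { smells: [] }, error: outcome.message };
  }
  return { pb, data: outcome };
}

// Writes the whole array as pretty-printed JSON.
function saveResults(results, outputPath) {
  fs.writeFileSync(outputPath, JSON.stringify(results, null, 2), "utf8");
}
```

Keeping failed pages in the array, rather than dropping them, means the output still lists every page that was attempted.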

Output Format

[
  {
    "pb": "6",
    "data": {
      "smells": [
        {
          "Book": "demo0",
          "Smell_Word": "御にほひ",
          "Smell_Source": "この御にほひ",
          "Quality": "",
          "Odour_Carrier": "",
          "Evoked_Odorant": "",
          "Location": "",
          "Perceiver": "この君",
          "Time": "",
          "Circumstances": "世にもてかしつきゝこゆれと...",
          "Effect": "おほかたのやむことなき...",
          "SentenceBefore": "一のみこは右大臣の...",
          "Sentence": "この御にほひにはならひ給へくもあらさりけれは...",
          "SentenceAfter": "はしめよりをしなへてのうへ...",
          "Sentence_ModernJapanese": "This lord was not accustomed to this fragrance...",
          "Sentence_English": "This lord was not accustomed to this fragrance..."
        }
      ]
    }
  }
]
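
Because smells are nested under each page, downstream code often wants one flat record per smell. The sketch below shows one way to do that with the structure above; flattenSmells is a hypothetical helper and not part of the script.

```javascript
// Turns the per-page array into a flat list of smell records,
// copying the page number (pb) onto each record.
function flattenSmells(output) {
  return output.flatMap((page) => {
    const smells = (page.data && page.data.smells) || [];
    return smells.map((smell) => ({ pb: page.pb, ...smell }));
  });
}
```

Pages without any smells (or without a data field) simply contribute no records.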

Performance and Cost

API Usage

Rate Limiting: waits 1 second after processing each page
Model: GPT-4 (gpt-4.1 recommended)
Temperature: 0 (favors consistent, repeatable output)
Output Format: JSON object (the model is asked to return a single JSON object)
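
These settings map onto an Azure OpenAI chat completions request roughly as sketched below. The api-version value is an assumption (use whichever version your resource supports), and buildChatRequest and processPages are hypothetical names rather than the script's own functions.

```javascript
const API_VERSION = "2024-02-01"; // assumption: adjust to your resource

// Builds the URL and fetch options for one chat completions call.
function buildChatRequest(env, messages) {
  const endpoint = env.AZURE_OPENAI_ENDPOINT.replace(/\/+$/, ""); // drop trailing slashes
  const url =
    `${endpoint}/openai/deployments/${env.AZURE_OPENAI_DEPLOYMENT}` +
    `/chat/completions?api-version=${API_VERSION}`;
  return {
    url,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "api-key": env.AZURE_OPENAI_API_KEY,
      },
      body: JSON.stringify({
        messages,
        temperature: 0,
        response_format: { type: "json_object" },
      }),
    },
  };
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Handles pages one at a time and pauses after each one.
// `handlePage` is injected so the loop can be tried without network access.
async function processPages(pages, handlePage, waitMs = 1000) {
  const results = [];
  for (const page of pages) {
    results.push(await handlePage(page));
    await sleep(waitMs);
  }
  return results;
}
```

With Node.js v18 or higher, the request can be sent with the built-in fetch, for example `const { url, options } = buildChatRequest(process.env, messages); const response = await fetch(url, options);`.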

Estimated Cost

For 1 volume (approximately 50 pages):

~50 API requests
Processing time: ~1-2 minutes
Cost: Depends on token usage (follows GPT-4 pricing)
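
The processing time can be sanity-checked with simple arithmetic: the 1-second wait alone accounts for about 50 seconds per 50 pages, and the rest depends on how long each API request takes. The sketch below uses a hypothetical average of 1 second per request; actual latency varies with the model and page length.

```javascript
// Rough run time: every page costs one request plus the fixed wait.
function estimateRunSeconds(pages, waitSeconds, avgRequestSeconds) {
  return pages * (waitSeconds + avgRequestSeconds);
}

// 50 pages, 1-second wait, assumed 1 second per request
console.log(estimateRunSeconds(50, 1, 1)); // 100
```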

Troubleshooting

Common Errors

XML file not found
```
Error: XML file not found: [path]
```
→ Verify XML file path

API authentication error
```
Error: Authentication failed
```
→ Check API key in .env file

JSON parse error
```
JSON parse failed
```
→ Check AI output format, adjust prompt
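
When checking the AI output format, one common cause of parse failures is a response wrapped in a Markdown code fence. The following sketch shows a tolerant parser that strips such a fence and reports failures instead of throwing. It is an illustration under that assumption, not the script's actual handling; parseModelJson is a hypothetical name.

```javascript
const FENCE = "`".repeat(3); // a Markdown code fence marker

// Parses model output as JSON, tolerating a surrounding code fence.
// Returns { ok: true, value } on success or { ok: false, error, raw } on failure.
function parseModelJson(raw) {
  let text = raw.trim();
  if (text.startsWith(FENCE) && text.endsWith(FENCE)) {
    // Drop the opening fence line (which may carry a language tag)
    // and the closing fence.
    const firstNewline = text.indexOf("\n");
    text = text.slice(firstNewline + 1, text.length - FENCE.length).trim();
  }
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (err) {
    return { ok: false, error: `JSON parse failed: ${err.message}`, raw };
  }
}
```

Keeping the raw text in the failure result makes it easier to see what the model actually returned when adjusting the prompt.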

Debugging Methods

The console shows the page number currently being processed
Note the page number at which an error occurred
Review the error field for that page in smells-output.json
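
For larger runs, the pages that failed can be listed programmatically. This sketch assumes each failed page carries a top-level error field, which this guide does not spell out; pagesWithErrors is a hypothetical helper.

```javascript
// Returns the page numbers of all entries that have a non-empty error field.
function pagesWithErrors(results) {
  return results.filter((page) => page.error).map((page) => page.pb);
}
```

It could be applied to the parsed output file, for example `pagesWithErrors(JSON.parse(require("node:fs").readFileSync("smells-output.json", "utf8")))`.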

Post-Processing Data

After extraction, index the data with:

# Consolidate data into smells-index.json
npm run build:smell-index