Smell Extraction Guide
This document explains how to extract smell-related information from The Tale of Genji XML texts.
Overview
The scripts/extract-smells.js script automatically extracts smell descriptions from TEI-XML files of The Tale of Genji and generates structured JSON data.
Requirements
Prerequisites
- Node.js (v18 or higher recommended)
- Azure OpenAI API access
Environment Variables
Set the following environment variables in your .env file:
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_API_KEY=your-api-key
AZURE_OPENAI_DEPLOYMENT=gpt-4
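A quick check before a run can catch a missing variable early. The sketch below is not taken from extract-smells.js; findMissingEnv is a hypothetical helper, and it assumes the variables have already been loaded into process.env (for example by the dotenv package).

```javascript
// Hypothetical helper: the variable names come from the .env example above,
// but this check itself is an assumption, not code from extract-smells.js.
const REQUIRED_VARS = [
  "AZURE_OPENAI_ENDPOINT",
  "AZURE_OPENAI_API_KEY",
  "AZURE_OPENAI_DEPLOYMENT",
];

// Return the names of required variables that are unset or blank.
function findMissingEnv(env) {
  return REQUIRED_VARS.filter((name) => !env[name] || env[name].trim() === "");
}

const missing = findMissingEnv(process.env);
if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(", ")}`);
}
```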
Usage
Basic Execution
node scripts/extract-smells.js [XML_FILE_PATH] [OUTPUT_JSON_PATH]
Examples
# Default (processes 01.xml)
node scripts/extract-smells.js
# Custom paths
node scripts/extract-smells.js /path/to/source.xml /path/to/output.json
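Both positional arguments are optional, so the script presumably falls back to defaults. A minimal sketch of that resolution follows; resolvePaths is a hypothetical name, and the default values are assumptions (the guide only states that 01.xml is processed by default, and smells-output.json is the file name mentioned under Debugging Methods).

```javascript
// Hypothetical defaults: the guide only says the default input is 01.xml;
// its directory and the output file name used here are assumptions.
const DEFAULT_XML = "01.xml";
const DEFAULT_OUTPUT = "smells-output.json";

// argv is expected in process.argv form: [node, script, xmlPath?, outputPath?].
function resolvePaths(argv) {
  const [xmlPath = DEFAULT_XML, outputPath = DEFAULT_OUTPUT] = argv.slice(2);
  return { xmlPath, outputPath };
}

console.log(resolvePaths(process.argv));
```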
Extracted Information
The script extracts the following information from each smell description:
| Field | Description | Example |
|---|---|---|
| Book | Book name | "demo0" |
| Smell_Word | Word representing the smell | "aroma" |
| Smell_Source | Source of the smell | "of freshly baked bread" |
| Quality | Quality/nature of smell | "warm", "sweet" |
| Odour_Carrier | Medium of smell | "through the air" |
| Evoked_Odorant | Evoked odorant | "cinnamon" |
| Location | Location | "the bakery" |
| Perceiver | Perceiver | "I" |
| Time | Time | "this morning" |
| Circumstances | Circumstances | "as I walked past..." |
| Effect | Effect/impact | "making my mouth water" |
| SentenceBefore | Previous sentence | Contextual preceding part |
| Sentence | Target sentence | Sentence with smell description |
| SentenceAfter | Following sentence | Contextual following part |
| Sentence_ModernJapanese | Modern Japanese translation | AI-generated translation |
| Sentence_English | English translation | AI-generated translation |
| pb | Page number | "6" |
| vol | Volume number | "01" |
Processing Flow
1. XML Parsing
- Reads TEI-XML format files
- Splits pages by <pb> (page break) tags
- Extracts text content from each page
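The page-splitting step can be sketched as follows. This is a simplified, regex-based illustration rather than the script's actual parser: splitPages is a hypothetical name, and it assumes each page break is written as <pb n="6"/> using the standard TEI n attribute. A real XML parser would handle edge cases (comments, CDATA, nested markup) more reliably.

```javascript
// Split TEI text into pages at <pb/> milestones. Assumes each page break looks
// like <pb n="6"/> (standard TEI "n" attribute); the real script may differ.
function splitPages(xml) {
  const pbPattern = /<pb\b[^>]*?\bn="([^"]*)"[^>]*>/g;
  const breaks = [...xml.matchAll(pbPattern)];
  return breaks.map((match, i) => {
    // A page runs from the end of its <pb/> tag to the start of the next one.
    const start = match.index + match[0].length;
    const end = i + 1 < breaks.length ? breaks[i + 1].index : xml.length;
    // Drop remaining tags and collapse whitespace to get plain page text.
    const text = xml
      .slice(start, end)
      .replace(/<[^>]+>/g, "")
      .replace(/\s+/g, " ")
      .trim();
    return { pb: match[1], text };
  });
}
```

Text that appears before the first <pb> tag is ignored in this sketch.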
2. AI-Based Smell Information Extraction
Uses GPT-4 to perform the following tasks:
- Smell Expression Detection: Identifies smell-related descriptions in text
- Metadata Extraction: Extracts structured information listed above
- Translation Generation: Generates modern Japanese and English translations
3. Prompt Template
The script uses a dedicated prompt template:
- Learns extraction methods from English examples
- Shows output examples in table format
- Requests structured JSON output
- Facilitates understanding of classical Japanese context
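To make the template's shape concrete, here is one way such a prompt could be assembled. The wording is illustrative only and is not the script's real template; buildPrompt and EXAMPLE are hypothetical names, and the example values are the ones from the field table above.

```javascript
// Illustrative English example, using the sample values from the field table.
const EXAMPLE = {
  Smell_Word: "aroma",
  Smell_Source: "of freshly baked bread",
  Quality: "warm, sweet",
  Odour_Carrier: "through the air",
  Evoked_Odorant: "cinnamon",
  Location: "the bakery",
  Perceiver: "I",
  Time: "this morning",
  Circumstances: "as I walked past...",
  Effect: "making my mouth water",
};

// Hypothetical prompt builder: instructions, an example table, then the page text.
function buildPrompt(pageText) {
  const fields = Object.keys(EXAMPLE);
  const table = [
    `| ${fields.join(" | ")} |`,
    `|${fields.map(() => "---").join("|")}|`,
    `| ${fields.map((field) => EXAMPLE[field]).join(" | ")} |`,
  ].join("\n");
  return [
    "Extract every smell description from the classical Japanese text below.",
    "Example of the fields to fill in (English):",
    table,
    'Respond with a JSON object of the form {"smells": [...]}, one entry per smell.',
    "Text:",
    pageText,
  ].join("\n\n");
}
```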
4. Result Storage
- Outputs structured data in JSON format
- Saves results per page as an array
- Records error information
Output Format
[
{
"pb": "6",
"data": {
"smells": [
{
"Book": "demo0",
"Smell_Word": "御にほひ",
"Smell_Source": "この御にほひ",
"Quality": "",
"Odour_Carrier": "",
"Evoked_Odorant": "",
"Location": "",
"Perceiver": "この君",
"Time": "",
"Circumstances": "世にもてかしつきゝこゆれと...",
"Effect": "おほかたのやむことなき...",
"SentenceBefore": "一のみこは右大臣の...",
"Sentence": "この御にほひにはならひ給へくもあらさりけれは...",
"SentenceAfter": "はしめよりをしなへてのうへ...",
"Sentence_ModernJapanese": "This lord was not accustomed to this fragrance...",
"Sentence_English": "This lord was not accustomed to this fragrance..."
}
]
}
}
]
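Because the output is grouped by page, downstream code often wants one flat row per smell. A minimal sketch, assuming the structure shown above: flattenSmells is a hypothetical helper, and passing vol as an argument is an assumption, since the sample output does not show where the volume number is stored.

```javascript
// Flatten per-page results into one row per smell, copying the page number
// onto each row. Pages without a data.smells array contribute no rows.
function flattenSmells(pages, vol) {
  return pages.flatMap((page) =>
    (page.data?.smells ?? []).map((smell) => ({ ...smell, pb: page.pb, vol }))
  );
}

const sample = [
  { pb: "6", data: { smells: [{ Book: "demo0", Smell_Word: "御にほひ" }] } },
];
console.log(flattenSmells(sample, "01"));
```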
Performance and Cost
API Usage
- Rate Limiting: 1-second wait after each page processing
- Model: GPT-4 (gpt-4.1 recommended)
- Temperature: temperature: 0 (consistency-focused)
- Output Format: JSON Object (guaranteed structured output)
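These settings map onto an Azure OpenAI chat completions request roughly as follows. This is a sketch under assumptions, not the script's actual code: it calls the REST endpoint directly with Node's built-in fetch, whereas the script may use an SDK; buildChatRequest and processPages are hypothetical names, and the API version is a guess that should be checked against your resource.

```javascript
// Assumed API version; check which versions your Azure resource supports.
const API_VERSION = "2024-06-01";

// Build the URL and fetch options for one page. The settings mirror the list
// above: temperature 0 and JSON object output.
function buildChatRequest(env, prompt) {
  const endpoint = env.AZURE_OPENAI_ENDPOINT.replace(/\/+$/, "");
  const url =
    `${endpoint}/openai/deployments/${env.AZURE_OPENAI_DEPLOYMENT}` +
    `/chat/completions?api-version=${API_VERSION}`;
  return {
    url,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "api-key": env.AZURE_OPENAI_API_KEY,
      },
      body: JSON.stringify({
        messages: [{ role: "user", content: prompt }],
        temperature: 0,
        response_format: { type: "json_object" },
      }),
    },
  };
}

// Pause between pages (1 second in this guide).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Send one request per prompt, waiting after each page.
async function processPages(env, prompts, waitMs = 1000) {
  const results = [];
  for (const prompt of prompts) {
    const { url, options } = buildChatRequest(env, prompt);
    const response = await fetch(url, options); // global fetch, Node 18+
    results.push(await response.json());
    await sleep(waitMs);
  }
  return results;
}
```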
Estimated Cost
For 1 volume (approximately 50 pages):
- ~50 API requests
- Processing time: ~1-2 minutes
- Cost: Depends on token usage (follows GPT-4 pricing)
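The processing time can be reasoned about with simple arithmetic: the 1-second wait alone accounts for about 50 seconds per 50 pages, and the rest is API latency. The sketch below uses a hypothetical estimateSeconds helper; the latency values are illustrative assumptions, not measurements.

```javascript
// Rough run-time estimate: each page costs one API round trip plus the wait.
// Only the 1-second wait comes from this guide; latency is an assumed input.
function estimateSeconds(pages, latencySeconds, waitSeconds = 1) {
  return pages * (latencySeconds + waitSeconds);
}

console.log(estimateSeconds(50, 0.25)); // 62.5
console.log(estimateSeconds(50, 1.25)); // 112.5
```

If responses take several seconds each, which is plausible for long structured output, the total grows proportionally.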
Troubleshooting
Common Errors
- XML file not found
  Error: XML file not found: [path] → Verify XML file path
- API authentication error
  Error: Authentication failed → Check API key in .env file
- JSON parse error
  JSON parse failed → Check AI output format, adjust prompt
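For the JSON parse case, one common pattern is to catch the failure per page and record it rather than abort the run. The sketch below illustrates that pattern; parsePageResponse is a hypothetical name, and the { pb, error } record shape is an assumption about how the script stores error information.

```javascript
// Parse the model's reply for one page. On failure, keep the page number and
// the error message so the remaining pages can still be processed.
function parsePageResponse(pb, rawText) {
  try {
    return { pb, data: JSON.parse(rawText) };
  } catch (err) {
    return { pb, error: `JSON parse failed: ${err.message}` };
  }
}
```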
Debugging Methods
- Page numbers being processed are displayed in console
- Check page number where error occurred
- Review error field in smells-output.json
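Reviewing the error field can be automated with a few lines. This sketch assumes each entry in smells-output.json is either { pb, data } or { pb, error }; that shape and the pagesWithErrors helper are assumptions, so adjust them to the actual file.

```javascript
const fs = require("node:fs");

// Return the page numbers of entries that carry an "error" field.
function pagesWithErrors(pages) {
  return pages.filter((page) => page.error !== undefined).map((page) => page.pb);
}

// Only read the file if it exists in the current directory.
const outputPath = "smells-output.json";
if (fs.existsSync(outputPath)) {
  const pages = JSON.parse(fs.readFileSync(outputPath, "utf8"));
  console.log("Pages with errors:", pagesWithErrors(pages));
}
```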
Post-Processing Data
After extraction, index the data with:
# Consolidate data into smells-index.json
npm run build:smell-index
References
- Azure OpenAI Service Documentation
- TEI-XML Specification
- Project root README.md