Tibetan Text Metrics
Compare Tibetan texts to discover how similar they are. This tool helps scholars identify shared passages, textual variations, and relationships between different versions of Tibetan manuscripts. Part of the TTM project.
Step 1: Upload Your Texts
Upload two or more Tibetan text files (.txt format). If your texts have chapters, separate them with the ༈ marker so the tool can compare chapter-by-chapter.
Tip: For best performance, keep each file under 1 MB and save it as a UTF-8 encoded .txt file.
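For illustration, chapter splitting on the ༈ marker could look like the sketch below. The function name `split_chapters` is hypothetical and is not taken from the tool; the tool's own handling of whitespace and empty chapters may differ.

```python
CHAPTER_MARKER = "\u0f08"  # the ༈ marker named in the upload instructions


def split_chapters(text: str) -> list[str]:
    """Split a text on the chapter marker and drop empty pieces."""
    parts = (part.strip() for part in text.split(CHAPTER_MARKER))
    return [part for part in parts if part]


# A text without any marker is treated as a single chapter.
print(len(split_chapters("chapter one \u0f08 chapter two")))  # 2
print(len(split_chapters("no marker here")))                  # 1
```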
Step 2: Choose Analysis Type
Pick a preset for quick results, or use Custom for full control.
What each preset includes:
| Preset | Jaccard | LCS | Fuzzy | Semantic AI |
|---|---|---|---|---|
| Standard | ✓ | ✓ | ✓ | — |
| Deep | ✓ | ✓ | ✓ | ✓ |
| Quick | ✓ | — | — | — |
Fine-tune each metric and option:
Compare the actual words used in texts
'Word' keeps multi-syllable words together — recommended for Jaccard.
Remove common particles (གི, ལ, ནི) before comparing.
Treat variants as equivalent (གི/ཀྱི/གྱི → གི). Useful for different scribal conventions.
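As a rough sketch of the two options above, the example below maps particle variants to one form and then filters common particles. The variant map and particle list contain only the examples given in the option descriptions; the tool's real lists are probably longer, and the order of the two steps is an assumption.

```python
# Variant map and particle list taken from the option descriptions only.
PARTICLE_VARIANTS = {"ཀྱི": "གི", "གྱི": "གི"}
COMMON_PARTICLES = {"གི", "ལ", "ནི"}


def normalize_particles(tokens: list[str]) -> list[str]:
    """Replace each known variant with its canonical form."""
    return [PARTICLE_VARIANTS.get(token, token) for token in tokens]


def filter_common(tokens: list[str]) -> list[str]:
    """Drop tokens that are common particles."""
    return [token for token in tokens if token not in COMMON_PARTICLES]


def preprocess(tokens: list[str]) -> list[str]:
    # Assumption: normalize first, so variant spellings are filtered too.
    return filter_common(normalize_particles(tokens))
```

Normalizing before filtering means a variant spelling is removed just like the canonical particle would be.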
Find shared passages in the same order
Finds the longest sequence of words appearing in both texts.
The 'min' normalization option is useful for finding quotes or excerpts.
Detect similar but not identical text
All options work at the Tibetan syllable level.
Compare meaning using AI (slower)
'buddhist-sentence-similarity' works best for Buddhist texts.
See step-by-step progress during analysis.
Results
Results Summary — Compare chapters across your texts
Get Expert Insights
Let AI help you understand what the numbers mean and what patterns they reveal about your texts.
Understanding Your Results
After running the analysis, click "Explain My Results" to get a plain-language interpretation of what the similarity scores mean for your texts.
Visual Comparison
Vocabulary Overlap (Jaccard Similarity)
What it measures: The share of unique words that appear in both texts.
How to read it: A score of 70% means 70% of all unique words found in either text appear in both. Higher scores = more shared vocabulary.
What it tells you:
- High scores (>70%): Texts use very similar vocabulary — possibly the same source or direct copying
- Medium scores (40-70%): Texts share significant vocabulary — likely related topics or traditions
- Low scores (<40%): Texts use different words — different sources or heavily edited versions
Good to know: This metric ignores word order and how often words repeat. It only asks "does this word appear in both texts?"
Tips:
- Use the "Filter common words" option to focus on meaningful content words rather than grammatical particles.
- Word mode is recommended for Jaccard. Syllable mode may inflate scores because common syllables (like ས, ར, ན) appear in many different words.
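The calculation behind this score can be sketched in a few lines. This is a minimal illustration of the standard Jaccard formula (shared unique words divided by all unique words), not the tool's own code. English words stand in for Tibetan tokens.

```python
def jaccard(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Shared unique tokens divided by all unique tokens in either text."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty texts: treated as no overlap in this sketch
    return len(set_a & set_b) / len(union)


a = "the quick brown fox".split()
b = "the lazy brown dog".split()
# Shared: {the, brown}; all unique words: 6, so the score is 2/6.
print(round(jaccard(a, b), 2))  # 0.33
```

Because the tokens are turned into sets, repeated words and word order have no effect on the result.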
Shared Sequences (Longest Common Subsequence)
What it measures: The longest chain of words that appears in both texts in the same order. The words do not have to be adjacent.
How to read it: Higher scores mean more of the text follows a shared word order. A score of 0.6 means the shared sequence covers about 60% of the text length used for normalization.
Example: If Text A says "the quick brown fox" and Text B says "the lazy brown dog", the shared sequence is "the brown" — words that appear in both, in the same order.
What it tells you:
- High scores (>0.6): Texts share substantial passages — likely direct copying or common source
- Medium scores (0.3-0.6): Some shared phrasing — possibly related traditions
- Low scores (<0.3): Different word ordering — independent compositions or heavy editing
Why this is different from vocabulary overlap:
- Vocabulary overlap asks: "Do they use the same words?"
- Sequence matching asks: "Do they say things in the same order?"
Two texts might share many words (high Jaccard) but arrange them differently (low LCS), suggesting they discuss similar topics but were composed independently.
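A minimal sketch of the sequence calculation, using the standard dynamic-programming method for longest common subsequence. The normalization names 'avg', 'min' and 'max' and their meanings (dividing by the average, shorter, or longer text length) are assumptions made for illustration; only 'min' is mentioned in the options above.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(a: list[str], b: list[str], normalization: str = "avg") -> float:
    """LCS length divided by a text length (assumed normalization modes)."""
    denominators = {
        "avg": (len(a) + len(b)) / 2,
        "min": min(len(a), len(b)),
        "max": max(len(a), len(b)),
    }
    denominator = denominators[normalization]
    return lcs_length(a, b) / denominator if denominator else 0.0


a = "the quick brown fox".split()
b = "the lazy brown dog".split()
print(lcs_length(a, b))  # 2, the shared sequence is "the brown"
```

Under these assumptions, a short quote fully contained in a long text scores 1.0 with 'min' but much lower with 'max', which is why 'min' suits excerpt hunting.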
Approximate Matching (Fuzzy Similarity)
What it measures: How similar texts are, even when they're not exactly the same.
How to read it: Scores from 0 to 1. Higher = more similar. A score of 0.85 means the texts are 85% alike.
What it tells you:
- High scores (>0.8): Very similar texts with minor differences (spelling, small edits)
- Medium scores (0.5-0.8): Noticeably different but clearly related
- Low scores (<0.5): Substantially different texts
Why it matters for Tibetan texts:
- Catches spelling variations between manuscripts
- Finds scribal differences and regional conventions
- Identifies passages that were slightly modified
Recommended methods:
- Syllable pairs (ngram): Best for Tibetan — compares pairs of syllables
- Count syllable changes: Good for finding minor edits
- Word frequency: Useful when certain words repeat often
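To make the first two methods concrete, here is a hedged sketch of syllable-level matching. It splits on the tsheg (་) and shad (།), compares sets of syllable pairs, and counts syllable edits. The scoring formulas are common textbook choices and are assumptions; the tool may use a different library or formula. Romanized placeholders stand in for Tibetan syllables.

```python
import re

TSHEG = "\u0f0b"  # ་ syllable delimiter
SHAD = "\u0f0d"   # । clause and sentence delimiter
_SEPARATORS = re.compile(f"[{TSHEG}{SHAD}\\s]+")


def syllables(text: str) -> list[str]:
    """Split Tibetan text into syllables on tsheg, shad and whitespace."""
    return [s for s in _SEPARATORS.split(text) if s]


def pair_similarity(a: list[str], b: list[str]) -> float:
    """Overlap of adjacent syllable pairs (Jaccard over bigram sets)."""
    pairs_a = {tuple(a[i:i + 2]) for i in range(len(a) - 1)}
    pairs_b = {tuple(b[i:i + 2]) for i in range(len(b) - 1)}
    union = pairs_a | pairs_b
    if not union:
        return 1.0 if a == b else 0.0
    return len(pairs_a & pairs_b) / len(union)


def edit_similarity(a: list[str], b: list[str]) -> float:
    """1 minus the syllable-level Levenshtein distance over the longer length."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - prev[-1] / longest


a = ["ka", "kha", "ga", "nga"]
b = ["ka", "kha", "ja", "nga"]  # one syllable changed
print(edit_similarity(a, b))  # 0.75
print(pair_similarity(a, b))  # 0.2
```

A single changed syllable breaks two syllable pairs, so the pair-based score drops faster than the edit-based score for isolated spelling differences.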
Meaning Similarity (Semantic Analysis)
What it measures: Whether texts convey similar meaning, even if they use different words.
How to read it: Scores from 0 to 1. Higher = more similar meaning. A score of 0.8 means the texts express very similar ideas.
What it tells you:
- High scores (>0.75): Texts say similar things, even if worded differently
- Medium scores (0.5-0.75): Related topics or themes
- Low scores (<0.5): Different subject matter
How it works: An AI model (trained on Buddhist texts) reads both passages and judges how similar their meaning is. This catches similarities that word-matching would miss.
When to use it:
- Finding paraphrased passages
- Identifying texts that discuss the same concepts differently
- Comparing translations or commentaries
Note: This takes longer to compute but provides insights the other metrics can't.
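The description above does not say how the model turns two passages into a score. A common approach, assumed here purely for illustration, is to convert each passage into a vector of numbers (an embedding) and compare the vectors with cosine similarity. The vectors below are made up, and the embedding step itself is not shown.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for zero vectors; treated as 0 in this sketch
    return dot / (norm_u * norm_v)


# Hypothetical embeddings; a real model would produce these from the passages.
passage_a = [0.2, 0.7, 0.1]
passage_b = [0.25, 0.65, 0.05]
score = cosine_similarity(passage_a, passage_b)
```

Vectors pointing in nearly the same direction give a score close to 1, regardless of which words the passages share.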
Text Length by Section
This chart shows how many words are in each chapter or section. Taller bars = longer sections.
Why it matters: If sections have very different lengths, it might explain differences in similarity scores.
Vocabulary Containment (Directional)
What it shows: What percentage of one text's unique vocabulary appears in the other text.
How to read it:
- "Text A → Text B" means: "What % of Text A's vocabulary is found in Text B?"
- 90% means 90% of the unique words in the source text also appear in the target text
What it tells you:
- If Text A → Text B is 95% but Text B → Text A is 60%, then Text B contains almost all of Text A's vocabulary plus additional words
- This suggests Text B might be an expansion or commentary on Text A
- Asymmetric containment often indicates a base text + commentary relationship
Useful for:
- Identifying which text is the "base" (its smaller vocabulary is almost fully contained in the longer text)
- Understanding directionality of textual relationships
- Distinguishing between shared sources vs. one text derived from another
Tip: Unlike Jaccard (which is symmetric), containment is directional — it tells you which text's vocabulary is "inside" the other.
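The directional calculation can be sketched as follows. This is a minimal illustration of the definition given above (share of the source text's unique words found in the target text), not the tool's own code. Letters stand in for Tibetan words.

```python
def containment(source: list[str], target: list[str]) -> float:
    """Share of the source's unique tokens that also appear in the target."""
    source_vocab, target_vocab = set(source), set(target)
    if not source_vocab:
        return 0.0
    return len(source_vocab & target_vocab) / len(source_vocab)


base = ["a", "b", "c", "d"]
commentary = ["a", "b", "c", "d", "e", "f", "g", "h"]
print(containment(base, commentary))  # 1.0, all base words appear in the commentary
print(containment(commentary, base))  # 0.5, half the commentary's words are in the base
```

Swapping the arguments changes the denominator, which is what makes the measure directional.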