Tibetan Text Metrics Web App
A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts, providing a graphical interface to the core functionalities of the Tibetan Text Metrics (TTM) project. Powered by Mistral 7B via OpenRouter for advanced text analysis.
Step 1: Upload Your Tibetan Text Files
Upload one or more .txt files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible using the marker '༈' (sbrul shad).
Note: Maximum file size: 10MB per file. For optimal performance, use files under 1MB.
Step 2: Configure and run the analysis
Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using the marker '༈' (sbrul shad). The tool will split files based on this marker.
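For reference, splitting on the sbrul shad marker amounts to a simple text split. The sketch below is illustrative only; the file name is hypothetical, and the app's own splitting may handle whitespace or empty segments differently.

```python
# Illustrative sketch: split a Tibetan text file into segments on the
# sbrul shad marker '༈'. Not the app's exact implementation.
from pathlib import Path

def split_into_segments(path: str) -> list[str]:
    text = Path(path).read_text(encoding="utf-8")
    # Split on the marker and drop empty or whitespace-only pieces.
    return [seg.strip() for seg in text.split("༈") if seg.strip()]

segments = split_into_segments("my_text.txt")  # hypothetical file name
print(f"{len(segments)} segments found")
```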
Using Facebook's pre-trained FastText model for semantic similarity. Other model options have been removed.
Choose how aggressively to filter out common Tibetan particles and function words when calculating similarity. This helps focus on meaningful content words.
Results
Similarity Metrics Preview
AI Analysis
The AI will analyze your text similarities and provide insights into patterns and relationships. Make sure to set up your OpenRouter API key for this feature.
AI-Powered Analysis
The AI analysis is powered by Mistral 7B Instruct via the OpenRouter API. To use this feature:
- Get an API key from OpenRouter
- Create a .env file in the webapp directory
- Add: OPENROUTER_API_KEY=your_api_key_here
The AI will automatically analyze your text similarities and provide insights into patterns and relationships.
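For orientation, a request to OpenRouter's OpenAI-compatible chat completions endpoint looks roughly like the sketch below. The prompt and exact model slug used by the app may differ; treat this as an illustration of the setup, not the app's code.

```python
# Rough sketch of an OpenRouter call with Mistral 7B Instruct.
# The prompt here is illustrative; the app builds its own prompt
# from the computed similarity metrics.
import os
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def analyze_metrics(summary_text: str) -> str:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "mistralai/mistral-7b-instruct",
            "messages": [
                {
                    "role": "user",
                    "content": "Interpret these Tibetan text similarity metrics:\n" + summary_text,
                }
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```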
Analysis of Tibetan Text Similarity Metrics
Click the 'Help Interpret Results' button above to generate an AI-powered analysis of your similarity metrics.
Detailed Metric Analysis
Jaccard Similarity (%)
This metric quantifies the lexical overlap between two text segments by comparing their sets of unique words, optionally filtering out common Tibetan stopwords.
It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as (Number of common unique words) / (Total number of unique words in both texts combined) * 100.
Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
Stopword Filtering: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before comparison. This helps focus on meaningful content words rather than grammatical elements.
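The calculation itself is small. The sketch below assumes segments have already been tokenized into word lists (e.g. with botok); the stopword set shown is a tiny illustrative subset, not the app's actual list.

```python
# Jaccard similarity over unique-word sets, reported as a percentage.
# STOPWORDS below is an illustrative subset, not the app's full list.
STOPWORDS = {"གི", "ཀྱི", "གྱི", "ནི", "དང"}

def jaccard_similarity(tokens_a, tokens_b, filter_stopwords=True):
    set_a, set_b = set(tokens_a), set(tokens_b)
    if filter_stopwords:
        set_a -= STOPWORDS
        set_b -= STOPWORDS
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union) * 100
```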
Normalized LCS (Longest Common Subsequence)
This metric measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is 'the quick brown fox jumps' and Text B is 'the lazy cat and brown dog jumps high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
No Stopword Filtering. Unlike metrics such as Jaccard Similarity or TF-IDF Cosine Similarity (which typically filter out common stopwords to focus on content-bearing words), the LCS calculation in this tool intentionally uses the raw, unfiltered sequence of tokens from your texts. This design choice allows LCS to capture structural similarities and the flow of language, including the use of particles and common words that contribute to sentence construction and narrative sequence. By not removing stopwords, LCS can reveal similarities in phrasing and textual structure that might otherwise be obscured, making it a valuable complement to metrics that focus purely on lexical overlap of keywords.
Note on Interpretation: It is possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
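For completeness, here is a minimal sketch of the computation described above: a standard dynamic-programming LCS over the raw token sequences, normalized by the length of the longer segment and reported as a percentage. It is illustrative, not the app's exact code.

```python
def normalized_lcs(tokens_a, tokens_b):
    m, n = len(tokens_a), len(tokens_b)
    if m == 0 or n == 0:
        return 0.0
    # dp[i][j] = LCS length of the first i tokens of A and first j tokens of B
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if tokens_a[i - 1] == tokens_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n) * 100
```

On the English example above, the LCS 'the brown jumps' has length 3 and the longer text has 8 words, giving 3/8 = 37.5%.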
Semantic Similarity
Computes the cosine similarity between semantic embeddings of text segments:
FastText Model: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
- Processes Tibetan text using botok tokenization (same as other metrics)
- Uses the pre-tokenized words from botok rather than doing its own tokenization
- Better for texts with specialized Tibetan vocabulary
- More stable results for general Tibetan text comparison
- Optimized for Tibetan language with:
  - Word-based tokenization preserving Tibetan syllable markers
  - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
  - Enhanced parameters based on Tibetan NLP research
Stopword Filtering: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before computing embeddings. This helps focus on meaningful content words.
Note: This metric works best when combined with other metrics for a more comprehensive analysis.
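As a rough sketch of the underlying idea, the example below averages the FastText vectors of each segment's pre-tokenized words and takes the cosine similarity. The TF-IDF weighting of word vectors and the stopword filtering described above are omitted for brevity, and the model path is a placeholder.

```python
import fasttext
import numpy as np

model = fasttext.load_model("fasttext-bo-vectors.bin")  # placeholder path to the Tibetan model

def segment_vector(tokens):
    # Average the word vectors; the app additionally applies TF-IDF weighting.
    vectors = [model.get_word_vector(tok) for tok in tokens]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.get_dimension())

def semantic_similarity(tokens_a, tokens_b):
    va, vb = segment_vector(tokens_a), segment_vector(tokens_b)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(np.dot(va, vb) / denom) if denom else 0.0
```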
TF-IDF Cosine Similarity
This metric calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, optionally filtering out common Tibetan stopwords.
TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments. This helps identify terms that are characteristic or discriminative for a segment. When stopword filtering is enabled, the TF-IDF scores better reflect genuinely significant terms by excluding common particles and function words.
Each segment is represented as a vector of these TF-IDF scores, and the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more important, distinguishing terms, suggesting they cover similar specific topics or themes.
Stopword Filtering: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out. This can be toggled on/off to compare results with and without stopwords.
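A minimal sketch of this metric, assuming segments arrive as botok token lists: an identity analyzer keeps those tokens intact instead of letting scikit-learn re-tokenize the Tibetan text. Stopword filtering, when enabled, would simply remove particles from each token list beforehand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_cosine(segments_as_token_lists):
    # Each "document" is already a list of tokens, so the analyzer just
    # passes it through unchanged.
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
    matrix = vectorizer.fit_transform(segments_as_token_lists)
    return cosine_similarity(matrix)  # pairwise matrix, values in [0, 1]
```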
Word Counts per Segment
This chart displays the number of words in each segment of your texts after tokenization.
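The counts come from the same botok tokenization used by the metrics above. A minimal sketch follows; the app's exact handling of punctuation and whitespace tokens may differ.

```python
from botok import WordTokenizer

wt = WordTokenizer()

def word_count(segment_text: str) -> int:
    # Count the tokens botok produces for one segment.
    return len(wt.tokenize(segment_text))
```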