Tibetan Text Metrics Web App
A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts, providing a graphical interface to the core functionalities of the Tibetan Text Metrics (TTM) project. Powered by Mistral 7B via OpenRouter for advanced text analysis.
Step 1: Upload Your Tibetan Text Files
Upload one or more .txt files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible using the marker '༈' (sbrul shad).
Note: Maximum file size: 10MB per file. For optimal performance, use files under 1MB.
Step 2: Configure and run the analysis
Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using the marker '༈' (sbrul shad). The tool will split files based on this marker.
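For reference, splitting on the sbrul shad marker amounts to a simple text split. The sketch below is illustrative only; the file name is hypothetical, and the app's own splitting may handle whitespace or empty segments differently.

```python
# Illustrative sketch: split a Tibetan text file into segments on the
# sbrul shad marker '༈'. Not the app's exact implementation.
from pathlib import Path

def split_into_segments(path: str) -> list[str]:
    text = Path(path).read_text(encoding="utf-8")
    # Split on the marker and drop empty or whitespace-only pieces.
    return [seg.strip() for seg in text.split("༈") if seg.strip()]

segments = split_into_segments("my_text.txt")  # hypothetical file name
print(f"{len(segments)} segments found")
```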
Using Facebook's pre-trained FastText model for semantic similarity. Other model options have been removed.
Choose how aggressively to filter out common Tibetan particles and function words when calculating similarity. This helps focus on meaningful content words.
Results
Similarity Metrics Preview
AI Analysis
The AI will analyze your text similarities and provide insights into patterns and relationships. Make sure to set up your OpenRouter API key for this feature.
AI-Powered Analysis
The AI analysis is powered by Mistral 7B Instruct via the OpenRouter API. To use this feature:
- Get an API key from OpenRouter
- Create a .env file in the webapp directory
- Add: OPENROUTER_API_KEY=your_api_key_here
The AI will automatically analyze your text similarities and provide insights into patterns and relationships.
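For orientation, a request to OpenRouter's OpenAI-compatible chat completions endpoint looks roughly like the sketch below. The prompt and exact model slug used by the app may differ; treat this as an illustration of the setup, not the app's code.

```python
# Rough sketch of an OpenRouter call with Mistral 7B Instruct.
# The prompt here is illustrative; the app builds its own prompt
# from the computed similarity metrics.
import os
import requests

API_URL = "https://openrouter.ai/api/v1/chat/completions"

def analyze_metrics(summary_text: str) -> str:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": "mistralai/mistral-7b-instruct",
            "messages": [
                {
                    "role": "user",
                    "content": "Interpret these Tibetan text similarity metrics:\n" + summary_text,
                }
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```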
Analysis of Tibetan Text Similarity Metrics
Click the 'Help Interpret Results' button above to generate an AI-powered analysis of your similarity metrics.
Detailed Metric Analysis
Jaccard Similarity (%)
This metric quantifies the lexical overlap between two text segments by comparing their sets of unique words, optionally filtering out common Tibetan stopwords.
It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as (Number of common unique words) / (Total number of unique words in both texts combined) * 100.
Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
Stopword Filtering: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before comparison. This helps focus on meaningful content words rather than grammatical elements.
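The calculation itself is small. The sketch below assumes segments have already been tokenized into word lists (e.g. with botok); the stopword set shown is a tiny illustrative subset, not the app's actual list.

```python
# Jaccard similarity over unique-word sets, reported as a percentage.
# STOPWORDS below is an illustrative subset, not the app's full list.
STOPWORDS = {"གི", "ཀྱི", "གྱི", "ནི", "དང"}

def jaccard_similarity(tokens_a, tokens_b, filter_stopwords=True):
    set_a, set_b = set(tokens_a), set(tokens_b)
    if filter_stopwords:
        set_a -= STOPWORDS
        set_b -= STOPWORDS
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union) * 100
```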
Normalized LCS (Longest Common Subsequence)
This metric measures the length of the longest sequence of words that appears in both text segments, maintaining their original relative order. Importantly, these words do not need to be directly adjacent (contiguous) in either text. For example, if Text A is 'the quick brown fox jumps' and Text B is 'the lazy cat and brown dog jumps high', the LCS is 'the brown jumps'. The length of this common subsequence is then normalized (in this tool, by dividing by the length of the longer of the two segments) to provide a score, which is then presented as a percentage. A higher Normalized LCS score suggests more significant shared phrasing, direct textual borrowing, or strong structural parallelism, as it reflects similarities in how ideas are ordered and expressed sequentially.
No Stopword Filtering. Unlike metrics such as Jaccard Similarity or TF-IDF Cosine Similarity (which typically filter out common stopwords to focus on content-bearing words), the LCS calculation in this tool intentionally uses the raw, unfiltered sequence of tokens from your texts. This design choice allows LCS to capture structural similarities and the flow of language, including the use of particles and common words that contribute to sentence construction and narrative sequence. By not removing stopwords, LCS can reveal similarities in phrasing and textual structure that might otherwise be obscured, making it a valuable complement to metrics that focus purely on lexical overlap of keywords.
Note on Interpretation: It is possible for Normalized LCS to be higher than Jaccard Similarity. This often happens when texts share a substantial 'narrative backbone' or common ordered phrases (leading to a high LCS), even if they use varied surrounding vocabulary or introduce many unique words not part of these core sequences (which would lower the Jaccard score). LCS highlights this sequential, structural similarity, while Jaccard focuses on the overall shared vocabulary regardless of its arrangement.
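For completeness, here is a minimal sketch of the computation described above: a standard dynamic-programming LCS over the raw token sequences, normalized by the length of the longer segment and reported as a percentage. It is illustrative, not the app's exact code.

```python
def normalized_lcs(tokens_a, tokens_b):
    m, n = len(tokens_a), len(tokens_b)
    if m == 0 or n == 0:
        return 0.0
    # dp[i][j] = LCS length of the first i tokens of A and first j tokens of B
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if tokens_a[i - 1] == tokens_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / max(m, n) * 100
```

On the English example above, the LCS 'the brown jumps' has length 3 and the longer text has 8 words, giving 3/8 = 37.5%.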
Semantic Similarity
Computes the cosine similarity between semantic embeddings of text segments:
FastText Model: Uses the official Facebook FastText Tibetan model (facebook/fasttext-bo-vectors) pre-trained on a large corpus of Tibetan text. Falls back to a custom model only if the official model cannot be loaded.
- Processes Tibetan text using botok tokenization (same as other metrics)
- Uses the pre-tokenized words from botok rather than doing its own tokenization
- Better for texts with specialized Tibetan vocabulary
- More stable results for general Tibetan text comparison
- Optimized for Tibetan language with:
  - Word-based tokenization preserving Tibetan syllable markers
  - TF-IDF weighted averaging for word vectors (distinct from the TF-IDF Cosine Similarity metric)
  - Enhanced parameters based on Tibetan NLP research
Stopword Filtering: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before computing embeddings. This helps focus on meaningful content words.
Note: This metric works best when combined with other metrics for a more comprehensive analysis.
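As a rough sketch of the underlying idea, the example below averages the FastText vectors of each segment's pre-tokenized words and takes the cosine similarity. The TF-IDF weighting of word vectors and the stopword filtering described above are omitted for brevity, and the model path is a placeholder.

```python
import fasttext
import numpy as np

model = fasttext.load_model("fasttext-bo-vectors.bin")  # placeholder path to the Tibetan model

def segment_vector(tokens):
    # Average the word vectors; the app additionally applies TF-IDF weighting.
    vectors = [model.get_word_vector(tok) for tok in tokens]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.get_dimension())

def semantic_similarity(tokens_a, tokens_b):
    va, vb = segment_vector(tokens_a), segment_vector(tokens_b)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(np.dot(va, vb) / denom) if denom else 0.0
```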
TF-IDF Cosine Similarity
This metric calculates Term Frequency-Inverse Document Frequency (TF-IDF) scores for each word in each text segment, optionally filtering out common Tibetan stopwords.
TF-IDF gives higher weight to words that are frequent within a particular segment but relatively rare across the entire collection of segments. This helps identify terms that are characteristic or discriminative for a segment. When stopword filtering is enabled, the TF-IDF scores better reflect genuinely significant terms by excluding common particles and function words.
Each segment is represented as a vector of these TF-IDF scores, and the cosine similarity is computed between these vectors. A score closer to 1 indicates that the two segments share more important, distinguishing terms, suggesting they cover similar specific topics or themes.
Stopword Filtering: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out. This can be toggled on/off to compare results with and without stopwords.
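A minimal sketch of this metric, assuming segments arrive as botok token lists: an identity analyzer keeps those tokens intact instead of letting scikit-learn re-tokenize the Tibetan text. Stopword filtering, when enabled, would simply remove particles from each token list beforehand.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_cosine(segments_as_token_lists):
    # Each "document" is already a list of tokens, so the analyzer just
    # passes it through unchanged.
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
    matrix = vectorizer.fit_transform(segments_as_token_lists)
    return cosine_similarity(matrix)  # pairwise matrix, values in [0, 1]
```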
Word Counts per Segment
This chart displays the number of words in each segment of your texts after tokenization.
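The counts come from the same botok tokenization used by the metrics above. A minimal sketch follows; the app's exact handling of punctuation and whitespace tokens may differ.

```python
from botok import WordTokenizer

wt = WordTokenizer()

def word_count(segment_text: str) -> int:
    # Count the tokens botok produces for one segment.
    return len(wt.tokenize(segment_text))
```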