Tibetan Text Metrics Web App

A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts. It provides a graphical interface to the core functionality of the Tibetan Text Metrics (TTM) project. The optional AI interpretation of results uses language models accessed via OpenRouter.

Step 1: Upload Your Tibetan Text Files

Upload two or more .txt files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible using the marker '༈' (sbrul shad).

Note: The maximum file size is 10 MB per file. For best performance, use files under 1 MB.

Step 2: Configure and Run the Analysis

Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using the marker '༈' (sbrul shad). The tool will split files based on this marker.
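The splitting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name split_into_segments is hypothetical, and dropping empty pieces and treating a file without any marker as a single segment are assumptions.

```python
# Minimal sketch: split a text into segments on the sbrul shad marker.
# split_into_segments is a hypothetical helper, not part of the TTM code base.

SECTION_MARKER = "\u0f08"  # TIBETAN MARK SBRUL SHAD, shown in the UI as '༈'


def split_into_segments(text: str, marker: str = SECTION_MARKER) -> list[str]:
    """Split text on the marker, trimming whitespace and dropping empty pieces.

    Assumption: a text with no marker comes back as one segment.
    """
    return [part.strip() for part in text.split(marker) if part.strip()]


sample = SECTION_MARKER + " བཀྲ་ཤིས་བདེ་ལེགས། " + SECTION_MARKER + " ཐུགས་རྗེ་ཆེ།"
segments = split_into_segments(sample)
print(len(segments), "segments")
```

A file that starts with the marker produces an empty first piece, which is why the sketch filters out empty strings.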

Compute semantic similarity? (Experimental)

Computing semantic similarity can be slow. Choose 'No' to speed up the analysis if you do not need these metrics.

Select Embedding Model

Select the embedding model to use for semantic similarity analysis. Only Hugging Face sentence-transformers models are supported.


Display a progress bar during embedding generation. Useful for large datasets.

Stopword Filtering

Choose how aggressively to filter out common Tibetan particles and function words when calculating similarity. This helps focus on meaningful content words.

Enable Fuzzy String Matching

Fuzzy matching helps detect similar but not identical text segments. Useful for identifying variations and modifications.

Fuzzy Matching Method

Select the fuzzy matching algorithm to use:

• token_set: Best for texts with different word orders and partial overlaps. Compares unique words regardless of their order (recommended for Tibetan texts).

• token_sort: Good for texts with different word orders but similar content. Sorts words alphabetically before comparing.

• partial: Best for finding shorter strings within longer ones. Useful when one text is a fragment of another.

• ratio: Simple Levenshtein distance ratio. Best for detecting small edits and typos in otherwise identical texts.
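To make the difference between two of these methods concrete, here is a rough sketch using only the Python standard library. It covers ratio and token_sort only. Two things are assumptions made for illustration: difflib.SequenceMatcher stands in for a true Levenshtein ratio (its scores differ from Levenshtein-based ones), and splitting on the tsheg (U+0F0B) is a crude syllable-level substitute for real Tibetan word tokenization. The function names are hypothetical and are not the app's API.

```python
from difflib import SequenceMatcher

TSHEG = "\u0f0b"  # Tibetan syllable delimiter, used here as a crude token separator


def ratio(a: str, b: str) -> float:
    """Character-level similarity on a 0-100 scale (difflib stand-in, not Levenshtein)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def token_sort_ratio(a: str, b: str, sep: str = TSHEG) -> float:
    """Sort the tokens of each string before comparing, so order no longer matters."""
    sorted_a = sep.join(sorted(a.split(sep)))
    sorted_b = sep.join(sorted(b.split(sep)))
    return ratio(sorted_a, sorted_b)


# The same four syllables in a different order.
text_a = TSHEG.join(["བཀྲ", "ཤིས", "བདེ", "ལེགས"])
text_b = TSHEG.join(["བདེ", "ལེགས", "བཀྲ", "ཤིས"])

print("ratio:     ", round(ratio(text_a, text_b), 1))
print("token_sort:", round(token_sort_ratio(text_a, text_b), 1))
```

Because both strings contain the same syllables, the sorted comparison scores them as identical, while the plain comparison is penalized for the reordering.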

Results

Similarity Metrics Preview


AI Analysis

The AI will analyze your text similarities and provide insights into patterns and relationships.

Analysis of Tibetan Text Similarity Metrics

Click the 'Help Interpret Results' button above to generate an AI-powered analysis of your similarity metrics.

Detailed Metric Analysis

Jaccard Similarity (%)

This metric quantifies the lexical overlap between two text segments by comparing their sets of unique words, optionally filtering out common Tibetan stopwords.

It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion of them are present in both?' It is calculated as (Number of common unique words) / (Total number of unique words in both texts combined) * 100.

Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.

Stopword Filtering: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before comparison. This helps focus on meaningful content words rather than grammatical elements.
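The calculation described above, including the optional stopword filter, can be sketched as follows. This is an illustrative sketch, not the tool's implementation: jaccard_percent is a hypothetical name, the inputs are assumed to be already tokenized into words, the one-item stopword set is invented for the example, and returning 0.0 when both segments are empty is an assumption.

```python
from collections.abc import Iterable


def jaccard_percent(
    tokens_a: Iterable[str],
    tokens_b: Iterable[str],
    stopwords: frozenset[str] = frozenset(),
) -> float:
    """Jaccard similarity as a percentage: shared unique words / all unique words * 100."""
    set_a = set(tokens_a) - stopwords  # sets ignore word order and frequency
    set_b = set(tokens_b) - stopwords
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty segments score 0 rather than raising
    return 100 * len(set_a & set_b) / len(union)


# Hypothetical pre-tokenized segments; "ཀྱི" is a common genitive particle.
segment_a = ["སངས་རྒྱས", "ཀྱི", "ཆོས"]
segment_b = ["ཆོས", "ཀྱི", "དགེ་འདུན"]

unfiltered = jaccard_percent(segment_a, segment_b)
filtered = jaccard_percent(segment_a, segment_b, stopwords=frozenset({"ཀྱི"}))
print(unfiltered, filtered)
```

Without filtering, the shared particle counts toward the overlap (2 shared of 4 unique words). With the particle filtered out, only one of three unique content words is shared, so the score drops.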
