Tibetan Text Metrics Web App

A user-friendly web application for analyzing textual similarities and variations in Tibetan manuscripts. It provides a graphical interface to the core functionality of the Tibetan Text Metrics (TTM) project. The optional AI interpretation of results is powered by Mistral 7B via OpenRouter.

Step 1: Upload Your Tibetan Text Files

Upload one or more .txt files. Each file should contain Unicode Tibetan text, segmented into chapters/sections if possible using the marker '༈' (sbrul shad).

Note: The maximum file size is 10 MB per file. For best performance, use files under 1 MB.

Step 2: Configure and Run the Analysis

Choose your analysis options and click the button below to compute metrics and view results. For meaningful analysis, ensure your texts are segmented by chapter or section using the marker '༈' (sbrul shad). The tool will split files based on this marker.
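As a rough illustration of the splitting described above, here is a minimal sketch of dividing a text on the '༈' marker. The function name split_into_segments is hypothetical and the tool's actual implementation may differ; the sketch only assumes that empty pieces are dropped and that a file without the marker is treated as a single segment.

```python
from typing import List

# U+0F08 TIBETAN MARK SBRUL SHAD, the section marker '༈'
SECTION_MARKER = "\u0f08"


def split_into_segments(text: str) -> List[str]:
    """Split text on the section marker and drop empty pieces.

    Hypothetical helper for illustration; not the tool's own code.
    """
    parts = (part.strip() for part in text.split(SECTION_MARKER))
    return [part for part in parts if part]


# Two sections, each introduced by the marker.
sample = "༈ བཀྲ་ཤིས་བདེ་ལེགས། ༈ ཐུགས་རྗེ་ཆེ།"
segments = split_into_segments(sample)
print(len(segments))  # number of non-empty segments found
```

A file that contains no marker comes back as one segment, so segment-level comparison only becomes meaningful once the markers are in place.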

Compute semantic similarity? (Experimental)

Computing semantic similarity can be time-consuming. Choose 'No' to speed up the analysis if you do not need these metrics.

Select Embedding Model

Semantic similarity is computed with Facebook's pre-trained FastText model. Other model options have been removed.

Stopword Filtering

Choose how aggressively to filter out common Tibetan particles and function words when calculating similarity. This helps focus on meaningful content words.
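To make the idea of filtering levels concrete, here is a minimal sketch. The level names ("none", "standard", "aggressive"), the short particle lists, and the function filter_stopwords are all assumptions made for illustration; the tool's real stopword lists and option names are not given here and are likely longer and different.

```python
from typing import Iterable, List

# Hypothetical stopword levels; the tool's real lists may differ.
STANDARD_STOPWORDS = frozenset({"གི", "ཀྱི", "གྱི", "ནི", "དང"})
AGGRESSIVE_STOPWORDS = STANDARD_STOPWORDS | frozenset({"ལ", "ན", "ཡང", "ཏེ"})

STOPWORD_LEVELS = {
    "none": frozenset(),
    "standard": STANDARD_STOPWORDS,
    "aggressive": AGGRESSIVE_STOPWORDS,
}


def filter_stopwords(tokens: Iterable[str], level: str = "standard") -> List[str]:
    """Return the tokens that are not stopwords at the given level."""
    stopwords = STOPWORD_LEVELS[level]
    return [token for token in tokens if token not in stopwords]


tokens = ["ཆོས", "ནི", "སངས", "རྒྱས", "ལ"]
print(filter_stopwords(tokens, "none"))        # all five tokens kept
print(filter_stopwords(tokens, "standard"))    # drops "ནི"
print(filter_stopwords(tokens, "aggressive"))  # drops "ནི" and "ལ"
```

The more aggressive the level, the fewer grammatical tokens remain to inflate the overlap between two segments.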

Results

Similarity Metrics Preview

AI Analysis

The AI will analyze your text similarities and provide insights into patterns and relationships. Make sure to set up your OpenRouter API key for this feature.

AI-Powered Analysis

The AI analysis is powered by Mistral 7B Instruct via the OpenRouter API. To use this feature:

  1. Get an API key from OpenRouter
  2. Create a .env file in the webapp directory
  3. Add: OPENROUTER_API_KEY=your_api_key_here
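The setup steps above can be checked with a small script. This is a sketch using only the standard library, not the app's own loading code (which may rely on a package such as python-dotenv); the helper names read_env_file and get_openrouter_key are hypothetical, and the parser handles only simple KEY=value lines.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def read_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=value lines from a .env file (illustrative only)."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_openrouter_key(env_path: str = ".env") -> Optional[str]:
    """Prefer the process environment, then fall back to the .env file."""
    return (
        os.environ.get("OPENROUTER_API_KEY")
        or read_env_file(env_path).get("OPENROUTER_API_KEY")
    )


if __name__ == "__main__":
    # Report only whether a key was found; never print the key itself.
    print("API key configured:", get_openrouter_key() is not None)
```

Run it from the webapp directory so that the relative path .env resolves to the file created in step 2.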

Analysis of Tibetan Text Similarity Metrics

Click the 'Help Interpret Results' button above to generate an AI-powered analysis of your similarity metrics.

Detailed Metric Analysis

Jaccard Similarity (%)

This metric quantifies the lexical overlap between two text segments by comparing their sets of unique words, optionally filtering out common Tibetan stopwords.

It essentially answers the question: 'Of all the distinct words found across these two segments, what proportion are present in both?' It is calculated as (Number of unique words that appear in both segments) / (Number of unique words that appear in either segment) * 100. In set terms, this is the size of the intersection of the two vocabularies divided by the size of their union.

Jaccard Similarity is insensitive to word order and word frequency; it only cares whether a unique word is present or absent. A higher percentage indicates a greater overlap in the vocabularies used in the two segments.
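The calculation can be sketched in a few lines. This is a minimal illustration that assumes each segment has already been tokenized into a list of words; the function name jaccard_similarity and the choice to return 0.0 for two empty segments are assumptions, not necessarily what the tool does.

```python
from typing import Iterable


def jaccard_similarity(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Jaccard similarity of two token sequences, as a percentage."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty segments are reported as 0% similar.
        return 0.0
    return len(set_a & set_b) / len(union) * 100


# "དང" is repeated in the first segment; repetition does not affect the score.
segment_a = ["སངས", "རྒྱས", "ཆོས", "དང", "ཚོགས", "དང"]
segment_b = ["དང", "སངས", "རྒྱས", "བྱང", "ཆུབ"]

# 3 shared words out of 7 distinct words in total.
print(round(jaccard_similarity(segment_a, segment_b), 1))
```

Because both inputs are converted to sets, reordering the tokens or repeating a word leaves the result unchanged, which is the insensitivity to order and frequency described above.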

Stopword Filtering: When enabled (via the "Filter Stopwords" checkbox), common Tibetan particles and function words are filtered out before comparison. This helps focus on meaningful content words rather than grammatical elements.