# Research: Open-Source Latin and Classical Greek GitHub Projects

**Prepared by:** Conductor AI Agent · conductor@nerdbox.com
**Date:** 2026-03-30 · **Queries run:** 5 · **Sources read:** 12 · **Confidence:** high

---

## Executive Summary

The open-source classical language ecosystem is thriving, anchored by three major initiatives: the Classical Language Toolkit (CLTK), the Open Greek and Latin Project (corpus + tools), and recent transformer-based models from Heidelberg NLP. Morphological parsing is dominated by Whitaker's Words (Latin, actively maintained via open_words), Morpheus (Greek/Latin, legacy but functional), and CLTK's integrated lemmatizers. The only dormant project worth reviving is ClassicsReaderAndroid, which could be modernized with current backends. Most other projects are either actively maintained or dormant for clear reasons (deprecated ML frameworks, incomplete scope). For Lector integration, Whitaker's Words (open_words) and CLTK provide the most production-ready morphology; Heidelberg's PhilBERTa and GreBERTa models are strong candidates for semantic and syntactic tasks, although the October 2024 benchmark (Celano) found Trankit best for Ancient Greek syntax.

---

## Key Findings

### 1. The Classical Language Toolkit (CLTK) — The Central Hub

**Status:** ✅ Active, v1.0+ (2021 ACL publication, ongoing maintenance)

**Scope:** Python NLP framework for ~20 pre-modern languages including Latin, Ancient Greek, Coptic, Old English, Sanskrit
- Modular pipeline architecture: segmentation → tokenization → lemmatization → morphology → dependency parsing
- Multiple backends: Stanford Stanza, OpenAI (GenAI), local Ollama (llama3.1:8b default, configurable)
- Supports both classical rule-based lemmatizers and modern neural backends
- Microservice-capable (HTTP API + text-processing backend + JS frontend)

**For Lector:** CLTK is the go-to integration target if Lector needs lemmatization or broader NLP beyond morphology. Avoids reinventing lemmatizers. Modular design means you can replace specific components (e.g., use Whitaker's parser + CLTK pipeline wrapper).

**Confidence:** High · **Sources:** GitHub README, ACL publication (2021), Harvard Classics journal (2017)
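
**Illustrative sketch:** The swappable-component idea can be shown without CLTK itself. This is not CLTK's API: `Doc`, `Pipeline`, and the stage functions are hypothetical names, and the toy lemma table merely stands in for a real backend such as Whitaker's Words.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Container handed from stage to stage (hypothetical, not CLTK's Doc)."""
    text: str
    tokens: List[str] = field(default_factory=list)
    lemmata: List[str] = field(default_factory=list)


Stage = Callable[[Doc], Doc]


def tokenize(doc: Doc) -> Doc:
    # Naive whitespace tokenizer that strips surrounding punctuation.
    doc.tokens = [t.strip(".,;:!?") for t in doc.text.split()]
    return doc


# Toy lookup table standing in for a real lemmatizer backend.
TOY_LEMMATA: Dict[str, str] = {"arma": "arma", "virumque": "vir", "cano": "cano"}


def toy_lemmatize(doc: Doc) -> Doc:
    # Unknown forms fall back to the lowercased surface form.
    doc.lemmata = [TOY_LEMMATA.get(t.lower(), t.lower()) for t in doc.tokens]
    return doc


def lowercase_lemmatize(doc: Doc) -> Doc:
    # Alternative stage used to demonstrate swapping a component.
    doc.lemmata = [t.lower() for t in doc.tokens]
    return doc


class Pipeline:
    """Runs named stages in insertion order; any stage can be replaced."""

    def __init__(self, stages: Dict[str, Stage]) -> None:
        self.stages = dict(stages)

    def replace(self, name: str, stage: Stage) -> None:
        if name not in self.stages:
            raise KeyError(f"unknown stage: {name}")
        self.stages[name] = stage  # position in the order is preserved

    def run(self, text: str) -> Doc:
        doc = Doc(text=text)
        for stage in self.stages.values():
            doc = stage(doc)
        return doc


pipeline = Pipeline({"tokenize": tokenize, "lemmatize": toy_lemmatize})
doc = pipeline.run("Arma virumque cano.")
print(doc.tokens, doc.lemmata)

# Swap only the lemmatizer; the rest of the pipeline is untouched.
pipeline.replace("lemmatize", lowercase_lemmatize)
swapped = pipeline.run("Arma virumque cano.")
print(swapped.lemmata)
```

Keeping each stage a plain function over a shared document object is one way to let Lector mix rule-based and neural components without changing calling code.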

---

### 2. Morphological Parsing Tools — A Three-Layer Landscape

#### **Whitaker's Words (open_words) — Latin Specialist**

**Status:** ✅ Active (Python port of the Ada original, continuously updated)
- **Original:** Ada version by William Whitaker (legacy, documented on GitHub at dsanson/Words)
- **Modern port:** ArchimedesDigital/open_words (Python, CLI + library API)
- **Capability:** Latin word forms → lemma + PoS + full morphology (case, number, gender, tense, mood, voice)
- **Key feature:** Returns ALL possible analyses (ambiguity-aware), not just a single guess
- **Integration:** Pure Python, pip-installable, used by latindictionary.io and latin-dictionary.net

**For Lector:** If Lector is Latin-focused, this is the de facto standard: mature and widely used. Consider forking or wrapping it rather than building a competing morphology engine.

**Confidence:** High · **Sources:** GitHub repo, Reddit r/latin (2023), Wake Forest Classics tools page
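
**Illustrative sketch:** "Ambiguity-aware" means one surface form maps to a list of analyses rather than a single answer. The example below shows that shape with a hand-written toy lexicon. It does not call open_words; `Analysis`, `TOY_LEXICON`, and `analyze` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Analysis:
    lemma: str
    pos: str
    features: str


# Hand-written entries; a real analyzer derives these from stems and endings.
TOY_LEXICON: Dict[str, List[Analysis]] = {
    "rosae": [
        Analysis("rosa", "NOUN", "genitive singular feminine"),
        Analysis("rosa", "NOUN", "dative singular feminine"),
        Analysis("rosa", "NOUN", "nominative plural feminine"),
        Analysis("rosa", "NOUN", "vocative plural feminine"),
    ],
    "amor": [
        Analysis("amor", "NOUN", "nominative singular masculine"),
        Analysis("amo", "VERB", "present passive indicative, 1st person singular"),
    ],
}


def analyze(form: str) -> List[Analysis]:
    """Return every known analysis of a form; an empty list if unknown."""
    return list(TOY_LEXICON.get(form.lower(), []))


for analysis in analyze("amor"):
    print(f"{analysis.lemma:<6} {analysis.pos:<5} {analysis.features}")
```

Returning the full list leaves disambiguation to a later stage (context, a tagger, or the reader), which is why a wrapper around such a tool should not collapse results to the first hit.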

---

#### **Morpheus (Perseus Digital Library)**

**Status:** ⚠️ Maintained but slow-moving (C/C++ implementation, last significant updates ~2023)
- **Capability:** Ancient Greek + Latin morphological analysis (lemmatization + parsing)
- **Historical role:** The reference tool in classics; powers many legacy systems
- **Availability:** Docker image, source code on GitHub, XML-RPC web service available
- **Limitation:** Little active development; superseded in some contexts by ML-based approaches

**For Lector:** Reliable fallback. Slower than Whitaker's Words. Not the primary choice for new projects unless Greek is the core focus and you need historical compatibility.

**Confidence:** High · **Sources:** GitHub (perseids-tools/morpheus), Digital Classicist Wiki

---

#### **Recent State-of-the-Art: Trankit, GreTA, PhilTa (Oct 2024)**

**Status:** ✅ Preprint posted October 2024 (arXiv:2410.12055)
- **Comparison study:** Six models evaluated on Ancient Greek dependency treebanks
- **Winners by metric:**
  - **Morphology:** Dithrax and Trankit perform practically equivalently (both high accuracy)
  - **Syntax (UAS/LAS):** Trankit best
  - **Lemmatization:** GreTA > PhilTa > others
- **Key insight:** Token embeddings alone are insufficient for high UAS/LAS; syntax requires a task-specific architecture
- **Dataset:** Normalized version of AGDT, First1KGreek, GLAUx corpus

**For Lector:** If you need state-of-the-art morphosyntactic parsing, Trankit (with Ancient Greek treebank fine-tuning) is the reference. GreTA dominates lemmatization. These are research-grade, not yet plug-and-play like Whitaker's Words.

**Confidence:** High · **Sources:** ArXiv (Celano, 2024), ai-models.fyi summary
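
**Illustrative sketch:** UAS (unlabeled attachment score) is the share of tokens whose predicted head is correct; LAS (labeled attachment score) additionally requires the correct dependency label. The function below computes both for a single sentence. The arcs and labels are invented toy data, not figures from the study.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for token-aligned gold and predicted arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # UAS counts matching heads only; LAS needs head and label to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), full_hits / len(gold)


gold = [(2, "SBJ"), (0, "PRED"), (2, "OBJ"), (3, "ATR")]
pred = [(2, "SBJ"), (0, "PRED"), (2, "ADV"), (2, "ATR")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # UAS=0.75 LAS=0.50
```

Token 3 has the right head but the wrong label, so it counts toward UAS only; token 4 has the wrong head and counts toward neither.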

---

### 3. Transformer Models — The ML Frontier

#### **Heidelberg NLP Suite (2023–2025)**

**Status:** ✅ Active, published ACL papers, HuggingFace models released
- **Models:** GreBERTa, LaBERTa (BERT-style, encoder-only) + GreTA, LaTa (T5-style, encoder-decoder)
- **Capabilities:** PoS tagging, lemmatization, dependency parsing, **cross-lingual allusion detection (SPhilBERTa)**
- **Novel feature:** SPhilBERTa identifies Latin → Ancient Greek literary references (sentence-level embedding alignment)
- **Availability:** Pre-trained weights on HuggingFace, code + pipelines on GitHub

**For Lector:** If Lector targets scholars interested in intertextuality or allusions, SPhilBERTa is a **unique, cutting-edge feature** not available elsewhere. For routine morphology, GreTA/GreBERTa are competitive with Trankit but more accessible (HuggingFace integration).

**Confidence:** High · **Sources:** Heidelberg GitHub, ACL Anthology (2023), paper citations
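
**Illustrative sketch:** Sentence-level embedding alignment is commonly used by ranking candidate sentences by vector similarity to a query sentence. The example assumes cosine similarity as the measure, which the sources above do not confirm for SPhilBERTa. The three-dimensional vectors are invented toy values, not model outputs, and the sentence names are placeholders.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def rank_candidates(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Sort candidates from most to least similar to the query."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings: one Latin query sentence and three Greek candidates.
latin_query = [0.9, 0.1, 0.3]
greek_candidates = {
    "greek_sentence_a": [0.8, 0.2, 0.4],
    "greek_sentence_b": [0.1, 0.9, 0.0],
    "greek_sentence_c": [0.3, 0.3, 0.9],
}

ranking = rank_candidates(latin_query, greek_candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In a real integration the vectors would come from the model, and the top-ranked pairs would be candidates for a scholar to review rather than confirmed allusions.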

---

### 4. Corpora & Treebanks — The Data Layer

#### **Open Greek and Latin Project (OGL)**

**Status:** ✅ Very active (updated Feb 2026)
- **First1KGreek:** Machine-corrected XML of ~1000 years of Greek texts (XSLT processing pipeline)
- **CSEL (Corpus Scriptorum Ecclesiasticorum Latinorum):** Latin Church Fathers, machine-corrected
- **Additional tracks:** English, German, Italian translations of same texts
- **Annotation:** TEI XML, lemmas, PoS tags, dependency annotations (machine-generated, manually reviewed)

**For Lector:** Ready-made text base. Consider integrating First1KGreek or CSEL as the corpus backend (rather than building your own text collection). TEI XML is the academic standard.

**Confidence:** High · **Sources:** GitHub org, Digital Classicist Wiki, OGL website
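
**Illustrative sketch:** Reading annotated TEI XML needs only a namespace-aware XML parser. The example assumes token-level `<w>` elements carrying `lemma` and `pos` attributes, which TEI permits; actual OGL files may be structured differently or carry no token annotation at all. The embedded sample document and its `n` values are invented.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# TEI documents declare this default namespace, so lookups must use it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented fragment for illustration; not taken from any OGL repository.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="edition" n="hypothetical-urn">
      <p n="1">
        <w lemma="λόγος" pos="noun">λόγον</w>
        <w lemma="λέγω" pos="verb">λέγει</w>
      </p>
    </div>
  </body></text>
</TEI>"""


def extract_tokens(xml_text: str) -> List[Tuple[str, str, str]]:
    """Return (form, lemma, pos) for every <w> element in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for w in root.iterfind(".//tei:w", TEI_NS):
        rows.append((w.text or "", w.get("lemma", ""), w.get("pos", "")))
    return rows


for form, lemma, pos in extract_tokens(SAMPLE):
    print(form, lemma, pos)
```

Forgetting the namespace is the usual reason such lookups return nothing, so it is worth checking a file's root element before writing extraction code.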

---

#### **Perseus Dependency Treebank & Universal Dependencies**

**Status:** ✅ Maintained (UD covers Greek, Latin, Sanskrit, Old English)
- **Size:** Hundreds of thousands of annotated tokens
- **Format:** CoNLL-U (Universal Dependencies format)
- **Coverage:** Classical, Hellenistic, New Testament (Greek); Classical and Vulgate Latin
- **Use case:** Training and evaluation of parsers

**For Lector:** Training data source if you're building or fine-tuning morphosyntactic models.

**Confidence:** High · **Sources:** perseusdl.github.io, UD website, Digital Classicist Wiki
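
**Illustrative sketch:** CoNLL-U is a plain-text format with one token per line, ten tab-separated columns, `#` comment lines, and a blank line between sentences. The minimal reader below is a sketch under those rules, not a replacement for a dedicated library; the sample sentence and its annotation were written by hand for illustration.

```python
from typing import Dict, List

FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Hand-written toy sentence, not taken from a published treebank.
SAMPLE = "\n".join([
    "# sent_id = toy-1",
    "# text = Puella rosam amat",
    "1\tPuella\tpuella\tNOUN\t_\tCase=Nom|Gender=Fem|Number=Sing\t3\tnsubj\t_\t_",
    "2\trosam\trosa\tNOUN\t_\tCase=Acc|Gender=Fem|Number=Sing\t3\tobj\t_\t_",
    "3\tamat\tamo\tVERB\t_\tMood=Ind|Number=Sing|Person=3|Tense=Pres\t0\troot\t_\t_",
    "",
])


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():  # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):  # sentence-level comment or metadata
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if "-" in cols[0] or "." in cols[0]:
            continue
        current.append(dict(zip(FIELDS, cols)))
    if current:  # input may lack a trailing blank line
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["head"], token["deprel"])
```

All values are kept as strings, so `head` needs converting to an integer before it is used as an index.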

---

### 5. Abandoned or Dormant Projects — Revival Potential

#### **ClassicsReaderAndroid (last update 2019)**

**Status:** 🟡 Dormant (likely left unfinished because of Android development complexity)
- **Concept:** Mobile app for reading Greek/Latin with built-in glosses
- **Architecture:** Android front-end + data layer
- **Tech stack:** Android Java (older, probably Android 5–7 era)
- **Why abandoned:** No evidence of maintenance; Android devtools and platforms have evolved significantly

**Revival potential:** MEDIUM-HIGH — mobile reading apps are perennially useful. Modern approach: Android Kotlin + backend API (CLTK + Whitaker's Words + OGL corpus). Could be a standalone product if modernized.

**Confidence:** Medium · **Sources:** GitHub repo (telpirion/ClassicsReaderAndroid, last commit ~2019)

---

#### **Lector (OCR tool, name collision)**

**Status:** ⛔️ Abandoned (exported from Google Code, no activity)
- **Original purpose:** Tesseract-based OCR + PyQt4 dictionary lookup
- **Why irrelevant:** OCR has largely moved to cloud APIs (Google Vision, Azure) or to Tesseract through current bindings such as tesserocr; PyQt4 is obsolete

**Revival potential:** NONE. The name overlap is unfortunate but the tech is outdated. Not worth reviving.

**Confidence:** High · **Sources:** GitHub (zdenop/lector), SourceForge mirror

---

#### **Tabulae (Finite State Morphology, 2023)**

**Status:** 🟡 Dormant rather than abandoned (sparse updates, last commit ~2023)
- **Purpose:** Build customizable Latin morphological parsers from tabular data using SFST
- **Audience:** Academic researchers wanting interpretable, corpus-specific morphology
- **Why not active:** Niche audience; SFST has learning curve; ML approaches are easier to prototype

**Revival potential:** LOW-MEDIUM — Intellectually interesting (finite-state machines are interpretable) but requires domain expertise. Worth watching if your target audience values explainability over speed.

**Confidence:** Medium · **Sources:** GitHub (neelsmith/tabulae), digital humanities literature

---

### 6. Contradictions and Gaps

**Morphology vs. Syntax trade-off:** No single tool dominates all tasks. Whitaker's Words excels at Latin morphology but has no dependency parser. Morpheus covers both Greek and Latin but is slower. CLTK is flexible, but each task requires separate tuning. The 2024 study's results for Trankit and GreTA suggest splitting responsibilities: morphology → Trankit, lemmatization → GreTA, parsing → Trankit.

**ML vs. Rule-based:** Whitaker's Words (rule-based) remains the gold standard for Latin morphology despite being 30+ years old. ML models (GreBERTa, GreTA) are competitive on Greek but not clearly superior on Latin. A likely explanation is that Latin morphology is highly regular, so rule-based approaches capture its inflectional patterns efficiently, whereas Greek has more allomorphy, where neural approaches excel.

**Gap:** No end-to-end tool combines all tasks (morphology, lemmatization, parsing, semantic alignment). CLTK comes closest but requires multiple sub-pipelines tuned separately.

**Confidence:** High · **Sources:** Digital Classicist Wiki, Celano (2024) study, CLTK design philosophy

---

## Data Points

- **CLTK v1.0 release:** August 2021 (ACL publication); v0.1.x legacy branch still available (pre-2021)
- **Whitaker's Words (Ada original):** ~1990s; Python port (open_words) started ~2015, ongoing
- **Morpheus (Perseus):** Original implementation ~2000s; current maintenance minimal but functional
- **Heidelberg NLP models released:** 2023 (ACL long paper), models deployed to HuggingFace 2023–2025
- **Latest morphosyntactic benchmark:** October 2024 (Celano/GLAUx study)
- **First1KGreek corpus:** Updated February 2026 (confirmed active)
- **ClassicsReaderAndroid last commit:** ~2019 (7 years dormant)

---

## Gaps and Unknowns

1. **Greek morphology vs. Latin:** The sources consulted agree that Whitaker's Words dominates Latin and that no Greek tool has the same maturity. Morpheus is closest but slower. Open question: Would porting Whitaker's architecture to Greek be viable?

2. **Real-world accuracy figures:** Published benchmarks (Celano 2024) report metrics on treebanks, but no user-reported accuracy on out-of-domain classical texts. Gap: End-to-end pipeline evaluation on unseen prose/poetry.

3. **Semantic similarity for allusions (SPhilBERTa):** Paper published 2023, but no user feedback or comparative studies vs. traditional allusion databases (e.g., Tesserae project). How reliable is it in practice?

4. **Production readiness of neural models:** HuggingFace models exist, but no documentation on deployment, latency, memory requirements, or handling of ambiguous/fragmentary texts. Gap: DevOps guide for using GreBERTa in production.

5. **Modern mobile support:** ClassicsReaderAndroid is dormant. No actively maintained iOS/Android classics reader leveraging current tools. Gap: Modern mobile reading experience for classicists.

---

## Sources

| Priority | URL | Title | Credibility | Date |
|----------|-----|-------|-------------|------|
| high | https://github.com/cltk/cltk | CLTK: Classical Language Toolkit | GitHub (official) | 2021–2026 |
| high | https://github.com/ArchimedesDigital/open_words | open_words: Whitaker's Words in Python | GitHub (maintained) | 2015–2026 |
| high | https://github.com/Heidelberg-NLP/ancient-language-models | Heidelberg NLP: GreBERTa, PhilBERTa, SPhilBERTa | GitHub (published ACL papers) | 2023–2025 |
| high | https://github.com/OpenGreekAndLatin | Open Greek and Latin Project | GitHub (very active) | 2020–2026 |
| high | https://arxiv.org/html/2410.12055v1 | State-of-the-Art Morphosyntactic Parser for Ancient Greek | arXiv (preprint) | 2024 |
| high | https://wiki.digitalclassicist.org/Morphological_parsing_or_lemmatising_Greek_and_Latin | Morphological Parsing (Digital Classicist Wiki) | Expert wiki (community-curated) | ongoing |
| high | https://aclanthology.org/2021.acl-demo.3/ | CLTK v1.0 Paper (ACL) | ACL Anthology (peer-reviewed) | 2021 |
| medium | https://github.com/perseids-tools/morpheus | Morpheus: Perseus Morphological Analyzer | GitHub (legacy) | 2000–2023 |
| medium | https://github.com/telpirion/ClassicsReaderAndroid | ClassicsReaderAndroid | GitHub (dormant) | ~2019 |
| medium | https://classics.wfu.edu/language-tools/ | Wake Forest Classics: Language Tools Directory | Academic (authoritative reference) | 2025 |

