GitHub Launches Multilingual Dataset for AI Research

GitHub unveiled the GitHub Multilingual Repositories Dataset on June 15, 2026, offering metadata on more than 40 million public repositories to promote multilingual AI development. Released under the permissive CC0-1.0 license, the dataset helps developers identify multilingual content in README files, issues, and pull requests. The release aligns with Microsoft's 2025 commitment to improve multilingual data accessibility for open-source AI developers.

Unlike raw repository dumps, the dataset focuses on discoverability. It classifies the language of key repository elements using three tools (fastText, gcld3, and lingua-py), retaining classifications with confidence scores above 0.5. It also includes metadata such as repository creation dates, programming languages, and engagement metrics (stars, forks, and issue counts).

The dataset addresses the historical dominance of English in training data for large language models (LLMs), which has left many languages underrepresented and limited the global utility of AI tools. Its release coincides with a broader industry push for inclusive AI: Hugging Face launched FineTranslations, a trillion-token multilingual dataset covering 500+ languages, and Microsoft Research reported that more than half of multilingual datasets are still constructed via translations from English. Researchers can use the dataset to study how non-English-speaking developer communities collaborate, build evaluation sets for AI models, and measure how well underrepresented languages are represented in open source.
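
The confidence threshold mentioned above can be illustrated with a minimal sketch. The article does not say how the outputs of the three tools are combined, so the majority vote below is an assumption, as are the `Prediction` type, the `consensus_language` function, and the `min_votes` parameter. None of these come from the dataset itself, and the sample scores are invented for illustration.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Prediction(NamedTuple):
    """One detector's guess for a piece of text (hypothetical structure)."""
    tool: str          # e.g. "fastText", "gcld3", "lingua-py"
    language: str      # language code such as "pt" or "ja"
    confidence: float  # score between 0 and 1


def consensus_language(
    predictions: List[Prediction],
    threshold: float = 0.5,  # the cutoff cited in the article
    min_votes: int = 2,      # assumed: at least two tools must agree
) -> Optional[str]:
    """Return the language most confident detectors agree on, or None."""
    # Keep only predictions whose confidence is above the threshold.
    confident = [p for p in predictions if p.confidence > threshold]
    if not confident:
        return None
    # Count votes per language among the confident predictions.
    votes = Counter(p.language for p in confident)
    language, count = votes.most_common(1)[0]
    return language if count >= min_votes else None


# Made-up scores for a single README.
readme_predictions = [
    Prediction("fastText", "pt", 0.93),
    Prediction("gcld3", "pt", 0.81),
    Prediction("lingua-py", "es", 0.42),
]
print(consensus_language(readme_predictions))  # prints "pt"
```

Requiring agreement between at least two tools is one way to reduce false positives on short or mixed-language text; a looser pipeline could accept any single prediction above the threshold instead.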