Iran unveils 35 billion-word Persian macro-corpus

Iran has unveiled a massive Persian-language corpus consisting of 35 billion words.
The unveiling took place during a conference on the requirements for developing a large language model (LLM) for Persian, IRNA wrote.
The Persian macro-corpus, described as a product of artificial intelligence, was built by the private-sector company Targoman Intelligent Processing. The dataset is designed to support the development of advanced language processing through a large language model (LLM), a type of AI model that uses neural networks with a very large number of parameters.
The macro-corpus has been released as open source and is publicly accessible. It offers high diversity and preserves the structure of the original texts.
The conference highlighted the prominent role of generative artificial intelligence in recent advancements, which were attributed in particular to large language models (LLMs). LLMs can process multifaceted information and provide improved responses to a wide range of user queries.
During the event, artificial intelligence professor Behrouz Minaei emphasized the crucial role of data in the use of large language models, noting that nations with more data hold greater power.
He underscored the importance of having a native LLM to enhance the capabilities of governments.
The secretary traced the evolution of AI technology from expert systems in the 1970s and 1980s to the advent of data mining in the 1990s. He also discussed the emergence of deep learning systems, specifically highlighting the development of language models between 2012 and 2018.
The secretary stressed the advantages of the new generation of AI systems, praising their non-domain-specific capabilities and their expanded horizontal power in semantic circuits and content comprehension. He also emphasized the significance of a native Persian language model for its cultural and value-based contributions to diverse perspectives.