Resources

List of general-purpose LLMs

Raphaël d'Assignies
12 May 2023

This page lists a few current large pre-trained language models. For a list of several tens of thousands of models, see the Hugging Face site, which also provides various regularly updated leaderboards, including one for open-source LLMs.

Alpaca

https://crfm.stanford.edu/2023/03/13/alpaca.html
https://github.com/tatsu-lab/stanford_alpaca

Abstract:

Instruction-following models such as GPT-3.5 (text-davinci-003), ChatGPT, Claude, and Bing Chat have become increasingly powerful. Many users now interact with these models regularly and even use them for work. However, despite their widespread deployment, instruction-following models still have many deficiencies: they can generate false information, propagate social stereotypes, and produce toxic language. To make maximum progress on addressing these pressing problems, it is important for the academic community to engage. Unfortunately, doing research on instruction-following models in academia has been difficult, as there is no easily accessible model that comes close in capabilities to closed-source models such as OpenAI’s text-davinci-003. We are releasing our findings about an instruction-following language model, dubbed Alpaca, which is fine-tuned from Meta’s LLaMA 7B model. We train the Alpaca model on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003. On the self-instruct evaluation set, Alpaca shows many behaviors similar to OpenAI’s text-davinci-003, but is also surprisingly small and easy/cheap to reproduce.

CamemBERT: a Tasty French Language Model

https://arxiv.org/abs/1911.03894

Abstract:

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models –in all languages except English– very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

CamemBERT 2.0

CamemBERT 2.0: the most widely used model for French will soon be obsolete

There is a growing need for reliable semantic search engines, that is, engines able to supply language models (LLMs) with the documents they need for training and operation. In other words, this process contextualizes a query by working to understand the intent behind the question and by providing a set of documents to answer it. The best of these approaches rely on search and rerank.
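
The search-and-rerank pattern mentioned above can be sketched in a few lines. This is a toy illustration only, not the method of any particular system: a cheap first stage shortlists candidates, and a costlier second stage reorders just that shortlist. Production systems typically use learned models for both stages; here, raw term counts and cosine similarity stand in for them, and the function names and the corpus are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def search(query, documents, k):
    """Stage 1: cheap retrieval, scoring each document by query-term occurrences."""
    query_tokens = set(tokenize(query))
    scored = []
    for index, doc in enumerate(documents):
        hits = sum(1 for token in tokenize(doc) if token in query_tokens)
        if hits > 0:
            scored.append((hits, index))
    # Most hits first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank(query, documents, candidates):
    """Stage 2: rescore only the shortlisted candidates with a costlier function."""
    query_vec = Counter(tokenize(query))
    return sorted(
        candidates,
        key=lambda index: cosine(query_vec, Counter(tokenize(documents[index]))),
        reverse=True,
    )


def search_and_rerank(query, documents, k=3):
    """Return the indices of the shortlisted documents, best match first."""
    return rerank(query, documents, search(query, documents, k))


# Hypothetical toy corpus, for illustration only.
DOCS = [
    "CamemBERT is a French language model",
    "BLOOM outputs text in 46 languages",
    "A French language model trained on a large and varied French web corpus of news and forum text",
    "PaLM was trained on TPU chips",
]

first_stage = search("French language model", DOCS, 3)
final_order = search_and_rerank("French language model", DOCS)
```

On this corpus the first stage favors the long document because it repeats a query term, while the reranker, which normalizes for length, puts the short exact match back on top.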

ColossalChat (Colossal-AI)

https://github.com/hpcaitech/ColossalAI

Abstract:

Colossal-AI is the first to open-source a complete RLHF pipeline that includes supervised data collection, supervised fine-tuning, reward model training, and reinforcement learning fine-tuning, based on the LLaMA pre-trained model.

CroissantLLM: A Truly Bilingual French-English Language Model

https://arxiv.org/pdf/2402.00786

Abstract:

We introduce CroissantLLM, a 1.3B language model pre-trained on a set of 3T English and French tokens, to bring to the research and industrial community a high-performance, fully open-sourced bilingual model that runs swiftly on consumer-grade local hardware. To that end, we pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio, a custom tokenizer, and bilingual finetuning datasets. We release the training dataset, notably containing a French split with manually curated, high-quality, and varied data sources. To assess performance outside of English, we craft a novel benchmark, FrenchBench, consisting of an array of classification and generation tasks, covering various orthogonal aspects of model performance in the French Language. Additionally, rooted in transparency and to foster further Large Language Model research, we release codebases, and dozens of checkpoints across various model sizes, training data distributions, and training steps, as well as fine-tuned Chat models, and strong translation models. We evaluate our model through the FMTI framework (Bommasani et al., 2023) and validate 81% of the transparency criteria, far beyond the scores of even most open initiatives. This work enriches the NLP landscape, breaking away from previous English-centric work to strengthen our understanding of multilingualism in language models.
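
To make the 1:1 English-to-French data ratio concrete, here is a minimal sketch of mixing two data streams at a fixed ratio. It is an illustration under stated assumptions, not the CroissantLLM pipeline: the ratio is applied here per document, whereas a real pretraining mix would more plausibly be balanced by token count, and the function name and sample data are hypothetical.

```python
from itertools import islice


def interleave_by_ratio(english, french, english_per_cycle=1, french_per_cycle=1):
    """Yield (language, document) pairs from two streams at a fixed ratio.

    Stops as soon as either stream cannot fill a complete cycle, so the
    requested ratio holds exactly over everything that was yielded.
    """
    english_iter, french_iter = iter(english), iter(french)
    while True:
        en_chunk = list(islice(english_iter, english_per_cycle))
        fr_chunk = list(islice(french_iter, french_per_cycle))
        if len(en_chunk) < english_per_cycle or len(fr_chunk) < french_per_cycle:
            return
        for doc in en_chunk:
            yield ("en", doc)
        for doc in fr_chunk:
            yield ("fr", doc)


# Hypothetical sample data: three English documents, two French ones.
ENGLISH_DOCS = ["the cat sleeps", "a dog barks", "hello world"]
FRENCH_DOCS = ["le chat dort", "un chien aboie"]

mixed = list(interleave_by_ratio(ENGLISH_DOCS, FRENCH_DOCS))
```

With the default 1:1 setting, the leftover third English document is dropped rather than allowed to skew the balance.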

Claude

Claude is a next-generation AI assistant based on Anthropic’s research into training helpful, honest, and harmless AI systems. Accessible through chat interface and API in our developer console, Claude is capable of a wide variety of conversational and text processing tasks while maintaining a high degree of reliability and predictability.

Claude can help with use cases including summarization, search, creative and collaborative writing, Q&A, coding, and more. Early customers report that Claude is much less likely to produce harmful outputs, easier to converse with, and more steerable – so you can get your desired output with less effort. Claude can also take direction on personality, tone, and behavior.

Pi

Palo Alto, CA, May 2, 2023 – Inflection AI today announced the first release of its Personal AI, Pi (heypi.com). A new class of AI, Pi is designed to be a kind and supportive companion offering conversations, friendly advice, and concise information in a natural, flowing style.

Pi was created to give people a new way to express themselves, share their curiosities, explore new ideas, and experience a trusted personal AI. It is built on world-class proprietary AI technology developed in-house. The Pi experience is intended to prioritize conversations with people, where other AIs serve productivity, search, or answering questions. Pi is a coach, confidante, creative partner, or sounding board.

FlauBERT: Unsupervised Language Model Pre-training for French

https://arxiv.org/abs/1912.05372

Abstract:

Language models have become a key step to achieve state-of-the art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research community for further reproducible experiments in French NLP.

GPT-4

https://cdn.openai.com/papers/gpt-4.pdf

Abstract:

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance based on models trained with no more than 1/1,000th the compute of GPT-4.
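
The idea of predicting a large model's performance from much smaller training runs can be illustrated with a simple curve fit. The sketch below is an assumption-laden toy: the abstract does not state which functional form was used, so a plain power law, loss = a * compute^(-b), is assumed here, and the data points are synthetic values generated from that same law rather than real measurements.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute); returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(value) for value in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a given compute budget."""
    return a * compute ** (-b)


# Synthetic small-scale runs (hypothetical): compute is expressed as a fraction
# of the target run, and no run exceeds 1/1,000th of it.
SMALL_COMPUTE = [1e-6, 1e-5, 1e-4, 1e-3]
SMALL_LOSS = [4.0 * c ** -0.1 for c in SMALL_COMPUTE]

a, b = fit_power_law(SMALL_COMPUTE, SMALL_LOSS)
predicted_full_scale_loss = predict_loss(a, b, 1.0)
```

Because the synthetic points lie exactly on a power law, the fit recovers its parameters and the extrapolation to full scale is exact; with real, noisy measurements the extrapolation would carry uncertainty.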

BARThez: a Skilled Pretrained French Sequence-to-Sequence Model

https://arxiv.org/abs/2010.12321

Abstract:

Inductive transfer learning, enabled by self-supervised learning, have taken the entire Natural Language Processing (NLP) field by storm, with models such as BERT and BART setting new state of the art on countless natural language understanding tasks. While there are some notable exceptions, most of the available models and research have been conducted for the English language. In this work, we introduce BARThez, the first BART model for the French language (to the best of our knowledge). BARThez was pretrained on a very large monolingual French corpus from past research that we adapted to suit BART’s perturbation schemes. Unlike already existing BERT-based French language models such as CamemBERT and FlauBERT, BARThez is particularly well-suited for generative tasks, since not only its encoder but also its decoder is pretrained. In addition to discriminative tasks from the FLUE benchmark, we evaluate BARThez on a novel summarization dataset, OrangeSum, that we release with this paper. We also continue the pretraining of an already pretrained multilingual BART on BARThez’s corpus, and we show that the resulting model, which we call mBARTHez, provides a significant boost over vanilla BARThez, and is on par with or outperforms CamemBERT and FlauBERT.

BLOOM (Hugging Face)

https://huggingface.co/bigscience/bloom

Abstract:

BLOOM is an autoregressive Large Language Model (LLM), trained to continue text from a prompt on vast amounts of text data using industrial-scale computational resources. As such, it is able to output coherent text in 46 languages and 13 programming languages that is hardly distinguishable from text written by humans. BLOOM can also be instructed to perform text tasks it hasn’t been explicitly trained for, by casting them as text generation tasks.
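
"Autoregressive" means the model extends a prompt one token at a time, each new token being chosen from what has been produced so far. The sketch below shows that loop with a deliberately tiny stand-in: a bigram count table replaces the neural network, decoding is greedy, and the training sentences are hypothetical. It says nothing about how BLOOM itself is implemented.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each token, which tokens follow it in the training text."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for current, nxt in zip(tokens, tokens[1:]):
            follows[current][nxt] += 1
    return follows


def continue_text(follows, prompt, max_new_tokens=5):
    """Greedy autoregressive decoding: repeatedly append the most likely next token."""
    tokens = prompt.split()
    if not tokens:
        return prompt  # nothing to condition on
    for _ in range(max_new_tokens):
        candidates = follows.get(tokens[-1])
        if not candidates:
            break  # no continuation was ever observed for this token
        # Ties are broken alphabetically so the output is deterministic.
        next_token = max(sorted(candidates), key=candidates.get)
        tokens.append(next_token)
    return " ".join(tokens)


# Hypothetical training text, for illustration only.
CORPUS = [
    "the model continues the prompt",
    "the model predicts the next token",
    "the model continues text",
]

table = train_bigram(CORPUS)
completion = continue_text(table, "the")  # "the model continues text"
```

A real LLM conditions on the whole preceding context rather than on the last token alone, and usually samples from a probability distribution instead of always taking the top choice.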

LLaMA (Meta AI)

https://research.facebook.com/file/1574548786327032/LLaMA–Open-and-Efficient-Foundation-Language-Models.pdf

Abstract:

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.

Cedille

https://arxiv.org/abs/2202.03371

Abstract:

Scaling up the size and training of autoregressive language models has enabled novel ways of solving Natural Language Processing tasks using zero-shot and few-shot learning. While extreme-scale language models such as GPT-3 offer multilingual capabilities, zero-shot learning for languages other than English remain largely unexplored. Here, we introduce Cedille, a large open source auto-regressive language model, specifically trained for the French language. Our results show that Cedille outperforms existing French language models and is competitive with GPT-3 on a range of French zero-shot benchmarks. Furthermore, we provide an in-depth comparison of the toxicity exhibited by these models, showing that Cedille marks an improvement in language model safety thanks to dataset filtering.

CLAIRE

The first open LLM, “CLAIRE”, is on Hugging Face

LINAGORA and the OpenLLM France community have released the first open LLM: “CLAIRE”

The model is Claire-7B-0.1. It is particularly well suited to processing data from French-language dialogues.

The training data selected are French conversational data available under open licenses.

Claire-7B-0.1 comes in two variants, depending on the licenses and the training data:

  • A first model is released under the CC-BY-NC-SA open license, because it was trained on data some of which was under CC-BY-NC-SA. This is the one that benefited from the largest dataset;
  • A second model is released under the open-source Apache V2 license. Its training uses only data under compatible licenses.

Kosmos-1

Language Is Not All You Need: Aligning Perception with Language Models

https://arxiv.org/pdf/2302.14045

Abstract:

A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.

OPT: Open Pre-trained Transformer Language Models

https://arxiv.org/pdf/2205.01068.pdf

Abstract:

Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

PaLM: Scaling Language Modeling with Pathways

https://arxiv.org/abs/2204.02311

Abstract:

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM. We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.

GPT-NeoX (EleutherAI)

Abstract:

We introduce GPT-NeoX-20B, a 20 billion parameter autoregressive language model trained on the Pile, whose weights will be made freely and openly available to the public through a permissive license. It is, to the best of our knowledge, the largest dense autoregressive model that has publicly available weights at the time of submission. In this work, we describe GPT-NeoX-20B’s architecture and training and evaluate its performance on a range of language-understanding, mathematics, and knowledge-based tasks. We find that GPT-NeoX-20B is a particularly powerful few-shot reasoner and gains far more in performance when evaluated five-shot than similarly sized GPT-3 and FairSeq models. We open-source the training and evaluation code, as well as the model weights, at https://github.com/EleutherAI/gpt-neox.

Vigogne

GitHub – bofenghuang/vigogne: French instruction-following and chat models