Papers

Preprints

DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures: Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
paper

Nested Named Entity Recognition as Single-Pass Sequence Labeling: Alberto Muñoz-Ortiz, David Vilares, Caio Corro, Carlos Gómez-Rodríguez
preprint

EuroBERT: Scaling Multilingual Encoders for European Languages: Nicolas Boizard, Hippolyte Gisserot-Boukhlef, Duarte M. Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, Gabriel Hautreux, João Alves, Kevin El-Haddad, Manuel Faysse, Maxime Peyrard, Nuno M. Guerreiro, Patrick Fernandes, Ricardo Rei, Pierre Colombo
CoLM 2025 - Conference on Language Modeling
preprint - HuggingFace page

Bregman Conditional Random Fields: Sequence Labeling with Parallelizable Inference Algorithms: Caio Corro, Mathieu Lacroix, Joseph Le Roux
ACL 2025 - Annual Meeting of the Association for Computational Linguistics
paper - code

Discrete latent structure in neural networks: Vlad Niculae, Caio Corro, Nikita Nangia, Tsvetomila Mihaylova, André F. T. Martins
FnT SIG 2025 - Foundations and Trends in Signal Processing
preprint - book

CroissantLLM: A Truly Bilingual French-English Language Model: Manuel Faysse, Patrick Fernandes, Nuno Guerreiro, António Loison, Duarte Alves, Caio Corro, Nicolas Boizard, João Alves, Ricardo Rei, Pedro Martins, Antoni Bigata Casademunt, François Yvon, André Martins, Gautier Viaud, Céline Hudelot, Pierre Colombo
TMLR 2025 - Transactions on Machine Learning Research
paper - openreview

Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain: Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
Coling 2025 Workshop on Financial Technology and Natural Language Processing (FinNLP), Financial Narrative Processing (FNP), and on Large Language Models for Finance and Legal (LLMFinLegal)
paper - code

Few-shot domain adaptation for named-entity recognition via joint constrained k-means and subspace selection: Ayoub Hammal, Benno Uthayasooriyar, Caio Corro
Coling 2025 - International Conference on Computational Linguistics
paper - code

SaulLM-7B: A pioneering large language model for law: Pierre Colombo, Telmo Pessoa Pires, Malik Boudiaf, Dominic Culver, Rui Melo, Caio Corro, André F. T. Martins, Fabrizio Esposito, Vera Lúcia Raposo, Sofia Morgado, Michael Desa
Technical report
paper

A fast and sound tagging method for discontinuous named-entity recognition: Caio Corro
EMNLP 2024 - Conference on Empirical Methods in Natural Language Processing
paper - code

Building quantitative contrastive grammars from syntactic treebanks: Santiago Herrera, Ioana-Madalina Silai, Caio Corro, Bruno Guillaume, Sylvain Kahane
LLcD 2024 - Rencontre annuelle Langues & Langage à la croisée des Disciplines
abstract

Sparse logistic regression with high-order features for automatic grammar rule extraction from treebanks: Santiago Herrera, Caio Corro, Sylvain Kahane
LREC-Coling 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation
paper - code - extracted rules

Régression logistique parcimonieuse pour l'extraction automatique de règles de grammaire: Santiago Herrera, Caio Corro, Sylvain Kahane
TALN 2024 - Conférence sur le Traitement Automatique des Langues Naturelles
paper

Actes de la journée d’étude sur le traitement automatique des langues frugal et la recherche d'information frugale: Caio Corro, Gaël Lejeune
proceedings

Structural generalization in COGS: Supertagging is (almost) all you need: Alban Petit, Caio Corro, François Yvon
EMNLP 2023 - Conference on Empirical Methods in Natural Language Processing
paper

A dynamic programming algorithm for span-based nested named-entity recognition in \(\mathcal O(n^2)\): Caio Corro
ACL 2023 - Annual Meeting of the Association for Computational Linguistics
paper

On graph-based reentrancy-free semantic parsing: Alban Petit, Caio Corro
TACL 2023 - Transactions of the Association for Computational Linguistics
paper

On the inconsistency of separable losses for structured prediction: Caio Corro
EACL 2023 - European Chapter of the Association for Computational Linguistics
paper

Actes de la journée d’étude sur la robustesse des systèmes de TAL (Robustal 2022): Caio Corro, Gaël Lejeune
proceedings

Un algorithme d'analyse sémantique fondée sur les graphes via le problème de l'arborescence généralisée couvrante: Alban Petit, Caio Corro
TALN 2022 - Conférence sur le Traitement Automatique des Langues Naturelles
paper

Ré-ordonnancement via programmation dynamique pour l'adaptation cross-lingue d'un analyseur en dépendances: Nicolas Devatine, Caio Corro, François Yvon
TALN 2022 - Conférence sur le Traitement Automatique des Langues Naturelles
paper - slides

GPU-Accelerated Forward-Backward algorithm with Application to Lattice-Free MMI: Lucas Ondel, Léa-Marie Lam-Yee-Mui, Martin Kocour, Caio Filippo Corro, Lukáš Burget
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing
paper

Preventing posterior collapse in variational autoencoders for text generation via decoder regularization: Alban Petit, Caio Corro
NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications
paper

Auto-encodeurs variationnels : contrecarrer le problème de posterior collapse grâce à la régularisation du décodeur: Alban Petit, Caio Corro
TALN 2021 - Conférence sur le Traitement Automatique des Langues Naturelles
paper

Span-based discontinuous constituency parsing: a family of exact chart-based algorithms with time complexities from O(n^6) down to O(n^3): Caio Corro
EMNLP 2020 - Conference on Empirical Methods in Natural Language Processing
paper

Sur l'impact des contraintes structurelles pour l'analyse en dépendances profondes fondée sur les graphes: Caio Corro
TALN 2020 - Conférence sur le Traitement Automatique des Langues Naturelles
paper - code

Learning Latent Trees with Stochastic Perturbations and Differentiable Dynamic Programming: Caio Corro, Ivan Titov
ACL 2019 - Annual Meeting of the Association for Computational Linguistics
paper - poster (landscape) - poster (portrait)

Differentiable Perturb-and-Parse: Semi-Supervised Parsing with a Structured Variational Autoencoder: Caio Corro, Ivan Titov
ICLR 2019 - Seventh International Conference on Learning Representations
paper - poster

Lagrangian Based Approaches for Lexicalized Tree Adjoining Grammar Parsing: Caio Corro
PhD thesis
pdf - slides

Efficient Discontinuous Phrase-Structure Parsing via the Generalized Maximum Spanning Arborescence: Caio Corro, Joseph Le Roux, Mathieu Lacroix
EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing
paper - poster

Transforming Dependency Structures to LTAG Derivation Trees: Caio Corro, Joseph Le Roux
TAG+ 2017 - 13th International Workshop on Tree-Adjoining Grammar and Related Formalisms
paper - slides

Dependency Parsing with Bounded Block Degree and Well-nestedness via Lagrangian Relaxation and Branch-and-Bound: Caio Corro, Joseph Le Roux, Mathieu Lacroix, Antoine Rozenknop, Roberto Wolfler Calvo
ACL 2016 - Annual Meeting of the Association for Computational Linguistics
paper - slides

Méthode lagrangienne pour les arborescences couvrantes avec application en traitement automatique des langues: Caio Corro, Joseph Le Roux, Mathieu Lacroix, Antoine Rozenknop, Roberto Wolfler Calvo
ROADEF 2016
link