spaCy

spacy.io

About this website

spaCy is an industrial-strength natural language processing library for Python, first released in 2015 by Matthew Honnibal and now developed by Explosion, the company he co-founded with Ines Montani. It is written in Python and Cython for performance and can process thousands of documents per second, depending on the pipeline and hardware, while keeping a small memory footprint. The library is built around a modular pipeline: a Tokenizer followed by optional components such as Tagger, Morphologizer, Lemmatizer, DependencyParser, EntityRecognizer (NER), EntityLinker, SpanCategorizer, TextCategorizer, and SentenceRecognizer.

spaCy ships trained pipelines for English in several sizes: en_core_web_sm (about 12 MB), en_core_web_md (about 40 MB), en_core_web_lg (about 560 MB), and en_core_web_trf (about 438 MB, a transformer pipeline based on RoBERTa); exact sizes vary by release. It supports over 70 languages, with trained pipelines available for a subset of them, including German (de_core_news), French (fr_core_news), Chinese (zh_core_web), Japanese (ja_core_news), and Dutch (nl_core_news). The English pipelines label entities using the OntoNotes 5 scheme, with types such as PERSON, ORG, GPE, DATE, MONEY, PRODUCT, EVENT, and LAW; pipelines for other languages may use different label sets.

For rule-based work, the Matcher supports token-level patterns with operators and quantifiers, the PhraseMatcher efficiently matches large lists of exact phrases, and the EntityRuler adds pattern-based entities that can be combined with the statistical EntityRecognizer. In v3.x, a single config.cfg file defines all training settings through the configuration system provided by Thinc. spaCy integrates with Hugging Face transformers via spacy-transformers, supports custom pipeline components via Language.add_pipe(), and includes displaCy for dependency and entity visualization. The project has over 30,000 GitHub stars and is used by companies like Airbnb, Quora, and Mashable for production NLP workloads.
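As a rough illustration of the pipeline, rule-matching, and visualization features described above, here is a minimal sketch assuming spaCy v3.x is installed. It uses a blank English pipeline so that no trained package has to be downloaded. The sample sentence, the patterns, the match key "SPEED", and the component name "token_counter" are all made up for this example and are not taken from the spaCy documentation.

```python
import spacy
from spacy import displacy
from spacy.language import Language
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, no trained components.
# A trained pipeline would instead be loaded with spacy.load("en_core_web_sm")
# after installing that package.
nlp = spacy.blank("en")

# Rule-based entities via the EntityRuler (illustrative patterns).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Explosion"},              # exact phrase pattern
    {"label": "PRODUCT", "pattern": [{"LOWER": "spacy"}]},  # token pattern
])


# Custom pipeline component; "token_counter" is a hypothetical name.
@Language.component("token_counter")
def token_counter(doc):
    # Store the token count on the Doc so later steps can read it.
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp.add_pipe("token_counter", last=True)

doc = nlp("spaCy is a very fast library from Explosion.")

# Token-based matching with a quantifier: "OP": "?" makes "very" optional.
matcher = Matcher(nlp.vocab)
matcher.add("SPEED", [[{"LOWER": "very", "OP": "?"}, {"LOWER": "fast"}]])
matched = {doc[start:end].text for _, start, end in matcher(doc)}

# displaCy returns HTML markup as a string when jupyter=False.
html = displacy.render(doc, style="ent", jupyter=False)

print(nlp.pipe_names)
print([(ent.text, ent.label_) for ent in doc.ents])
print(sorted(matched))
```

Because the pipeline here is blank, every entity comes from the ruler patterns; with a trained pipeline, the EntityRuler could be added before or after the statistical "ner" component to decide which of the two takes precedence.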
