Use a set, not a list
Picking the right stop words is hard.
Once you have them, make using them fast.
Containment tests (in and not in) take roughly constant time on a set, but on a list they scan element by element, so they get slower as the list grows.
No matter how you get your stop words, always put them in a set.
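To see the difference for yourself, here is a minimal timing sketch. The word list is made up for illustration (it is not a real stop word list), and the exact numbers will vary by machine, so treat it as a rough demonstration rather than a benchmark.

```python
import timeit

# Hypothetical stand-in for a real stop word list: 1,000 made-up words.
words_list = [f"word{i}" for i in range(1000)]
words_set = set(words_list)

# Worst case for the list: the token is absent, so every element is compared.
token = "not-a-stop-word"

# Run the same containment test 10,000 times against each container.
list_seconds = timeit.timeit(lambda: token in words_list, number=10_000)
set_seconds = timeit.timeit(lambda: token in words_set, number=10_000)

print(f"list: {list_seconds:.4f}s  set: {set_seconds:.4f}s")
```

Both containers give the same answer. Only the time taken differs, and the gap widens as the collection grows.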
Libraries that don’t give you a set…
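# NLTK returns a list (the corpus must be downloaded first: nltk.download('stopwords'))
import nltk.corpus
stopwords = set(nltk.corpus.stopwords.words('english'))
# The stop-words package also returns a list
import stop_words
stopwords = set(stop_words.get_stop_words('english'))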
Libraries that give you a set, making your code faster and cleaner…
import spacy.lang.en
type(spacy.lang.en.STOP_WORDS) is set
# Public import path in current scikit-learn releases
import sklearn.feature_extraction.text
type(sklearn.feature_extraction.text.ENGLISH_STOP_WORDS) is frozenset
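Whichever library you use, the filtering step can look the same. The sketch below assumes a hypothetical helper, remove_stop_words, and a tiny hand-written stop word collection. Neither comes from any of the libraries above. The helper converts its input to a frozenset once, so it behaves the same whether it is handed a list, a set, or a frozenset.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, preserving order."""
    # Convert once up front: a single O(n) pass for a list or set,
    # instead of scanning a list again for every token.
    stop_words = frozenset(stop_words)
    return [t for t in tokens if t.lower() not in stop_words]


# Tiny illustrative stop word collection, not taken from any library.
example_stop_words = ["the", "a", "is", "on"]
tokens = "The cat is on a mat".split()

print(remove_stop_words(tokens, example_stop_words))  # ['cat', 'mat']
```

Lowercasing each token before the test is a design choice made for this sketch. Skip it if your tokens are already normalized.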