gradec.model.LDAModel
- class gradec.model.LDAModel(n_topics, max_iter=1000, alpha=None, beta=0.001, text_column='abstract', n_cores=1)[source]
Generate a latent Dirichlet allocation (LDA) topic model.
This class is a light wrapper around scikit-learn tools for tokenization and LDA.
- Parameters:
n_topics (int) – Number of topics for the topic model. This corresponds to the model’s n_components parameter. Must be an integer >= 1.
max_iter (int, optional) – Maximum number of iterations to use during model fitting. Default = 1000.
alpha (float or None, optional) – The alpha value for the model. This corresponds to the model’s doc_topic_prior parameter. Default is None, which evaluates to 1 / n_topics, as was used in :footcite:t:`poldrack2012discovering`.
beta (float or None, optional) – The beta value for the model. This corresponds to the model’s topic_word_prior parameter. If None, it evaluates to 1 / n_topics. Default is 0.001, which was used in :footcite:t:`poldrack2012discovering`.
text_column (str, optional) – The source of text to use for the model. This should correspond to an existing column in the texts attribute. Default is “abstract”.
n_cores (int, optional) – Number of cores to use for parallelization. If <= 0, defaults to using all available cores. Default is 1.
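The default rules described above can be summarized in a short sketch. The helper `resolve_lda_params` is hypothetical and is not part of gradec; it only restates the documented behavior (None priors evaluate to 1 / n_topics, and a non-positive n_cores means all available cores) under the assumption that these rules are applied before the underlying model is built.

```python
import os


def resolve_lda_params(n_topics, alpha=None, beta=0.001, n_cores=1):
    """Hypothetical helper mirroring the documented defaults; not gradec code."""
    if not isinstance(n_topics, int) or n_topics < 1:
        raise ValueError("n_topics must be an integer >= 1")

    # None means "use 1 / n_topics" for either prior, per the parameter docs.
    doc_topic_prior = 1.0 / n_topics if alpha is None else alpha
    topic_word_prior = 1.0 / n_topics if beta is None else beta

    # A non-positive n_cores means "use every available core".
    resolved_cores = (os.cpu_count() or 1) if n_cores <= 0 else n_cores

    return {
        "n_components": n_topics,
        "doc_topic_prior": doc_topic_prior,
        "topic_word_prior": topic_word_prior,
        "n_cores": resolved_cores,
    }


params = resolve_lda_params(n_topics=50)
print(params)
```

With the defaults, alpha resolves to 1 / 50 while beta stays at 0.001, so the two priors differ unless beta is explicitly set to None.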
- Variables:
model (LatentDirichletAllocation) –
Notes
Adapted from: https://github.com/neurostuff/NiMARE/blob/main/nimare/annotate/lda.py.
Latent Dirichlet allocation was first developed in :footcite:t:`blei2003latent`, and was first applied to neuroimaging articles in :footcite:t:`poldrack2012discovering`.
References
See also
CountVectorizer – Used to build a vocabulary of terms and their associated counts from texts in the self.text_column of the Dataset’s texts attribute.
LatentDirichletAllocation – Used to train the LDA model.
- fit(dset, counts_df=None)[source]
Fit the LDA topic model to text from a Dataset.
- Parameters:
dset (Dataset) – A Dataset with, at minimum, text available in the self.text_column column of its texts attribute.
counts_df (pandas.DataFrame or None, optional) – A DataFrame with feature counts for the model. The index is ‘id’, used for identifying studies. Other columns are features (e.g., unigrams and bigrams from Neurosynth), where each value is the number of times the feature is found in a given article. Default is None.
- Returns:
dset (Dataset) – A new Dataset with an updated annotations attribute.
- Variables:
distributions (dict) – A dictionary containing additional distributions produced by the model, including:
p_topic_g_word: numpy.ndarray of shape (n_topics, n_tokens) containing the topic-term weights for the model.
p_topic_g_word_df: pandas.DataFrame of shape (n_topics, n_tokens) containing the topic-term weights for the model.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
params (dict) – Parameter names mapped to their values.
- classmethod load(filename, compressed=True)
Load a pickled class instance from file.
- Parameters:
filename (str) – Name of file containing object.
compressed (bool, default=True) – If True, the file is assumed to be compressed and gzip will be used to load it. Otherwise, it will assume that the file is not compressed. Default = True.
- Returns:
obj (class object) – Loaded class object.
- save(filename, compress=True)
Pickle the class instance to the provided file.
- Parameters:
filename (str) – File to which object will be saved.
compress (bool, optional) – If True, the file will be compressed with gzip. Otherwise, the uncompressed version will be saved. Default = True.
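The save and load behavior described here (pickling, with optional gzip compression) can be sketched with the standard library alone. The functions `save_object` and `load_object` are hypothetical stand-ins, not gradec's implementation, and the file name is made up for the example.

```python
import gzip
import os
import pickle
import tempfile


def save_object(obj, filename, compress=True):
    """Pickle obj to filename, through gzip when compress is True."""
    opener = gzip.open if compress else open
    with opener(filename, "wb") as f:
        pickle.dump(obj, f)


def load_object(filename, compressed=True):
    """Unpickle an object, reading through gzip when compressed is True."""
    opener = gzip.open if compressed else open
    with opener(filename, "rb") as f:
        return pickle.load(f)


# Any picklable object works; a plain dict stands in for a fitted model.
state = {"n_topics": 50, "beta": 0.001}
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model.pkl.gz")
    save_object(state, path, compress=True)
    restored = load_object(path, compressed=True)

print(restored == state)
```

Note that the two documented signatures use different keyword names: `compress` for save and `compressed` for load. The flag passed to load should match how the file was saved.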
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Returns:
self
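The `<component>__<parameter>` convention can be illustrated with a small stand-in class. `SimpleEstimator` is hypothetical and much simpler than scikit-learn's actual implementation; it only shows how a double-underscore name can be split and routed to a nested component, and why returning self is convenient.

```python
class SimpleEstimator:
    """Hypothetical stand-in for an estimator; not scikit-learn code."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, delim, sub_key = key.partition("__")
            if delim:
                # Nested form: forward the remainder to the named component.
                self.params[name].set_params(**{sub_key: value})
            else:
                # Simple form: set the parameter on this object.
                self.params[name] = value
        return self  # returning self allows call chaining


lda = SimpleEstimator(n_components=10)
pipeline = SimpleEstimator(lda=lda, verbose=False)

# Update a nested parameter and a top-level parameter in one call.
pipeline.set_params(lda__n_components=25, verbose=True)
print(lda.params["n_components"], pipeline.params["verbose"])
```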