topicmodeldiscovery

Topic Modeling as a Tool for Resource Discovery

View the Project on GitHub efkuehn/topicmodeldiscovery

Import Saved Model to Explore

Ideally, the visualization would live in the same notebook that creates the topic model, since being able to see and explore the model helps in refining its parameters. For the sake of better web presentation, however, we keep these as separate Jupyter Notebooks.

The following two cells import the gensim modules and load the stored model and the assorted files needed to run the visualizations.

import json
from gensim import corpora 
from gensim.models.ldamodel import LdaModel 
from gensim.corpora.dictionary import Dictionary
lda_model = LdaModel.load('./models/PrelimTOpicModel2') 
corpus_dict = Dictionary.load_from_text('./models/corpus_dictionary_2')
with open('./models/corpus.json', 'r') as fp:
    corpus = json.load(fp)
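
For orientation, the sketch below shows the shape the loaded corpus is assumed to have: one list per document, each holding [token_id, count] pairs (a bag-of-words). JSON has no tuple type, so pairs saved as tuples come back as two-element lists. The sample data and the temporary file are made up for illustration and are not taken from the project's corpus.json.

```python
import json
import os
import tempfile

# Hypothetical bag-of-words corpus: two documents, each a list of (token_id, count) pairs.
sample_corpus = [[(0, 2), (3, 1)], [(1, 1), (3, 4), (7, 1)]]

# Round-trip through JSON, mirroring how the notebook reads ./models/corpus.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus.json")
    with open(path, "w") as fp:
        json.dump(sample_corpus, fp)
    with open(path, "r") as fp:
        loaded_corpus = json.load(fp)

# Tuples are serialized as JSON arrays, so they load back as lists.
print(loaded_corpus[0])
```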

Visualizing LDA Model

For this exercise we will use pyLDAvis. This visualization reduces the topic vectors to a two-dimensional representation that shows how the topics relate to one another. It's a helpful heuristic for exploring the topics.

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline
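
To give a feel for what the two-dimensional map is built from: pyLDAvis measures how far apart topics are (by default with the Jensen-Shannon divergence between their word distributions) and then scales those distances down to two dimensions. The sketch below computes that divergence by hand for three made-up topics. The distributions and the function name are illustrative assumptions, not pyLDAvis internals.

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Made-up topic-word distributions over a four-word vocabulary.
topic_a = [0.70, 0.20, 0.05, 0.05]
topic_b = [0.65, 0.25, 0.05, 0.05]
topic_c = [0.05, 0.05, 0.20, 0.70]

print(jensen_shannon(topic_a, topic_b))  # small: similar topics
print(jensen_shannon(topic_a, topic_c))  # larger: dissimilar topics
```

Topics that share their most probable words end up close together on the map, while topics with little vocabulary in common are pushed apart.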

One of the problems with pyLDAvis is that by default it sorts the topics and numbers them in that sorted order. This is why we set the parameter sort_topics=False. Even with sorting turned off, the topics from the gensim model are zero-indexed, while pyLDAvis numbers them starting from one. This makes the topic exploration a bit frustrating.

pyLDAvis.enable_notebook()
# sort_topics=False keeps the pyLDAvis topic numbers in step with gensim's, offset by one:
# gensim's topic numbers are zero-indexed, while the pyLDAvis numbers are one-indexed
vis = pyLDAvis.gensim.prepare(lda_model, corpus, corpus_dict, sort_topics=False, mds='mmds')
/Users/sgoodwin/Library/Python/3.7/lib/python/site-packages/pyLDAvis/_prepare.py:257: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  return pd.concat([default_term_info] + list(topic_dfs))
# run the variable `vis` if you want to see the pyLDAvis model.
vis
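
The off-by-one between gensim topic ids and pyLDAvis labels can be handled with a pair of small helpers. This is a minimal sketch that only holds when the visualization was prepared with sort_topics=False. The function names are hypothetical and not part of either library.

```python
def gensim_to_vis(topic_id):
    """Convert a zero-based gensim topic id to the one-based pyLDAvis label."""
    return topic_id + 1

def vis_to_gensim(vis_label):
    """Convert a one-based pyLDAvis label back to the zero-based gensim topic id."""
    if vis_label < 1:
        raise ValueError("pyLDAvis labels start at 1")
    return vis_label - 1

# Topic 4 in the visualization corresponds to gensim topic 3.
print(vis_to_gensim(4))
print(gensim_to_vis(0))
```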

To save the visualization as a standalone HTML file, uncomment the following line.

# pyLDAvis.save_html(vis, 'PrelimTopicModel_2.html') # Also 'PerlimTopicModel_1.html'

Find the Dominant Document For Each Topic

import pandas as pd

# Build a pandas DataFrame with one row per document: its dominant topic,
# that topic's share of the document, the topic's keywords, and the text.
def format_topics_sent(ldamodel, corpus, texts):
    rows = []
    for doc_topics in ldamodel[corpus]:
        # doc_topics[0] is the list of (topic, probability) pairs; this indexing
        # assumes the model was trained with per_word_topics=True.
        topic_num, prop_topic = max(doc_topics[0], key=lambda x: x[1])
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join(word for word, prop in wp)
        rows.append([int(topic_num), round(prop_topic, 4), topic_keywords])
    # Collect rows in a list first: DataFrame.append was removed in pandas 2.0.
    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_topic', 'Perc_Contrib', 'Topic_Keywords'])
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df
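
Given the heading of this section, one way to go from per-document dominant topics to the most representative document per topic is to keep, for each topic, the document with the highest contribution. The sketch below does this in plain Python on made-up (Dominant_topic, Perc_Contrib, text) rows shaped like the DataFrame columns above. The data and the function name are hypothetical.

```python
def top_document_per_topic(rows):
    """Return {topic: (contribution, text)} for the highest-contribution document per topic."""
    best = {}
    for topic, contribution, text in rows:
        if topic not in best or contribution > best[topic][0]:
            best[topic] = (contribution, text)
    return best

# Made-up rows: (Dominant_topic, Perc_Contrib, text).
rows = [
    (0, 0.61, "doc A"),
    (1, 0.83, "doc B"),
    (0, 0.92, "doc C"),
    (1, 0.47, "doc D"),
]

for topic, (contribution, text) in sorted(top_document_per_topic(rows).items()):
    print(topic, contribution, text)
```

With a real DataFrame the same idea could be expressed as a group-by on the dominant topic, but the plain-Python version makes the selection rule explicit.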