Topic Modeling as a Tool for Resource Discovery
The script pool_process.py runs on 7 CPU cores and writes its results to ./output/analalyzed_corpus.json. One of the big problems with running NLTK under multiprocessing is that the WordNetLemmatizer has to be initialized first, before the worker processes are started. Read more about it in this stackoverflow answer. The script produces a text file that needs a bit of massaging before it can be read as a Python dictionary. Part of the goal of this script is to make a pandas dataframe.
# thanks to @ninjaaron for help pulling this script section together
from collections import namedtuple
import json
encode = json.JSONEncoder(ensure_ascii=False).encode
#Book = namedtuple("Book", "ht_id, top_topic, best_match, most_common_topic")
Book = namedtuple("Book", ["ht_id", "year", "subjects", "top_topic", "best_match", "most_common_topic"])
Topic = namedtuple("Topic", "top_num, perc")
BestMatch = namedtuple("BestMatch", "page, top_num, perc")
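with open("analyzed_corpus3.json") as fh:
    # The file holds Python literal syntax built from the namedtuples above,
    # so it is evaluated rather than parsed as JSON. Only do this with files you generated yourself.
    books = eval(fh.read())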
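Because eval runs whatever code the file contains, it may be worth narrowing what the evaluated text can reach. The following is a minimal sketch, assuming the file consists of a single dictionary literal whose values are written as Book(...), Topic(...), and BestMatch(...) calls. The helper name load_books is hypothetical, and emptying __builtins__ only reduces the exposed surface; it is not a security sandbox.

```python
from collections import namedtuple

Book = namedtuple("Book", ["ht_id", "year", "subjects", "top_topic", "best_match", "most_common_topic"])
Topic = namedtuple("Topic", "top_num, perc")
BestMatch = namedtuple("BestMatch", "page, top_num, perc")


def load_books(text):
    """Evaluate the corpus text with only the three namedtuple constructors in scope."""
    allowed = {"Book": Book, "Topic": Topic, "BestMatch": BestMatch}
    # Hypothetical helper: an empty __builtins__ hides open(), __import__(), etc.
    # This limits accidents; it is not a real sandbox.
    return eval(text, {"__builtins__": {}}, allowed)


# Illustrative sample in the assumed file format.
sample = (
    "{'mdp.39015019393407': Book(ht_id='mdp.39015019393407', year='1974', "
    "subjects='Political science History', "
    "top_topic=Topic(top_num=5, perc=0.3887), "
    "best_match=BestMatch(page=162, top_num=5, perc=0.8628), "
    "most_common_topic=Topic(top_num=5, perc=97))}"
)
books_sample = load_books(sample)
print(books_sample['mdp.39015019393407'].best_match.page)
```

With the real file, the call would be load_books(fh.read()) in place of the bare eval.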
The following code takes the hierarchical data and flattens it so that it can be easily imported into a pandas dataframe.
analyzed_list = []
for book in books.values():
    out_dict = {}
    try:
        dct = book._asdict()
    except AttributeError:
        # Not a namedtuple: skip it rather than reuse the previous book's fields
        continue
    out_dict['ht_id'] = dct['ht_id']
    out_dict['year'] = dct['year']
    out_dict['subjects'] = dct['subjects']
    for key, value in dct.items():
        try:
            # Nested namedtuples (Topic, BestMatch) become prefixed columns
            for inner_key, inner_value in value._asdict().items():
                out_dict[key+'_'+inner_key] = inner_value
            # dct[key] = value._asdict()
        except AttributeError:
            # Plain values (ht_id, year, subjects) were copied above
            pass
    analyzed_list.append(out_dict)
# This is what each record looks like after flattening.
analyzed_list[0]
{'ht_id': 'mdp.39015019393407',
'year': '1974',
'subjects': 'Political science History | Conservatism History. | Democracy History. | Liberalism History.',
'top_topic_top_num': 5,
'top_topic_perc': 0.3887231835384959,
'best_match_page': 162,
'best_match_top_num': 5,
'best_match_perc': 0.8628564,
'most_common_topic_top_num': 5,
'most_common_topic_perc': 97}
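The flattening step generalizes to any record whose fields are either plain values or nested namedtuples. A minimal sketch, assuming one level of nesting as in Book; the function name flatten_record is hypothetical and not part of the original script:

```python
from collections import namedtuple

Topic = namedtuple("Topic", "top_num, perc")
BestMatch = namedtuple("BestMatch", "page, top_num, perc")
Book = namedtuple("Book", ["ht_id", "year", "subjects", "top_topic", "best_match", "most_common_topic"])


def flatten_record(record):
    """Return a flat dict; nested namedtuple fields become 'outer_inner' keys."""
    flat = {}
    for key, value in record._asdict().items():
        if hasattr(value, "_asdict"):
            # Nested namedtuple: prefix each inner field with the outer field name
            for inner_key, inner_value in value._asdict().items():
                flat[f"{key}_{inner_key}"] = inner_value
        else:
            flat[key] = value
    return flat


book = Book(
    ht_id="mdp.39015019393407",
    year="1974",
    subjects="Political science History",
    top_topic=Topic(5, 0.3887231835384959),
    best_match=BestMatch(162, 5, 0.8628564),
    most_common_topic=Topic(5, 97),
)
flat_book = flatten_record(book)
print(sorted(flat_book))
```

A list of such dictionaries can be passed directly to pd.DataFrame.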
Next we create the dataframe we will use, and reorder the columns.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.DataFrame(analyzed_list)
columns = [
'ht_id',
'year',
'subjects',
'top_topic_top_num',
'top_topic_perc',
'best_match_page',
'best_match_top_num',
'best_match_perc',
'most_common_topic_top_num',
'most_common_topic_perc',
]
df = df[columns]
The publication year used here can easily be retrieved from the volume output; it was pulled in through the HathiTrust FeatureReader. The entire MARC record is available through the feature reader. Unfortunately, I forgot to include the title information in pool_process.py.
But we do need to add our selected Topic titles.
topic_num2name = {
0: 'Black Experience',
1: 'Context of Migrant Experience',
3: 'Communal Experience',
5: 'Social, Political, Economic Migrations',
6: 'Immigration and American Christianity',
11: 'Religion and Culture',
}
import sqlite3
# Unfortunately this data is not available in the GitHub repo, but it won't be necessary if you include
# the proper additions in the `pool_process.py` script
conn = sqlite3.connect('../data/politheo.db')
cur = conn.cursor()
'''
def get_year(row):
    ht_id = row['ht_id']
    query = 'SELECT date FROM hathitrust_rec WHERE htitem_id = ?'
    cur.execute(query, (ht_id, ))
    year = cur.fetchone()
    year = int(year[0].split('-')[0])
    return year

df['date'] = df.apply(get_year, axis=1)
df.columns
'''
Next, the years are grouped by decade.
df['decade'] = df.apply(lambda x: (int(x['year'])//10)*10, axis=1)
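The best-match table further down contains decade rows 0 and 9990, and some records in the final table carry the year 9999, so some year values appear to be placeholders rather than real dates. A minimal sketch of a decade function that sets such values aside; the name to_decade and the cut-off bounds are assumptions, not part of the original analysis:

```python
def to_decade(year, earliest=1400, latest=2100):
    """Map a year (str or int) to its decade, or None if it looks like a placeholder."""
    try:
        value = int(year)
    except (TypeError, ValueError):
        return None
    # The bounds are assumed for illustration; adjust them to the corpus.
    if not earliest <= value <= latest:
        return None
    return (value // 10) * 10


for raw in ["1974", "9999", "0", None]:
    print(raw, to_decade(raw))
```

Applied with df['year'].map(to_decade), rows with unknown years would get a missing decade, which groupby leaves out by default.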
This chart shows how many pages are dominated by a particular topic over time. This is grouped by decade.
most_common_df = df.groupby(['most_common_topic_top_num', 'decade'])['most_common_topic_perc'].sum().unstack('most_common_topic_top_num')
most_common_df.plot(kind='bar', stacked=True)
This final part saves the chart as a PNG file.
most_common_plot = most_common_df.plot(kind='bar', stacked=True).get_figure()
most_common_plot.savefig('./output/most_common_plot.png')
The top_topic column is an average rating of how well the topics match a particular volume. This chart takes an average of those averages by decade to see whether any decades stand out for a particular topic.
top_topic_df = df.groupby(['top_topic_top_num', 'decade'])['top_topic_perc'].mean().unstack('top_topic_top_num')
top_topic_df.plot(kind='bar', stacked=True)
# This code saves the chart to a PNG file.
top_topic_fig = top_topic_df.plot(kind='bar', stacked=True).get_figure()
top_topic_fig.savefig('./output/top_topic_fig.png')
I tried a couple of approaches to find the best match. The best match was calculated by finding the topic that best matched the model and recording which page in the volume it came from. I thought it might be interesting to see which decade had the best of the best matches, but it turns out that this wasn’t a particularly helpful measure.
best_match_df = df.groupby(['best_match_top_num', 'decade'])['best_match_perc'].max().unstack('best_match_top_num')
best_match_df.plot(kind='bar', stacked=True)
Digging into the actual dataframe that produces the chart is a little more interesting. Where a decade has only one or two matches, it is because only one or two books represent that decade. This led to the realization that we need a more precise way of looking at the best match data.
best_match_df
decade / best_match_top_num | 0 | 1 | 3 | 5 | 6 | 11 |
---|---|---|---|---|---|---|
0 | 0.555335 | 0.862857 | NaN | 0.880000 | 0.680000 | NaN |
1740 | NaN | 0.759997 | NaN | NaN | NaN | NaN |
1760 | NaN | NaN | NaN | NaN | 0.696676 | NaN |
1780 | NaN | NaN | NaN | NaN | NaN | 0.808000 |
1790 | NaN | NaN | NaN | 0.679999 | 0.862856 | NaN |
1810 | 0.985231 | NaN | NaN | 0.906667 | 0.893333 | 0.839999 |
1820 | 0.951826 | NaN | NaN | 0.931428 | 0.931428 | 0.755001 |
1830 | 0.840000 | 0.807999 | NaN | 0.893333 | 0.833935 | NaN |
1840 | 0.842599 | 0.807999 | NaN | 0.804002 | 0.862857 | NaN |
1850 | 0.840000 | 0.879999 | NaN | 0.687734 | 0.917567 | NaN |
1860 | 0.679998 | NaN | NaN | 0.839999 | 0.760000 | NaN |
1870 | NaN | 0.807998 | 0.807998 | 0.879999 | NaN | NaN |
1880 | 0.840000 | 0.839999 | NaN | 0.879999 | 0.880000 | NaN |
1890 | 0.673333 | 0.904000 | 0.808000 | 0.958261 | NaN | 0.804000 |
1900 | 0.760000 | 0.880000 | NaN | 0.940000 | 0.862856 | NaN |
1910 | 0.807999 | 0.740874 | 0.519999 | 0.903999 | 0.893333 | 0.840000 |
1920 | 0.808000 | 0.807999 | 0.893333 | 0.903999 | 0.868341 | 0.671111 |
1930 | 0.860000 | 0.829560 | 0.734630 | 0.949473 | 0.893333 | 0.718704 |
1940 | 0.862857 | 0.863058 | 0.617596 | 0.943529 | 0.970402 | 0.671346 |
1950 | 0.807999 | 0.807999 | 0.903999 | 0.943529 | 0.896471 | 0.912727 |
1960 | 0.912727 | 0.929932 | 0.903999 | 0.952000 | 0.936000 | 0.862857 |
1970 | 0.952000 | 0.880000 | 0.956364 | 0.959167 | 0.971765 | 0.920000 |
1980 | 0.900811 | 0.912727 | 0.987027 | 0.982545 | 0.979575 | 0.920000 |
1990 | 0.958261 | 0.961600 | 0.893333 | 0.969032 | 0.979130 | 0.936000 |
2000 | 0.913540 | 0.912727 | 0.986426 | 0.989327 | 0.972000 | 0.949474 |
2010 | 0.679998 | NaN | NaN | 0.880000 | NaN | NaN |
9990 | 0.956364 | 0.931428 | 0.880000 | 0.954286 | 0.972571 | 0.953334 |
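One way to make the best-match table more precise is to report how many books stand behind each maximum. A minimal sketch in plain Python, assuming records shaped like the flattened dictionaries above with an added decade key; the sample values and the function name max_with_count are made up for illustration:

```python
from collections import defaultdict


def max_with_count(records):
    """Group by (decade, topic) and return {key: (max_perc, number_of_books)}."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec["decade"], rec["best_match_top_num"])].append(rec["best_match_perc"])
    return {key: (max(values), len(values)) for key, values in grouped.items()}


# Made-up sample records, not taken from the corpus.
sample_records = [
    {"decade": 1740, "best_match_top_num": 1, "best_match_perc": 0.76},
    {"decade": 1970, "best_match_top_num": 5, "best_match_perc": 0.86},
    {"decade": 1970, "best_match_top_num": 5, "best_match_perc": 0.95},
    {"decade": 1970, "best_match_top_num": 6, "best_match_perc": 0.97},
]
summary = max_with_count(sample_records)
for key in sorted(summary):
    print(key, summary[key])
```

In pandas the same idea could be expressed with df.groupby(['decade', 'best_match_top_num'])['best_match_perc'].agg(['max', 'count']), which puts the number of contributing books next to each maximum.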
I thought it would be worth digging into the data to see which books had both a high best match and a high representation of that topic across the book. The first step creates a dataframe where the most common topic and the best match topic are the same. This should favor works that are actually about the topic we are interested in.
# The Top Five Pages for Each topic, Title, and Year
# df.groupby('best_match_top_num')['best_match_top_num', 'ht_id', 'best_match_page', 'best_match_perc'].head()
bmatch_df = df[df['most_common_topic_top_num'] == df['best_match_top_num']].sort_values(by=['best_match_top_num', 'best_match_perc'], ascending=False)
# The filtered dataframe still has all of the original columns
bmatch_df.columns
Index(['ht_id', 'year', 'subjects', 'top_topic_top_num', 'top_topic_perc',
'best_match_page', 'best_match_top_num', 'best_match_perc',
'most_common_topic_top_num', 'most_common_topic_perc', 'decade'],
dtype='object')
These two functions can add additional information from the HathiTrust database created in the Political Theological project. The title could also be added to the dataframe directly if pool_process.py recorded it when it is run over the entire corpus.
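def find_title(row):
    # Look up the title for this volume in the local HathiTrust records table
    query = 'SELECT title FROM hathitrust_rec WHERE htitem_id = ?'
    cur.execute(query, (row['ht_id'], ))
    title = cur.fetchone()
    # fetchone() returns None when no record matches
    return title[0] if title else None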
'''
def find_subjects(row):
    query = 'SELECT subject_heading FROM htitem2subjhead WHERE htitem_id = ?'
    cur.execute(query, (row['ht_id'], ))
    subjects = cur.fetchall()
    return ' | '.join([x[0] for x in subjects])
'''
# apply find_title to the dataframe
bmatch_df['title'] = bmatch_df.apply(find_title, axis=1)
# add the topic name to the dataframe
bmatch_df['top_nam'] = bmatch_df.apply(lambda x: topic_num2name[x['best_match_top_num']], axis=1)
# add the subjects to the dataframe.
# Subjects likewise are available in the hathitrust record reader
# bmatch_df['subjects'] = bmatch_df.apply(find_subjects, axis=1)
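Calling find_title once per row issues one query per volume. A minimal sketch of an alternative that reads all titles in a single query and looks them up from a dictionary; it uses an in-memory database with made-up rows so that it runs on its own, and the helper name load_titles is hypothetical. Only the table and column names are taken from the queries above:

```python
import sqlite3


def load_titles(connection):
    """Return {htitem_id: title} for every row in hathitrust_rec."""
    rows = connection.execute("SELECT htitem_id, title FROM hathitrust_rec")
    # Each row is a (htitem_id, title) pair, so dict() builds the lookup directly
    return dict(rows)


# Stand-in for ../data/politheo.db, filled with illustrative rows only.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE hathitrust_rec (htitem_id TEXT PRIMARY KEY, title TEXT)")
demo_conn.executemany(
    "INSERT INTO hathitrust_rec VALUES (?, ?)",
    [("demo.001", "First demo title"), ("demo.002", "Second demo title")],
)
titles = load_titles(demo_conn)
print(titles.get("demo.001"))
print(titles.get("missing.id"))
```

With the real connection, bmatch_df['ht_id'].map(titles) would then fill the title column, leaving volumes without a record as missing values.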
# This cell reorders the columns in a more intuitive order
bmatch_df = bmatch_df[
['top_nam',
'best_match_top_num',
'ht_id',
'title',
'year',
'subjects',
'best_match_page',
'best_match_perc',
'most_common_topic_top_num',
'most_common_topic_perc'
]
].reset_index(drop=True)
The following DataFrame shows the top five records for each topic.
bmatch_df.groupby('best_match_top_num').head(5)
index | top_nam | best_match_top_num | ht_id | title | year | subjects | best_match_page | best_match_perc | most_common_topic_top_num | most_common_topic_perc |
---|---|---|---|---|---|---|---|---|---|---|
0 | Religion and Culture | 11 | mdp.39015063320546 | New approaches to the study of religion / edit... | 9999 | Religion Study and teaching History 20th century. | 501 | 0.953334 | 11 | 166 |
1 | Religion and Culture | 11 | mdp.39015063320546 | New approaches to the study of religion / edit... | 9999 | Religion Study and teaching History 20th century. | 501 | 0.953334 | 11 | 166 |
2 | Religion and Culture | 11 | uva.x006167926 | Communio viatorum. v.43-44 2001-2002 | 2002 | Theology Periodicals. | 563 | 0.949474 | 11 | 64 |
3 | Religion and Culture | 11 | inu.30000004992420 | Synthesis philosophica. | 9999 | Philosophy Periodicals. | 13 | 0.926154 | 11 | 84 |
4 | Religion and Culture | 11 | uva.x006090923 | Post-theism : reframing the Judeo-Christian tr... | 2000 | Theism. \| Christianity and other religions Jud... | 102 | 0.926154 | 11 | 122 |
302 | Immigration and American Christianity | 6 | uva.x000685789 | Nairobi to Vancouver : 1975-1983 : Report of t... | 1983 | Ecumenical movement Congresses. | 267 | 0.979575 | 6 | 96 |
303 | Immigration and American Christianity | 6 | ien.35556030059356 | Catholic Eastern Churches : heritage and ident... | 1994 | Catholic Church Malabar rite. \| Catholic Churc... | 28 | 0.974737 | 6 | 173 |
304 | Immigration and American Christianity | 6 | mdp.39015021629806 | The Mennonite encyclopedia : a comprehensive r... | 9999 | Anabaptists Dictionaries. \| Mennonites Diction... | 836 | 0.972571 | 6 | 514 |
305 | Immigration and American Christianity | 6 | mdp.39015021629806 | The Mennonite encyclopedia : a comprehensive r... | 9999 | Anabaptists Dictionaries. \| Mennonites Diction... | 836 | 0.972571 | 6 | 514 |
306 | Immigration and American Christianity | 6 | uva.x030152181 | Grant$ for religion, religious welfare, & ... | 2007 | Endowments United States Directories. \| Religi... | 232 | 0.972000 | 6 | 174 |
1764 | Social, Political, Economic Migrations | 5 | mdp.39015054048577 | Ulrich's periodicals directory. 2005 v.3 | 2005 | Periodicals Directories. | 917 | 0.989327 | 5 | 1386 |
1765 | Social, Political, Economic Migrations | 5 | mdp.39015047344901 | Index of conference proceedings received. 1983 | 1983 | Congresses and conventions Bibliography Catalogs. | 147 | 0.982545 | 5 | 419 |
1766 | Social, Political, Economic Migrations | 5 | mdp.39015047344901 | Index of conference proceedings received. 1983 | 1983 | Congresses and conventions Bibliography Catalogs. | 147 | 0.982545 | 5 | 419 |
1767 | Social, Political, Economic Migrations | 5 | uc1.b2505364 | British qualifications. 1988 (19th) | 1988 | Professional education Great Britain Directori... | 751 | 0.973333 | 5 | 148 |
1768 | Social, Political, Economic Migrations | 5 | uc1.b5032677 | British qualifications : a comprehensive guide... | 1985 | Professional education Great Britain Directori... | 310 | 0.972571 | 5 | 190 |
5751 | Communal Experience | 3 | mdp.39015049827705 | Economic relations between Scandinavia and ASE... | 1986 | ASEAN Economic relations Scandinavia. \| Scandi... | 240 | 0.808000 | 3 | 84 |
5752 | Communal Experience | 3 | mdp.39015049827705 | Economic relations between Scandinavia and ASE... | 1986 | ASEAN Economic relations Scandinavia. \| Scandi... | 240 | 0.808000 | 3 | 84 |
5753 | Communal Experience | 3 | mdp.39015074694293 | Report. v.16-17 1890-1892 | 1892 | | 259 | 0.760000 | 3 | 21 |
5754 | Communal Experience | 3 | uiug.30112121408451 | Who's who in New England; a biographical dicti... | 1938 | New England Biography. | 405 | 0.734630 | 3 | 510 |
5755 | Communal Experience | 3 | uc1.$b398050 | Joy of the worm. | 1969 | | 128 | 0.720001 | 3 | 5 |
5758 | Context of Migrant Experience | 1 | inu.30000067911481 | Agwọ Loro ibe ya in Imo State : omen or proph... | 1999 | Imo State (Nigeria) Moral conditions. \| Imo St... | 176 | 0.961600 | 1 | 37 |
5759 | Context of Migrant Experience | 1 | mdp.39015032357645 | Special report - Center for Southeast Asian St... | 1994 | Southeast Asia. | 101 | 0.940000 | 1 | 48 |
5760 | Context of Migrant Experience | 1 | mdp.39015060606442 | Sublime historical experience / Frank Ankersmit. | 2005 | History Philosophy. | 489 | 0.912727 | 1 | 174 |
5761 | Context of Migrant Experience | 1 | mdp.39015060606442 | Sublime historical experience / Frank Ankersmit. | 2005 | History Philosophy. | 489 | 0.912727 | 1 | 174 |
5762 | Context of Migrant Experience | 1 | mdp.39015060606442 | Sublime historical experience / Frank Ankersmit. | 2005 | History Philosophy. | 489 | 0.912727 | 1 | 174 |
6108 | Black Experience | 0 | mdp.39015068382343 | The cyclopædia; or, Universal dictionary of ar... | 1819 | Encyclopedias and dictionaries. | 352 | 0.985231 | 0 | 35 |
6109 | Black Experience | 0 | njp.32101078163753 | The cyclopædia, or, Universal dictionary of ar... | 1825 | Encyclopedias and dictionaries. | 361 | 0.951826 | 0 | 30 |
6110 | Black Experience | 0 | njp.32101078163753 | The cyclopædia, or, Universal dictionary of ar... | 1825 | Encyclopedias and dictionaries. | 361 | 0.951826 | 0 | 30 |
6111 | Black Experience | 0 | njp.32101078163753 | The cyclopædia, or, Universal dictionary of ar... | 1825 | Encyclopedias and dictionaries. | 361 | 0.951826 | 0 | 30 |
6112 | Black Experience | 0 | mdp.39015068388142 | The cyclopaedia; or, Universal dictionary of a... | 1824 | | 361 | 0.950610 | 0 | 28 |
# This cell saves the dataframe to an Excel file
bmatch_df.to_excel('./output/best_match.xlsx', index=False)