The news media has evolved from newspapers, tabloids, and magazines to digital forms such as online news platforms, blogs, social media feeds, and news mobile apps. News outlets have benefitted from the widespread use of social media and mobile platforms by delivering updated news to their subscribers in near real time.
It has become easier for consumers to get the latest news at their fingertips. These digital media platforms have become very powerful because they are easily accessible worldwide and allow users to discuss, share ideas, and debate issues such as democracy, education, health, research, and history.
However, alongside these advantages, false/fake news articles on digital platforms are becoming very common and are often used with negative intent, such as gaining political or financial benefit, creating biased opinions, manipulating mindsets, and spreading absurdity.
The primary goal of this project is to develop a robust machine learning model capable of accurately classifying news articles as either real or fake. The deep learning models to be evaluated are RNN, LSTM, and GRU.
Fake news is the deliberate presentation of (typically) false or misleading claims as news, where the claims are misleading by design.
In this study I have used the Fake News dataset from Kaggle to classify unreliable news articles as fake news using sequence-based deep learning techniques. The full training dataset has the following attributes:
- id : unique id for a news article
- title: the title of a news article
- author: author of the news article
- text : the text of the article; could be incomplete
- label : a label that marks the article as potentially unreliable
(1 : unreliable, 0 : reliable)
(i) Business metrics : Fake News Detection Rate (month-on-month, weekly/quarterly) etc.
(ii) Data-related metrics: Precision, Recall, and F1-score, where
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1-score = harmonic mean of Precision and Recall = 2 · (Precision · Recall) / (Precision + Recall)
with TP = True Positive, FP = False Positive, and FN = False Negative.
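As a quick illustration of these data-related metrics, here is a minimal sketch using scikit-learn on hypothetical toy labels (not the project's evaluation code, which appears later in performance_report):
from sklearn.metrics import precision_score, recall_score, f1_score
# Hypothetical toy labels: 1 = unreliable (fake), 0 = reliable
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall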
Collaboration with Product Engineering: Work with product engineers to integrate the model into the news platform. Ensure that the model's predictions are seamlessly incorporated into the user experience, such as through warnings or article labeling. (not part of this project as of now)
Collaboration with DevOps: Collaborate with DevOps to deploy the model in a scalable and efficient manner. Implement automated model retraining pipelines to keep the model up-to-date. Monitor the model's performance in production and quickly address any issues or drift. (not part of this project as of now)
# Download data source
#!wget -P ./../input/ https://www.kaggle.com/c/fake-news/data
#!wget -P ./../resource/glove/ http://nlp.stanford.edu/data/glove.6B.zip
#!unzip ./../resource/glove/glove.6B.zip -d ./../resource/glove/
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
def read_data(filename, **kwargs):
    raw_data = pd.read_csv(filename, **kwargs)
    return raw_data
# Setup Directory Path
root_dir = str(Path().resolve().parent)
input_dir = root_dir+"/input/"
output_dir = root_dir+"/output/"
model_dir = root_dir+"/model/"
image_dir = 'images/'
# READ DATASET
news_df= read_data("../input/train.csv")
submit_test = read_data('../input/test.csv')
submit_label = read_data('../input/submit.csv')
submit_test['label'] = submit_label.label
print(" Shape of News data :: ", news_df.shape)
print(" News data columns", news_df.columns)
print(" Test columns", submit_test.columns)
news_df.head()
Shape of News data :: (20800, 5) News data columns Index(['id', 'title', 'author', 'text', 'label'], dtype='object') Test columns Index(['id', 'title', 'author', 'text', 'label'], dtype='object')
 | id | title | author | text | label
---|---|---|---|---|---
0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |
submit_test.head()
 | id | title | author | text | label
---|---|---|---|---|---
0 | 20800 | Specter of Trump Loosens Tongues, if Not Purse... | David Streitfeld | PALO ALTO, Calif. — After years of scorning... | 0 |
1 | 20801 | Russian warships ready to strike terrorists ne... | NaN | Russian warships ready to strike terrorists ne... | 1 |
2 | 20802 | #NoDAPL: Native American Leaders Vow to Stay A... | Common Dreams | Videos #NoDAPL: Native American Leaders Vow to... | 0 |
3 | 20803 | Tim Tebow Will Attempt Another Comeback, This ... | Daniel Victor | If at first you don’t succeed, try a different... | 1 |
4 | 20804 | Keiser Report: Meme Wars (E995) | Truth Broadcast Network | 42 mins ago 1 Views 0 Comments 0 Likes 'For th... | 1 |
#Text word statistics: min, mean, max and interquartile range
text_len = news_df.text.str.split().str.len()
text_len.describe()
count 20761.000000 mean 760.308126 std 869.525988 min 0.000000 25% 269.000000 50% 556.000000 75% 1052.000000 max 24234.000000 Name: text, dtype: float64
#Title statistics
title_len = news_df.title.str.split().str.len()
title_len.describe()
count 20242.000000 mean 12.420709 std 4.098735 min 1.000000 25% 10.000000 50% 13.000000 75% 15.000000 max 72.000000 Name: title, dtype: float64
The statistics of the train and test datasets are given above.
From the columns ['id', 'title', 'author', 'text', 'label'] we will not include id and author.
Our experiment will use both text and title together.
sns.countplot(x="label", data=news_df);
plt.show()
Sequence prediction is a modelling problem in which you have a sequence of entries and the task is to predict an output from that sequence. The mapping between input and output can vary (one-to-many, many-to-one, many-to-many). The difficulty of this problem lies in the fact that sequences can vary in length.
Sequence problem (many-to-one): we are going to predict a single label from a sequence of many words.
Note: the challenge is to handle long input sequences.
Recurrent Neural Networks (RNNs) are designed to work with sequential data. An RNN uses the previous information in the sequence to produce the current output.
An LSTM has a similar control flow to a plain recurrent neural network, but the operations within the LSTM's cells allow it to propagate information across long sequences. It mitigates the problems of short-term memory and vanishing gradients.
The workflow of a GRU is the same as that of an LSTM; the difference lies in the operations inside the GRU unit.
Many-to-one sequence problem: the news article is a sequence of many words, whereas the output is only a single label (a minimal sketch follows below).
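To make this many-to-one setup concrete, here is a minimal sketch with hypothetical sizes (the actual models, GloVe embeddings, and parameters are built later in this notebook):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
demo_vocab_size = 10000   # hypothetical vocabulary size, for illustration only
demo_model = Sequential()
demo_model.add(Embedding(input_dim=demo_vocab_size, output_dim=50))  # word id -> 50-dim vector
demo_model.add(SimpleRNN(64))                    # reads the whole word sequence, returns one summary vector
demo_model.add(Dense(1, activation='sigmoid'))   # single output: probability that the article is unreliable
demo_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Input: a batch of padded word-id sequences, shape (batch_size, sequence_length)
# Output: one probability per article, shape (batch_size, 1)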
## Constants Used for cleaning the datasets
column_names = ['id', 'title', 'author', 'text', 'label']
remove_columns = ['id','author']
categorical_features = []
target_col = ['label']
text_features = ['title', 'text']
## Clean Datasets
import nltk
from nltk.corpus import stopwords
#nltk.download('stopwords')
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from collections import Counter
ps = PorterStemmer()
wnl = nltk.stem.WordNetLemmatizer()
stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)
# Remove unused columns
def remove_unused_columns(df, column_names=remove_columns):
    df = df.drop(column_names, axis=1)
    return df
# Impute null values with "None"
def null_processing(feature_df):
    for col in text_features:
        feature_df.loc[feature_df[col].isnull(), col] = "None"
    return feature_df

def clean_datasets(df):
    # remove unused columns
    df = remove_unused_columns(df)
    # impute null values
    df = null_processing(df)
    return df
## Cleaning text of unwanted characters
def clean_text(text):
    text = re.sub(r'http[\w:/\.]+', ' ', str(text))  # remove urls
    text = re.sub(r'[^\.\w\s]', ' ', text)           # remove everything but word characters, whitespace and periods
    #text = re.sub(r'[^a-zA-Z]', ' ', text)          # optionally keep letters only (would also drop digits)
    text = re.sub(r'\s\s+', ' ', text)               # collapse repeated whitespace
    text = text.lower().strip()
    return text
## NLTK preprocessing includes:
# Stop-word removal,
# Stemming and
# Lemmatization
# For our project we use stop-word removal and lemmatization
def nltk_preprocesing(text):
    text = clean_text(text)
    wordlist = re.sub(r'[^\w\s]', '', text).split()
    #text = ' '.join([word for word in wordlist if word not in stopwords_dict])
    #text = [ps.stem(word) for word in wordlist if not word in stopwords_dict]
    text = ' '.join([wnl.lemmatize(word) for word in wordlist if word not in stopwords_dict])
    return text
df = clean_datasets(news_df)
df_test = clean_datasets(submit_test)
df["text"] = df.text.apply(nltk_preprocesing)
df_test["text"] = df_test.text.apply(nltk_preprocesing)
df["title"] = df.title.apply(nltk_preprocesing)
df_test["title"] = df_test.title.apply(nltk_preprocesing)
df.head()
 | title | text | label
---|---|---|---
0 | house dem aide didnt even see comeys letter ja... | house dem aide didnt even see comeys letter ja... | 1 |
1 | flynn hillary clinton big woman campus breitbart | ever get feeling life circle roundabout rather... | 0 |
2 | truth might get fired | truth might get fired october 29 2016 tension ... | 1 |
3 | 15 civilian killed single u airstrike identified | video 15 civilian killed single u airstrike id... | 1 |
4 | iranian woman jailed fictional unpublished sto... | print iranian woman sentenced six year prison ... | 1 |
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
wordcloud = WordCloud( background_color='black', width=800, height=600)
text_cloud = wordcloud.generate(' '.join(df['text']))
plt.figure(figsize=(20,30))
plt.imshow(text_cloud)
plt.axis('off')
plt.show()
true_news = ' '.join(df[df['label']==0]['text'])
wc = wordcloud.generate(true_news)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()
fake_news = ' '.join(df[df['label']==1]['text'])
wc= wordcloud.generate(fake_news)
plt.figure(figsize=(20,30))
plt.imshow(wc)
plt.axis('off')
plt.show()
An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.
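For example, a quick illustration of word bigrams with nltk (toy sentence for illustration only):
import nltk
list(nltk.ngrams("the quick brown fox".split(), 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]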
true_bigrams = (pd.Series(nltk.ngrams(true_news.split(), 2)).value_counts())[:20]
true_bigrams.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('Top 20 Frequently Occurring True News Bigrams')
plt.ylabel('Bigram')
plt.xlabel('Number of Occurrences')
plt.show()
fake_bigrams = (pd.Series(nltk.ngrams(fake_news.split(), 2)).value_counts())[:20]
fake_bigrams.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('Top 20 Frequently Occurring Fake News Bigrams')
plt.ylabel('Bigram')
plt.xlabel('Number of Occurrences')
plt.show()
true_trigrams = (pd.Series(nltk.ngrams(true_news.split(), 3)).value_counts())[:20]
true_trigrams.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('Top 20 Frequently Occurring True News Trigrams')
plt.ylabel('Trigram')
plt.xlabel('Number of Occurrences')
plt.show()
fake_trigrams = (pd.Series(nltk.ngrams(fake_news.split(), 3)).value_counts())[:20]
fake_trigrams.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('Top 20 Frequently Occurring Fake News Trigrams')
plt.ylabel('Trigram')
plt.xlabel('Number of Occurrences')
plt.show()
## Merge text features together
def merge_text_features(df, text_features=text_features):
    df['news'] = df[text_features].agg(' '.join, axis=1)
    print("Merge news text statistics::\n ", df.news.str.split().str.len().describe())
    return df
## Preparing datasets
def preparing_datasets(df):
    XY = merge_text_features(df)
    #XY["news"] = XY.news.apply(clean_text)
    print(" Cleaning as remove special character is done..")
    print(XY.head())
    #XY["news"] = XY.news.apply(nltk_preprocesing)
    X = XY['news']
    y = XY.label
    print("Text len statistic after Merge news and preprocessing::\n ", X.str.split().str.len().describe())
    if y.dtype == 'object':
        y = process_labels(y)
    return X, y
## preprocessing datasets
print(" Training data preprocessing ")
X,y = preparing_datasets(df)
print(" Test data preprocessing ")
X_test,y_test = preparing_datasets(df_test)
Training data preprocessing Merge news text statistics:: count 20800.000000 mean 440.236490 std 490.971666 min 1.000000 25% 166.000000 50% 331.000000 75% 602.000000 max 20731.000000 Name: news, dtype: float64 Cleaning as remove special character is done.. title \ 0 house dem aide didnt even see comeys letter ja... 1 flynn hillary clinton big woman campus breitbart 2 truth might get fired 3 15 civilian killed single u airstrike identified 4 iranian woman jailed fictional unpublished sto... text label \ 0 house dem aide didnt even see comeys letter ja... 1 1 ever get feeling life circle roundabout rather... 0 2 truth might get fired october 29 2016 tension ... 1 3 video 15 civilian killed single u airstrike id... 1 4 print iranian woman sentenced six year prison ... 1 news 0 house dem aide didnt even see comeys letter ja... 1 flynn hillary clinton big woman campus breitba... 2 truth might get fired truth might get fired oc... 3 15 civilian killed single u airstrike identifi... 4 iranian woman jailed fictional unpublished sto... Text len statistic after Merge news and preprocessing:: count 20800.000000 mean 440.236490 std 490.971666 min 1.000000 25% 166.000000 50% 331.000000 75% 602.000000 max 20731.000000 Name: news, dtype: float64 Test data preprocessing Merge news text statistics:: count 5200.000000 mean 449.020192 std 478.439019 min 2.000000 25% 174.000000 50% 342.000000 75% 619.000000 max 11231.000000 Name: news, dtype: float64 Cleaning as remove special character is done.. title \ 0 specter trump loosens tongue purse string sili... 1 russian warship ready strike terrorist near al... 2 nodapl native american leader vow stay winter ... 3 tim tebow attempt another comeback time baseba... 4 keiser report meme war e995 text label \ 0 palo alto calif year scorning political proces... 0 1 russian warship ready strike terrorist near al... 1 2 video nodapl native american leader vow stay w... 0 3 first dont succeed try different sport tim teb... 1 4 42 min ago 1 view 0 comment 0 like first time ... 1 news 0 specter trump loosens tongue purse string sili... 1 russian warship ready strike terrorist near al... 2 nodapl native american leader vow stay winter ... 3 tim tebow attempt another comeback time baseba... 4 keiser report meme war e995 42 min ago 1 view ... Text len statistic after Merge news and preprocessing:: count 5200.000000 mean 449.020192 std 478.439019 min 2.000000 25% 174.000000 50% 342.000000 75% 619.000000 max 11231.000000 Name: news, dtype: float64
We can observe that the mean number of words per news record is about 440 and the 75% quartile is about 602 words. Based on these statistics, we can fix a suitable equal word-sequence length for all news articles.
We use the Keras tokenizer to convert each text into a sequence of word indices, and then create the vocabulary using the fit_on_texts method of the tokenizer.
from keras_preprocessing.sequence import pad_sequences
from keras_preprocessing.text import Tokenizer
## Split datasets into train and test sets to evaluate the model with test size: 20%
#X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
X_train,y_train = X,y
print("Train data counts::",X_train.shape)
print("Test data counts::",X_test.shape)
oov_token = "<OOV>" # it will be added to word_index and used to replace
#out-of-vocabulary words during text_to_sequence calls
vocab_size = 100000 #the maximum number of words to keep, based
#on word frequency
#tokenizer = Tokenizer(oov_token=oov_token)
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
text = X_train#['text']
tokenizer.fit_on_texts(text)
word_index = tokenizer.word_index
print("Word Index ")
print(len(word_index.keys()))
max_text_length = 100
#vocab_size = len(word_index)
Train data counts:: (20800,) Test data counts:: (5200,) Word Index 201482
def prepare_seqence_data(df, tokenizer):
    # Transforms each text in texts to a sequence of integers.
    # Create Sequence
    print(" Create Sequence of tokens ")
    text_sequences = tokenizer.texts_to_sequences(df)
    print("Text to sequence of Id:: ", text_sequences[0:1])
    return text_sequences
Padding is used to make all input sequences the same length.
padding_type = "post" # Type of padding pre or post of input sequence
trunction_type="post" # Type of truncation used to truncate input sequence if exceed from maximum sequence length
def pad_sequence_data(text_sequences,max_text_length):
# Pad the Sequences, because the sequences are not of the same length,
# so let’s pad them to make them of similar length
text_padded = pad_sequences(text_sequences, maxlen=max_text_length, padding=padding_type,
truncating=trunction_type)
return text_padded
train_text_seq = prepare_seqence_data(X_train,tokenizer)
test_text_seq = prepare_seqence_data(X_test,tokenizer)
train_text_padded = pad_sequence_data(train_text_seq,max_text_length)
test_text_padded = pad_sequence_data(test_text_seq,max_text_length)
print("Padded Sequence :: ", test_text_padded[0:1])
print(" Tokenizer detail :: ", tokenizer.document_count)
print('Vocabulary size:', len(tokenizer.word_counts))
print('Shape of data padded:', train_text_padded.shape)
Create Sequence of tokens Text to sequence of Id:: [[41, 4596, 929, 262, 20, 63, 2247, 542, 2686, 6401, 2707, 41, 4596, 929, 262, 20, 63, 2247, 542, 2686, 6401, 2707, 11198, 22889, 297, 560, 78, 2927, 2686, 6401, 9271, 17, 11110, 3675, 576, 4143, 635, 1, 892, 2474, 1, 2123, 3737, 5166, 32565, 1100, 1417, 308, 27, 1, 204, 650, 507, 87, 41, 190, 929, 134, 14, 16, 40, 1, 308, 60, 393, 507, 541, 48798, 542, 3726, 179, 547, 69, 39, 984, 28, 13, 69, 1118, 4422, 141, 2904, 356, 262, 1063, 507, 133, 1026, 1382, 5, 47, 356, 753, 40, 507, 6517, 47, 753, 190, 4422, 101, 41, 388, 3676, 3764, 356, 245, 4985, 69, 452, 1359, 153, 63, 3026, 1472, 183, 113, 542, 304, 3764, 356, 753, 2686, 6401, 226, 67, 27, 17382, 1382, 179, 19067, 2468, 179, 1186, 2187, 69, 1068, 6729, 166, 82, 4366, 2686, 6401, 31076, 297, 1132, 78, 315, 40, 82, 507, 339, 233, 4985, 69, 616, 6314, 1, 40, 1826, 4033, 6852, 2681, 1328, 185, 66, 158, 262, 254, 6401, 3675, 47, 188, 3189, 8938, 11941, 166, 28, 1, 196, 26, 11, 1036, 1634, 643, 403, 1036, 1328, 6401, 332, 179, 188, 57, 1, 1382, 3868, 14755, 116, 8614, 318, 2797, 25044, 87, 498, 41, 190, 929, 32566, 542, 39, 196, 6401, 3358, 929, 79, 19438, 2964, 141, 262, 20, 40, 2247, 542, 1, 133, 4597, 136, 190, 4422, 101, 2904, 356, 262, 1425, 2247, 542, 47, 753, 158, 190, 4422, 101, 1425, 753, 3764, 21, 1306, 356, 2686, 6401, 2707, 56, 83, 238, 63, 851, 235, 37, 179, 204, 346, 6401, 1456, 356, 753, 282, 771, 1793, 1922, 2676, 166, 1548, 6401, 1411, 4143, 238, 190, 3677, 40, 361, 87, 929, 56, 193, 136, 188, 334, 505, 11942, 507, 1007, 1580, 1941, 542, 6401, 47, 1008, 10, 393, 4685, 947, 39, 30, 95, 1716, 324, 178, 20, 1878, 82, 324, 178, 1878, 507, 404, 10276, 7896, 44367, 1774, 236, 6401, 1677, 36, 30, 3206, 7384, 11198, 15853, 134, 14, 1085, 1304, 25045, 262, 20, 9215, 7728, 4422, 101, 15618, 15854, 163, 2676, 355, 18708, 1560, 810, 5833, 52, 40, 2591, 293, 6401, 722, 4547, 12071, 47, 734, 14036, 36168, 76393, 2774, 2912, 523, 3829, 1, 667, 4064, 2775, 9363, 6603, 92, 98, 671, 3062, 47, 41, 985, 298, 436, 108, 6401, 912, 5534, 11805, 355, 184, 343, 393, 1981, 616, 8375, 409, 41, 195, 47, 221, 16, 276, 1417, 308, 27, 11198, 22889, 11198, 32567, 2408, 194, 208, 1121, 4409, 674, 418, 123, 736, 393, 101, 1027, 37, 475, 4533, 1622, 1027, 37, 1417, 32568, 10101, 714, 12813, 589, 1790, 543, 3816, 2451, 1243, 3446, 7070, 20688, 1693, 39, 40, 505, 11942, 714, 4596, 7773, 323, 136, 32569, 3817, 359, 1562, 948, 11198, 18335, 31077, 3817]] Create Sequence of tokens Text to sequence of Id:: [[9385, 4, 39293, 7627, 9834, 3306, 3365, 1996, 8, 35, 10, 17865, 12244, 3186, 11, 55896, 67, 385, 3365, 1996, 14656, 10788, 2094, 15, 72, 441, 4, 1955, 1912, 269, 300, 826, 1336, 374, 1427, 2717, 8, 2187, 15768, 820, 8880, 3198, 844, 772, 13006, 924, 107, 707, 21934, 13791, 5, 1666, 1912, 31608, 25, 1514, 3, 4, 31, 7, 1561, 4999, 2621, 89, 4248, 3758, 415, 3185, 1912, 132, 1251, 327, 542, 18926, 3, 4, 38, 2031, 5667, 890, 250, 1762, 8041, 1526, 20570, 38092, 24, 456, 1315, 359, 925, 47, 1285, 951, 8, 35, 10, 214, 690, 3, 8041, 1008, 149, 849, 65, 108, 3, 4, 2960, 20, 386, 4973, 16, 452, 667, 149, 65, 2010, 3899, 108, 412, 5247, 13403, 47, 6889, 608, 1547, 469, 651, 489, 135, 310, 14260, 3365, 1996, 631, 117, 5014, 27, 287, 170, 9332, 9980, 121, 4488, 50, 14238, 1128, 3526, 1196, 1384, 1922, 25996, 620, 2045, 3639, 10619, 135, 197, 19008, 2, 3982, 33127, 1315, 281, 1912, 15344, 109, 1393, 10210, 2179, 182, 103, 76, 182, 1304, 3, 33127, 1517, 1854, 47, 190, 1285, 3, 4, 342, 2801, 28, 13, 459, 474, 3365, 1996, 549, 
33, 94, 557, 1, 7734, 217, 2075, 53, 1427, 3022, 708, 209, 818, 791, 1, 364, 1818, 1874, 4022, 76, 192, 311, 4635, 60, 29, 708, 209, 73, 3, 33127, 2, 1973, 2260, 2788, 4158, 3365, 1996, 4543, 414, 20, 1554, 1279, 244, 900, 130, 244, 37895, 207, 504, 549, 1391, 88, 3, 13, 370, 588, 588, 65, 1912, 269, 87, 47753, 1190, 1818, 122, 671, 47753, 25, 15, 42, 1002, 1524, 65, 4488, 3198, 4357, 3472, 10261, 1807, 160, 11853, 484, 6034, 606, 7471, 3365, 1996, 8090, 904, 489, 3, 10261, 260, 42, 1427, 1008, 149, 65, 1701, 1469, 67, 250, 356, 163, 186, 4246, 60, 33, 94, 3, 10261, 9774, 2531, 5796, 233, 7, 3314, 347, 65, 875, 53, 3, 4, 476, 265, 4456, 1230, 104, 210, 1308, 18061, 3, 4, 225, 233, 882, 2384, 3, 10261, 4146, 1818, 236, 2098, 31, 2069, 39, 667, 2356, 209, 28, 539, 575, 200, 207, 1008, 1525, 115, 31, 349, 3, 10261, 26901, 1589, 69, 547, 3899, 2305, 1223, 186, 745, 4858, 20, 1701, 1469, 1002, 24687, 65, 31, 2069, 178, 13729, 806, 671, 1912, 2360, 17388, 1757, 1912, 234, 10, 100, 278, 61879, 3198, 844, 772, 27570, 9704, 48054, 21361, 667, 1393, 209, 15074, 2024, 82, 2, 995, 3630, 3, 13, 1026, 121, 212, 24, 10, 295, 1310, 1277, 173, 274, 1463, 90, 4424, 489, 2, 4, 795, 15, 7, 4066, 1677, 222, 4723, 9422, 145, 3, 13, 233, 7736, 2327, 2, 427, 3, 82, 667, 849, 209, 28, 539, 575, 6466, 8014, 1, 318, 1360, 1055, 391, 290, 3, 13, 108, 1427, 224, 108, 173, 411, 1489, 173, 510, 2, 1807, 1677, 361, 751, 16363, 99924, 1526, 359, 2, 1008, 373, 65, 1058, 190, 31, 217, 24, 10, 623, 74141, 17738, 3111, 173, 1223, 3074, 84, 170, 3, 13, 236, 9927, 39, 3, 4, 1912, 269, 7257, 3293, 15, 42, 2075, 15364, 3365, 1996, 133, 2495, 11640, 999, 1912, 42, 114, 968, 190, 114, 11640, 999, 335, 253, 823, 15, 42, 761, 1774, 143, 195, 3198, 4357, 403, 144, 3, 13, 6609, 590, 3365, 1996, 8450, 36, 344, 946, 331, 786, 9145, 434, 2894, 5019, 14962, 14873, 4249, 16, 1938, 420, 139, 4702, 1072, 95, 140, 134, 14, 365, 13, 38, 1223, 199, 4, 38, 1587, 8250, 20, 3365, 1996, 57, 3, 4, 2630, 9853, 38, 54851, 299, 146, 3630, 94, 3, 13, 3198, 844, 772, 428, 1929, 3198, 1460, 617, 662, 4498, 12314, 5064, 754, 102, 3365, 1996, 67, 2458, 90844, 2144, 31, 1912, 269, 39, 362, 4894, 385, 3395, 1278, 1912, 9, 48, 10347, 817, 417, 48, 173, 3, 33127, 2, 3472, 10261, 17012, 13086, 247, 2262, 174, 359, 3639, 2121, 67, 7405, 216, 39, 2777, 2064, 489, 999, 9093, 50116, 1526, 14190, 4649, 655, 68, 1398, 359, 228, 363, 3, 50116, 3705, 275, 209, 53, 2971, 4029, 460, 3, 13, 686, 53, 24, 8109, 2, 456, 4928, 4823, 201, 1662, 3, 50116, 79, 505, 4285, 1096, 304, 341, 255, 3711, 3639, 145, 7, 411, 63, 1615, 203, 332, 4833, 14, 206, 24639, 95, 10, 19, 4649, 655, 3130, 130, 1363, 3, 50116, 759, 607, 3580, 359, 1223, 199, 48, 4446, 60, 313, 31890, 2, 4205, 1334, 3198, 4357, 392, 38, 4454, 1912, 68, 224, 390, 31, 22, 1251, 415, 1864, 772, 97, 14412, 49689, 71862, 926, 6137, 569, 4446, 2798, 801, 3198, 772, 1, 4542, 144, 1391, 5219, 460, 870, 21934, 13791, 73, 189, 9418, 682, 869, 304, 10314, 11872, 1109, 744, 4770, 225, 1115, 512, 474, 889, 3740, 4955, 900, 2, 67164, 1, 1, 3198, 4357, 8906, 757, 304, 54, 22, 140, 2010, 3409]] Padded Sequence :: [[ 9385 4 39293 7627 9834 3306 3365 1996 8 35 10 17865 12244 3186 11 55896 67 385 3365 1996 14656 10788 2094 15 72 441 4 1955 1912 269 300 826 1336 374 1427 2717 8 2187 15768 820 8880 3198 844 772 13006 924 107 707 21934 13791 5 1666 1912 31608 25 1514 3 4 31 7 1561 4999 2621 89 4248 3758 415 3185 1912 132 1251 327 542 18926 3 4 38 2031 5667 890 250 1762 8041 1526 20570 38092 24 456 1315 359 925 47 1285 951 8 35 10 214 690 
3]] Tokenizer detail :: 20800 Vocabulary size: 201481 Shape of data padded: (20800, 100)
## Constants for Word Embeddings
## Embedding Parameters
emb_dim = 100
embedding_type = 'glove'
glove_dir = root_dir+'/resource/glove.6B/'
glove_file = glove_dir+"glove.6B."+str(emb_dim)+"d.txt"
vocab_size = len(word_index)+1
## Create GloVe word embedding
from tensorflow.python.keras.layers import Embedding
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from tensorflow.python.keras.preprocessing.text import one_hot
def read_glove_embedings():
    word_vec = pd.read_table(glove_file, sep=r"\s", header=None, engine='python',
                             encoding='iso-8859-1', error_bad_lines=False)
    word_vec.set_index(0, inplace=True)
    print('Found %s word vectors.' % len(word_vec))
    print('politics', word_vec.head())
    return word_vec
# GloVe embedding uses the tokenizer for
# the word index and vocab size
def glove_embedings(tokenizer):
    embeddings_index = read_glove_embedings()
    embedding_matrix = np.zeros((vocab_size, emb_dim))
    #embedding_weights = np.zeros((10000, 50))
    index_n_word = [(i, tokenizer.index_word[i]) for i in range(1, len(embedding_matrix)) if
                    tokenizer.index_word[i] in embeddings_index.index]
    idx, word = zip(*index_n_word)
    embedding_matrix[idx, :] = embeddings_index.loc[word, :].values
    return embedding_matrix
def onehot_embedding(tokenizer):
    onehot_vec = [one_hot(words, (len(tokenizer.word_counts) + 1)) for words in tokenizer.word_index.keys()]
    embedded_docs = pad_sequences(onehot_vec, padding='pre', maxlen=max_text_length)
    return embedded_docs
def build_embeddings(tokenizer):
    vocab_len = vocab_size
    print(" vocab_len ", vocab_size)
    if embedding_type == 'glove':
        embedding_matrix = glove_embedings(tokenizer)
        print(" Encoded word sequence:: ", embedding_matrix[0:10])
        embeddingLayer = Embedding(input_dim=vocab_len, output_dim=emb_dim, input_length=max_text_length,
                                   weights=[embedding_matrix], trainable=False)
    elif embedding_type == 'fasttext':
        embedding_matrix = fasttext_embedings()
        embeddingLayer = Embedding(input_dim=vocab_len, output_dim=emb_dim, input_length=max_text_length,
                                   weights=[embedding_matrix], trainable=False)
    else:
        embedding_matrix = onehot_embedding(tokenizer)
        embeddingLayer = Embedding(input_dim=vocab_len, output_dim=emb_dim, input_length=max_text_length,
                                   trainable=False)
    return embeddingLayer
embeding_layer = build_embeddings(tokenizer)
Dropout/BatchNormalization after hidden layers if required: regularization techniques that help avoid over-fitting and vanishing gradients.
Multiple Dense layers if required.
Output layer: a single sigmoid unit for binary classification.
#PARAMS
sequence_neuron_size = 100
hidden_layer_1 = 32
epochs = 20
batch_size = 256
classifier = 'binary'
import matplotlib.pyplot as plt
from tensorflow.python.keras import Input
from tensorflow.python.keras.layers import Bidirectional, LSTM, Dense, Dropout, BatchNormalization, GRU, SimpleRNN
from tensorflow.python.keras.models import Sequential
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd
from datetime import date
from os.path import exists
#Building Sequential network with
# Embedding Layer
# LSTM
# Dense
# Output Layer
def build_network_lstm(embedding_layer):
    print(" Building Sequential network ")
    model = Sequential()
    model.add(embedding_layer)
    model.add(LSTM(sequence_neuron_size))  #, return_sequences=True))
    #model.add(LSTM(100))
    model.add(Dropout(0.2))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))
    return model
def build_network_GRU(embedding_layer):
    print(" Building GRU network ")
    model = Sequential()
    model.add(embedding_layer)
    model.add(GRU(sequence_neuron_size))  #, return_sequences=True))
    model.add(Dropout(0.3))
    model.add(Dense(hidden_layer_1, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1, activation='sigmoid'))
    return model
def build_network_RNN(embedding_layer):
    print(" Building RNN network ")
    model = Sequential()
    model.add(embedding_layer)
    model.add(SimpleRNN(sequence_neuron_size))  #, return_sequences=True))
    model.add(Dropout(0.3))
    model.add(Dense(hidden_layer_1, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1, activation='sigmoid'))
    return model
def train_model(model, X_train, y_train, X_test, y_test):
    # Compile the model with loss function,
    # optimizer and metrics as minimum parameters
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    print(model.summary())
    # Train the model with the training data,
    # number of epochs and batch size as minimum parameters
    history = model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.2)  #validation_data=(X_test, y_test))
    return model, history
def performance_history(history, model_type, name):
    plt.plot(history.history['acc'])
    plt.plot(history.history['val_acc'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')
    # save the figure before plt.show(), which clears the current figure
    plt.savefig(model_dir + image_dir + model_type + '/' + name + "_performance.jpeg")
    plt.show()
def model_evaluation(model, X_test, y_test):
    score = model.evaluate(X_test, y_test, verbose=0)
    print(f'Test loss: {score[0]} / Test accuracy: {score[1]}')
    return score
def store_model(model, model_type, name):
    # Store the model as JSON and
    # store the model weights as HDF5
    # serialize model to JSON
    model_json = model.to_json()
    with open(model_dir + model_type + '/' + name + "_model.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5
    model.save_weights(model_dir + model_type + '/' + name + "_model.h5")
    print("Saved model to disk")
def performance_report(model, testX, testy):
    time = date.today()
    yhat_probs = model.predict(testX, verbose=0)
    # predict crisp classes for test set
    yhat_classes = model.predict_classes(testX, verbose=0)
    # reduce to 1d array
    yhat_probs = yhat_probs[:, 0]
    yhat_classes = yhat_classes[:, 0]
    # accuracy: (tp + tn) / (p + n)
    accuracy = accuracy_score(testy, yhat_classes)
    print('Accuracy: %f' % accuracy)
    # precision: tp / (tp + fp)
    precision = precision_score(testy, yhat_classes)
    print('Precision: %f' % precision)
    # recall: tp / (tp + fn)
    recall = recall_score(testy, yhat_classes)
    print('Recall: %f' % recall)
    # f1: 2 tp / (2 tp + fp + fn)
    f1 = f1_score(testy, yhat_classes)
    print('F1 score: %f' % f1)
    if exists(output_dir + 'report.csv'):
        total_cost_df = pd.read_csv(output_dir + 'report.csv', index_col=0)
    else:
        total_cost_df = pd.DataFrame(
            columns=['time', 'name', 'Precision', 'Recall', 'f1_score', 'accuracy'])
    total_cost_df = total_cost_df.append(
        {'time': time, 'name': name, 'Precision': precision, 'Recall': recall, 'f1_score': f1, 'accuracy': accuracy},
        ignore_index=True)
    total_cost_df.to_csv(output_dir + 'report.csv')
dash = "-"
name = "Model_"
%%time
## Build Network
model_type='RNN'
epochs = 20
batch_size = 256
name = "Model_" + str(epochs)+dash+str(batch_size)+dash+str(max_text_length)+dash+str(vocab_size)+dash
model_rnn = build_network_RNN(embeding_layer)
model_rnn,history = train_model(model_rnn,train_text_padded,y_train,test_text_padded, y_test)
Building RNN network _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, 100, 100) 20148300 _________________________________________________________________ simple_rnn_1 (SimpleRNN) (None, 100) 20100 _________________________________________________________________ dropout_3 (Dropout) (None, 100) 0 _________________________________________________________________ dense_3 (Dense) (None, 32) 3232 _________________________________________________________________ dropout_4 (Dropout) (None, 32) 0 _________________________________________________________________ dense_4 (Dense) (None, 1) 33 ================================================================= Total params: 20,171,665 Trainable params: 23,365 Non-trainable params: 20,148,300 _________________________________________________________________ None Train on 16640 samples, validate on 4160 samples Epoch 1/20 16640/16640 [==============================] - 7s 445us/step - loss: 0.6797 - acc: 0.5710 - val_loss: 0.6242 - val_acc: 0.6300 Epoch 2/20 16640/16640 [==============================] - 6s 387us/step - loss: 0.6600 - acc: 0.5939 - val_loss: 0.6510 - val_acc: 0.6221 Epoch 3/20 16640/16640 [==============================] - 7s 392us/step - loss: 0.6447 - acc: 0.6185 - val_loss: 0.6124 - val_acc: 0.6575 Epoch 4/20 16640/16640 [==============================] - 7s 400us/step - loss: 0.6364 - acc: 0.6286 - val_loss: 0.6216 - val_acc: 0.6344 Epoch 5/20 16640/16640 [==============================] - 7s 422us/step - loss: 0.6234 - acc: 0.6389 - val_loss: 0.6134 - val_acc: 0.6510 Epoch 6/20 16640/16640 [==============================] - 7s 442us/step - loss: 0.6157 - acc: 0.6570 - val_loss: 0.6296 - val_acc: 0.6293 Epoch 7/20 16640/16640 [==============================] - 8s 480us/step - loss: 0.6394 - acc: 0.6317 - val_loss: 0.6129 - val_acc: 0.6591 Epoch 8/20 16640/16640 [==============================] - 7s 432us/step - loss: 0.6211 - acc: 0.6605 - val_loss: 0.5977 - val_acc: 0.6868 Epoch 9/20 16640/16640 [==============================] - 7s 425us/step - loss: 0.6355 - acc: 0.6446 - val_loss: 0.6039 - val_acc: 0.6825 Epoch 10/20 16640/16640 [==============================] - 7s 423us/step - loss: 0.6718 - acc: 0.5864 - val_loss: 0.6680 - val_acc: 0.5721 Epoch 11/20 16640/16640 [==============================] - 7s 399us/step - loss: 0.6671 - acc: 0.5906 - val_loss: 0.6392 - val_acc: 0.6356 Epoch 12/20 16640/16640 [==============================] - 7s 405us/step - loss: 0.6587 - acc: 0.6106 - val_loss: 0.6709 - val_acc: 0.5894 Epoch 13/20 16640/16640 [==============================] - 7s 391us/step - loss: 0.6638 - acc: 0.5997 - val_loss: 0.6389 - val_acc: 0.6363 Epoch 14/20 16640/16640 [==============================] - 7s 403us/step - loss: 0.6649 - acc: 0.5847 - val_loss: 0.6387 - val_acc: 0.6221 Epoch 15/20 16640/16640 [==============================] - 7s 411us/step - loss: 0.6619 - acc: 0.5967 - val_loss: 0.6366 - val_acc: 0.6276 Epoch 16/20 16640/16640 [==============================] - 7s 417us/step - loss: 0.6458 - acc: 0.6209 - val_loss: 0.6244 - val_acc: 0.6380 Epoch 17/20 16640/16640 [==============================] - 7s 416us/step - loss: 0.6376 - acc: 0.6284 - val_loss: 0.6166 - val_acc: 0.6474 Epoch 18/20 16640/16640 [==============================] - 7s 432us/step - loss: 0.6319 - acc: 0.6391 - val_loss: 0.6089 - val_acc: 0.6548 Epoch 19/20 16640/16640 
[==============================] - 7s 406us/step - loss: 0.6210 - acc: 0.6502 - val_loss: 0.5930 - val_acc: 0.6779 Epoch 20/20 16640/16640 [==============================] - 6s 390us/step - loss: 0.6357 - acc: 0.6319 - val_loss: 0.6679 - val_acc: 0.5308 Wall time: 2min 20s
performance_history(history,model_type,name)
store_model(model_rnn,model_type,name)
model_evaluation(model_rnn,test_text_padded,y_test)
performance_report(model_rnn,test_text_padded,y_test)
Saved model to disk Test loss: 0.6846112473194416 / Test accuracy: 0.5571153846153846 Accuracy: 0.557115 Precision: 0.556318 Recall: 0.963300 F1 score: 0.705310
%%time
## Build Network
model_type = 'GRU'
name = "Model_" + str(epochs)+dash+str(batch_size)+dash+str(max_text_length)+dash+str(vocab_size)+dash
model_gru = build_network_GRU(embeding_layer)
model_gru,history_gru = train_model(model_gru,train_text_padded,y_train,test_text_padded, y_test)
Building GRU network _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, 100, 100) 20148300 _________________________________________________________________ gru_1 (GRU) (None, 100) 60300 _________________________________________________________________ dropout_1 (Dropout) (None, 100) 0 _________________________________________________________________ dense_1 (Dense) (None, 32) 3232 _________________________________________________________________ dropout_2 (Dropout) (None, 32) 0 _________________________________________________________________ dense_2 (Dense) (None, 1) 33 ================================================================= Total params: 20,211,865 Trainable params: 63,565 Non-trainable params: 20,148,300 _________________________________________________________________ None Train on 16640 samples, validate on 4160 samples Epoch 1/20 16640/16640 [==============================] - 19s 1ms/step - loss: 0.6583 - acc: 0.6156 - val_loss: 0.5851 - val_acc: 0.7067 Epoch 2/20 16640/16640 [==============================] - 18s 1ms/step - loss: 0.4903 - acc: 0.7966 - val_loss: 0.3827 - val_acc: 0.8553 Epoch 3/20 16640/16640 [==============================] - 18s 1ms/step - loss: 0.3440 - acc: 0.8809 - val_loss: 0.2813 - val_acc: 0.9005 Epoch 4/20 16640/16640 [==============================] - 20s 1ms/step - loss: 0.2347 - acc: 0.9246 - val_loss: 0.1882 - val_acc: 0.9394 Epoch 5/20 16640/16640 [==============================] - 19s 1ms/step - loss: 0.1679 - acc: 0.9463 - val_loss: 0.1552 - val_acc: 0.9454 Epoch 6/20 16640/16640 [==============================] - 20s 1ms/step - loss: 0.1401 - acc: 0.9538 - val_loss: 0.1382 - val_acc: 0.9514 Epoch 7/20 16640/16640 [==============================] - 18s 1ms/step - loss: 0.1207 - acc: 0.9606 - val_loss: 0.1315 - val_acc: 0.9534 Epoch 8/20 16640/16640 [==============================] - 18s 1ms/step - loss: 0.1055 - acc: 0.9639 - val_loss: 0.1238 - val_acc: 0.9550 Epoch 9/20 16640/16640 [==============================] - 19s 1ms/step - loss: 0.1167 - acc: 0.9612 - val_loss: 0.1263 - val_acc: 0.9570 Epoch 10/20 16640/16640 [==============================] - 19s 1ms/step - loss: 0.0903 - acc: 0.9712 - val_loss: 0.1148 - val_acc: 0.9625 Epoch 11/20 16640/16640 [==============================] - 18s 1ms/step - loss: 0.0761 - acc: 0.9765 - val_loss: 0.1078 - val_acc: 0.9637 Epoch 12/20 16640/16640 [==============================] - 19s 1ms/step - loss: 0.0840 - acc: 0.9728 - val_loss: 0.1481 - val_acc: 0.9531 Epoch 13/20 16640/16640 [==============================] - 18s 1ms/step - loss: 0.0696 - acc: 0.9777 - val_loss: 0.1164 - val_acc: 0.9632 Epoch 14/20 16640/16640 [==============================] - 18s 1ms/step - loss: 0.0572 - acc: 0.9808 - val_loss: 0.0986 - val_acc: 0.9663 Epoch 15/20 16640/16640 [==============================] - 18s 1ms/step - loss: 0.0502 - acc: 0.9839 - val_loss: 0.0972 - val_acc: 0.9678 Epoch 16/20 16640/16640 [==============================] - 19s 1ms/step - loss: 0.0429 - acc: 0.9869 - val_loss: 0.1017 - val_acc: 0.9673 Epoch 17/20 16640/16640 [==============================] - 19s 1ms/step - loss: 0.0413 - acc: 0.9874 - val_loss: 0.1095 - val_acc: 0.9644 Epoch 18/20 16640/16640 [==============================] - 19s 1ms/step - loss: 0.0387 - acc: 0.9877 - val_loss: 0.1058 - val_acc: 0.9702 Epoch 19/20 16640/16640 [==============================] - 19s 1ms/step - loss: 
0.0329 - acc: 0.9907 - val_loss: 0.1030 - val_acc: 0.9697 Epoch 20/20 16640/16640 [==============================] - 19s 1ms/step - loss: 0.0277 - acc: 0.9921 - val_loss: 0.1101 - val_acc: 0.9690 Wall time: 6min 16s
model_type = 'GRU'
performance_history(history_gru,model_type,name)
store_model(model_gru,model_type,name)
model_evaluation(model_gru,test_text_padded,y_test)
performance_report(model_gru,test_text_padded,y_test)
Saved model to disk Test loss: 2.6609848763392523 / Test accuracy: 0.6434615384615384 Accuracy: 0.643462 Precision: 0.689928 Recall: 0.639287 F1 score: 0.663643
%%time
model_type = 'LSTM'
name = "Model_" + str(epochs)+dash+str(batch_size)+dash+str(max_text_length)+dash+str(vocab_size)+dash
## Build Network
model_lstm = build_network_lstm(embeding_layer)
model_lstm,history_lstm = train_model(model_lstm,train_text_padded,y_train,test_text_padded, y_test)
Building Sequential network _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding_1 (Embedding) (None, 100, 100) 20148300 _________________________________________________________________ lstm_2 (LSTM) (None, 100) 80400 _________________________________________________________________ dropout_7 (Dropout) (None, 100) 0 _________________________________________________________________ dense_7 (Dense) (None, 32) 3232 _________________________________________________________________ dropout_8 (Dropout) (None, 32) 0 _________________________________________________________________ dense_8 (Dense) (None, 1) 33 ================================================================= Total params: 20,231,965 Trainable params: 83,665 Non-trainable params: 20,148,300 _________________________________________________________________ None Train on 16640 samples, validate on 4160 samples Epoch 1/20 16640/16640 [==============================] - 27s 2ms/step - loss: 0.5902 - acc: 0.7019 - val_loss: 0.5013 - val_acc: 0.7974 Epoch 2/20 16640/16640 [==============================] - 25s 2ms/step - loss: 0.4665 - acc: 0.8170 - val_loss: 0.4329 - val_acc: 0.8276 Epoch 3/20 16640/16640 [==============================] - 27s 2ms/step - loss: 0.4832 - acc: 0.8028 - val_loss: 0.4618 - val_acc: 0.8048 Epoch 4/20 16640/16640 [==============================] - 26s 2ms/step - loss: 0.5670 - acc: 0.6848 - val_loss: 0.6313 - val_acc: 0.5673 Epoch 5/20 16640/16640 [==============================] - 27s 2ms/step - loss: 0.6347 - acc: 0.5864 - val_loss: 0.6186 - val_acc: 0.6175 Epoch 6/20 16640/16640 [==============================] - 26s 2ms/step - loss: 0.6064 - acc: 0.6509 - val_loss: 0.5043 - val_acc: 0.7726 Epoch 7/20 16640/16640 [==============================] - 29s 2ms/step - loss: 0.6012 - acc: 0.6775 - val_loss: 0.6365 - val_acc: 0.5916 Epoch 8/20 16640/16640 [==============================] - 28s 2ms/step - loss: 0.6242 - acc: 0.6132 - val_loss: 0.5846 - val_acc: 0.7245 Epoch 9/20 16640/16640 [==============================] - 29s 2ms/step - loss: 0.5495 - acc: 0.7295 - val_loss: 0.5550 - val_acc: 0.7120 Epoch 10/20 16640/16640 [==============================] - 32s 2ms/step - loss: 0.4220 - acc: 0.8230 - val_loss: 0.3964 - val_acc: 0.8356 Epoch 11/20 16640/16640 [==============================] - 30s 2ms/step - loss: 0.3239 - acc: 0.8739 - val_loss: 0.2773 - val_acc: 0.8938 Epoch 12/20 16640/16640 [==============================] - 28s 2ms/step - loss: 0.2315 - acc: 0.9144 - val_loss: 0.2043 - val_acc: 0.9204 Epoch 13/20 16640/16640 [==============================] - 29s 2ms/step - loss: 0.1811 - acc: 0.9371 - val_loss: 0.1725 - val_acc: 0.9356 Epoch 14/20 16640/16640 [==============================] - 30s 2ms/step - loss: 0.1489 - acc: 0.9487 - val_loss: 0.1612 - val_acc: 0.9387 Epoch 15/20 16640/16640 [==============================] - 35s 2ms/step - loss: 0.1322 - acc: 0.9523 - val_loss: 0.1494 - val_acc: 0.9459 Epoch 16/20 16640/16640 [==============================] - 30s 2ms/step - loss: 0.1184 - acc: 0.9577 - val_loss: 0.1549 - val_acc: 0.9452 Epoch 17/20 16640/16640 [==============================] - 31s 2ms/step - loss: 0.1105 - acc: 0.9619 - val_loss: 0.1532 - val_acc: 0.9490 Epoch 18/20 16640/16640 [==============================] - 30s 2ms/step - loss: 0.0992 - acc: 0.9650 - val_loss: 0.1544 - val_acc: 0.9478 Epoch 19/20 16640/16640 [==============================] - 30s 2ms/step - 
loss: 0.0990 - acc: 0.9655 - val_loss: 0.1359 - val_acc: 0.9524 Epoch 20/20 16640/16640 [==============================] - 28s 2ms/step - loss: 0.0842 - acc: 0.9709 - val_loss: 0.1301 - val_acc: 0.9543 Wall time: 9min 40s
performance_history(history_lstm,model_type,name)
store_model(model_lstm,model_type,name)
model_evaluation(model_lstm,test_text_padded,y_test)
performance_report(model_lstm,test_text_padded,y_test)
Saved model to disk Test loss: 1.3714693516951342 / Test accuracy: 0.645 Accuracy: 0.645000 Precision: 0.690575 Recall: 0.642782 F1 score: 0.665822
From the above experiments, GRUs train faster and perform better than LSTMs on less training data.
Reference papers: "Neural GPUs Learn Algorithms"; "Comparative Study of CNN and RNN for Natural Language Processing".
RNNs, or Recurrent Neural Networks, are specialized models for processing sequential data. They are designed to capture how current outputs are influenced by preceding inputs, allowing them, in principle, to model dependencies across time steps. In essence, they retain and utilize information from the past to better interpret the present input.
Despite their considerable power in modeling sequential data, training RNNs is notably challenging. This difficulty arises from the fact that RNNs employ Back-Propagation through Time (BPTT) during training, which updates weights not only across layers but also across previous time steps.
There are two primary issues associated with RNNs:
Difficulty Preserving Information: RNNs struggle to retain information across numerous time steps due to their short-term memory.
Vanishing Gradient Problem: because the hidden state is continually overwritten and gradients are multiplied across many time steps during BPTT, the gradients can shrink toward zero (a brief sketch follows below).
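A brief sketch of why the gradient vanishes (standard BPTT reasoning, using generic symbols: $h_t$ is the hidden state at step $t$ and $L$ is the loss):

$$
\frac{\partial L}{\partial h_k} = \frac{\partial L}{\partial h_T}\prod_{t=k+1}^{T}\frac{\partial h_t}{\partial h_{t-1}}
$$

When the norms of the Jacobians $\partial h_t / \partial h_{t-1}$ are smaller than one, this product shrinks exponentially as the gap $T-k$ grows, so early time steps receive almost no gradient signal.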
To address these challenges, Memory and Gated Mechanisms are employed, particularly in LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) models. These mechanisms effectively manage long sequences and their gradients by utilizing multiple gates and memory cells, thereby resolving the problems of preserving information over extended periods and mitigating the vanishing gradient issue.
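As an illustration of such gating, these are the standard GRU update equations (shown for intuition, not as the exact Keras implementation used above):

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh(W x_t + U (r_t \odot h_{t-1})) && \text{(candidate state)}\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)}
\end{aligned}
$$

When the update gate $z_t$ stays close to 0, the previous hidden state passes through almost unchanged, which lets information and gradients flow across many time steps.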