This project was completed as part of the Metis data science bootcamp in March 2021. This blog post walks through how I combined web scraping, NLP, and recommender systems in Python to create an educational web app for beginners in stock investing.
Motivation
When you are just getting into a new field, it is natural to seek education on that subject. For example, if you want to learn data science, you take a bootcamp like Metis and practice coding on Leetcode. If you want to get up to speed on home improvement, you watch DIY videos. When you invest in a property like a house, you consult a real estate broker, but more importantly, you do on-site due diligence to see for yourself that there are no unpleasant surprises.
But when it comes to stock investing, many people simply refuse to become educated: they risk their savings on hot tips from a friend and obsess over daily stock price movements without actually learning.
So I asked myself: how can I help beginners get properly educated on stock investing? Books are the obvious first stop, but many people nowadays don't want to pore over hundreds of pages of text (sadly). How about classes or courses? They are structured and interactive, but they usually cost money and take time. You could also read blogs or articles on investing, but they have no structure, so you don't know where to start.
As a solution, I decided to build a YouTube video recommender that lets the user choose different investment topics to study, and recommends high quality videos with examples and case studies for better learning.
Collecting and preprocessing video text data
I used Selenium to run different search queries and scrape the following text data from about 3,000 unique YouTube videos on stock investing:
- Video ID
- Title
- Upload date and duration
- Views and number of likes
- Description
- Transcript
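As a side note, scraped fields come back as raw strings, so a small normalization pass is needed before any analysis. Below is a minimal, dependency-free sketch of that step; the formats assumed here ('MM:SS' durations, '1,234,567 views' counts) are illustrative stand-ins, not necessarily the exact strings the scraper returned.

```python
## Illustrative helpers for turning scraped strings into numbers.
## The input formats are assumptions for this sketch.

def parse_duration(duration: str) -> int:
    """Convert an 'HH:MM:SS' or 'MM:SS' duration string to seconds."""
    parts = [int(p) for p in duration.split(':')]
    seconds = 0
    for p in parts:
        seconds = seconds * 60 + p
    return seconds

def parse_views(views: str) -> int:
    """Convert a view-count string like '1,234,567 views' to an integer."""
    digits = ''.join(ch for ch in views if ch.isdigit())
    return int(digits) if digits else 0

print(parse_duration('12:05'))         # 725
print(parse_views('1,234,567 views'))  # 1234567
```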
Then, I performed text preprocessing on the video titles and transcripts using spaCy and NLTK:
- Removing stopwords
- Lemmatization
- POS tagging and named entity recognition (e.g., company tickers)
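To give a feel for what this pipeline does, here is a simplified, dependency-free sketch; the tiny stopword set and the crude suffix-stripping "lemmatizer" are stand-ins for the real spaCy/NLTK components, which are far more sophisticated.

```python
## A toy version of the cleaning pipeline. The stopword list and
## lemmatizer below are illustrative stand-ins only.

STOPWORDS = {'the', 'a', 'an', 'is', 'are', 'to', 'of', 'and', 'in'}

def naive_lemmatize(token: str) -> str:
    """Crude stand-in for a real lemmatizer: strip a plural 's'."""
    if token.endswith('s') and len(token) > 3:
        return token[:-1]
    return token

def preprocess(text: str) -> list:
    ## Lowercase, drop stopwords, then lemmatize what remains
    tokens = [t.lower() for t in text.split()]
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_lemmatize(t) for t in tokens]

print(preprocess('The stocks of an airline are falling'))  # ['stock', 'airline', 'falling']
```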
Filtering out videos with no educational value
The main challenge of my project lay in filtering out videos with no educational value. Videos on irrelevant topics such as cryptocurrency were easy to remove using topic modeling. Videos on stock-related topics that provide little educational value, such as news and short-term trading, were identified to a certain extent using Word2Vec embeddings.
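To illustrate the kind of topic-model filter involved, here is a minimal sketch using scikit-learn's NMF on a toy corpus; the four mini-transcripts and the crypto/stock split are made-up stand-ins, not project data, and the actual model and vocabulary were far larger.

```python
## A toy topic-model filter: fit NMF on a tiny corpus and drop documents
## whose dominant topic is the crypto-flavored one. Corpus is illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    'bitcoin crypto blockchain coin wallet',
    'crypto coin bitcoin mining blockchain',
    'stock valuation earnings balance sheet',
    'earnings stock dividend balance valuation',
]

vec = TfidfVectorizer()
dtm = vec.fit_transform(docs)
nmf = NMF(n_components=2, random_state=0)
doc_topics = nmf.fit_transform(dtm)

## Assign each document its dominant topic; documents landing in the
## crypto-flavored topic would be removed from the corpus.
dominant = doc_topics.argmax(axis=1)
print(dominant)
```

The first two (crypto) documents share one dominant topic and the last two (stock) documents share the other, which is exactly the separation the filter exploits.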
Word2Vec, combined with sentiment analysis, was also helpful in weeding out promotional or misleading videos, like those shown in the last row of the table below. As an avid reader and listener of Warren Buffett's books and interviews, I know it is highly unlikely that he ever predicted any stock would make 10x your money, let alone a penny stock. So I clearly do not want a beginner to watch these types of videos and start developing unrealistic investment goals.
Example — Using Word2Vec and sentiment analysis to filter out videos on short-term trading
As an example, I am going to show how Word2Vec embeddings and sentiment analysis were used to filter out videos on short-term trading. As a basic rule, short-term trading based on price movements is NOT investing, but many have the misconception that it is. So I want to remove all videos that promote this concept. At the same time, I don't want to remove videos that merely mention short-term trading, since they may be telling viewers not to follow this approach.
Here is where Word2Vec comes in. I created the t-SNE plot below, which shows embeddings of words that appear in similar contexts. The red circle shows that all the words related to 'trading' cluster together.
import matplotlib.pyplot as plt
import numpy as np
from gensim.models import word2vec
from sklearn.manifold import TSNE

## Grab each word from the (word, POS) tuples and put them into a list
word_list = [[word[0] for word in doc] for doc in df['Transcript']]

## Initialize a Word2Vec model and set parameters
model = word2vec.Word2Vec(word_list, min_count=900, window=5, ns_exponent=-10)

labels = []
tokens = []
for word in model.wv.vocab:
    tokens.append(model.wv[word])
    labels.append(word)

## Perform dimensionality reduction using TSNE
tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500,
                  random_state=23)
new_values = tsne_model.fit_transform(np.array(tokens))

x = []
y = []
for value in new_values:
    x.append(value[0])
    y.append(value[1])

## Visualize the Word2Vec embeddings using matplotlib
plt.figure(figsize=(16, 10))
for i in range(len(x)):
    plt.scatter(x[i], y[i])
    plt.annotate(labels[i], xy=(x[i], y[i]), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')
It is hard to see on the chart, so if we run Word2Vec's most_similar method, which uses cosine similarity, it gives us words like break, level, and support, all terms a YouTuber teaching short-term trading would use.
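For readers unfamiliar with most_similar: it simply ranks every other word in the vocabulary by the cosine similarity of its vector to the query word's vector. Here is a minimal numpy sketch of that ranking, using made-up 3-dimensional "embeddings" for illustration:

```python
## Toy re-implementation of most_similar-style ranking by cosine
## similarity. The 3-d vectors below are fabricated for illustration.
import numpy as np

embeddings = {
    'trading':  np.array([0.9, 0.1, 0.0]),
    'break':    np.array([0.8, 0.2, 0.1]),
    'support':  np.array([0.85, 0.15, 0.05]),
    'dividend': np.array([0.1, 0.9, 0.3]),
}

def most_similar(word, topn=2):
    """Rank all other words by cosine similarity to the query word."""
    target = embeddings[word]
    scores = {}
    for other, vec in embeddings.items():
        if other == word:
            continue
        scores[other] = float(np.dot(target, vec) /
                              (np.linalg.norm(target) * np.linalg.norm(vec)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

print(most_similar('trading'))
```

With these toy vectors, 'support' and 'break' rank closest to 'trading' while 'dividend' falls far behind, mirroring the clustering seen in the t-SNE plot.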
I used some of these words as a mask to identify videos that attempt to teach viewers short-term trading as an investment approach. I then used VADER sentiment analysis to find videos with very high positive sentiment (scores above 0.9, with 1 being the most positive), because I believed those were more likely to be promotional.
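Conceptually, the filter combines the two signals like this. This is only a sketch: the keyword set is a small illustrative subset, and the compound scores shown are stand-ins for the real scores that VADER would compute from each transcript.

```python
## Sketch of the mask-then-threshold filter. TRADING_TERMS and the
## compound scores passed in below are illustrative assumptions.

TRADING_TERMS = {'break', 'level', 'support', 'resistance'}

def is_flagged(transcript_tokens, compound_score, threshold=0.9):
    """Flag a video that both uses short-term-trading vocabulary and
    carries very positive (likely promotional) sentiment."""
    mentions_trading = bool(TRADING_TERMS & set(transcript_tokens))
    return mentions_trading and compound_score > threshold

print(is_flagged(['price', 'break', 'support', 'buy'], 0.95))  # True
print(is_flagged(['dividend', 'moat', 'valuation'], 0.95))     # False
```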
If we look at the results table below, the first two videos are strictly about short-term trading, even though the second video's title mentions long-term investing, which is misleading. The fourth and fifth videos, as you can see, are promotional videos that exude very positive sentiment but, again, are highly misleading.
As for the third video, the title suggests it covers the fundamental aspects of stock investing, like competition and economic moats, which are exactly the things we want to learn about. But the transcript reveals that the speaker spends most of the video teaching viewers how to study past patterns in short-term price movements to make their next move! Capturing these bad videos vastly improved my final topic model and recommender system.
Final topic model
My topic model ultimately consisted of nine investment topics, ordered in a logical sequence so that you learn the basics of different investment styles before delving into the tools used in analyzing stocks.
Creating a two-step content-based recommender
In creating my recommender system, I faced a major cold start problem: a new user is assumed to have no previous interaction with YouTube videos on stock investing. As a solution, I created a two-step content-based recommender: an initial recommender that lets a user gauge their topic of interest and desired video style, and a follow-up recommender that makes recommendations based on the user's liked videos. Based on my own domain knowledge, using cosine similarity on a CountVectorizer document-term matrix gave the best results, especially for cross-topic recommendations.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def follow_up_recommender(liked_video_list, df_videos_filtered):
    '''
    Input: A list of the title(s) of a user's liked video(s), and a cleaned
    dataframe of all videos put into the initial recommender
    Output: Top five follow-up recommendations using a content-based
    recommender system
    '''
    ## Fit and transform the transcripts into a document-term matrix
    word_list = [doc for doc in df_videos_filtered['Transcript']]
    vec = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
    matrix = vec.fit_transform(word_list).toarray()

    ## Generate a similarity matrix
    similarity_matrix = cosine_similarity(matrix, matrix)

    ## Create a series of titles for the videos (rows)
    df_videos_filtered = df_videos_filtered.reset_index(drop=True)
    indices = pd.Series(df_videos_filtered['Title'])

    ## Get the indices of the user's liked video(s)
    idx_list = []
    for liked_video in liked_video_list:
        idx_list.append(indices[indices == liked_video].index[0])

    ## Create a dataframe of the similarity matrix, keeping only the
    ## columns for the liked videos
    scores_list = []
    for idx in idx_list:
        scores_list.append(similarity_matrix[idx])
    scores_df = pd.DataFrame(scores_list).T
    scores_df.columns = idx_list

    ## Drop videos that were in the initial recommendation
    scores_df.drop([0, 1, 2, 3, 4], inplace=True)

    ## Calculate the mean cosine similarity score for each video
    mean_score_series = scores_df.mean(axis='columns').sort_values(ascending=False)

    ## Get the indices of the five highest scores
    similarity_indices = list(mean_score_series.index)
    top_5_indices = similarity_indices[:5]

    ## Populate a dataframe of the recommended videos
    df_videos_follow_up_recs = df_videos_filtered.iloc[top_5_indices]
    return df_videos_follow_up_recs[['Video_ID', 'Title']]
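To make the scoring step concrete, here is a toy walk-through of the same logic: take the mean cosine similarity of every candidate against the liked videos and rank highest first. The 4x3 count matrix is made up for illustration.

```python
## Toy demonstration of the follow-up scoring: rows are videos, columns
## are vocabulary counts. Video 0 is the (hypothetical) liked video.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

matrix = np.array([
    [2, 0, 1],   # video 0 (liked)
    [2, 1, 1],   # video 1
    [0, 3, 0],   # video 2
    [1, 0, 1],   # video 3
])

similarity_matrix = cosine_similarity(matrix, matrix)
liked = [0]

## Mean similarity to the liked videos, ranked high to low,
## excluding the liked videos themselves
mean_scores = similarity_matrix[liked].mean(axis=0)
candidates = [int(i) for i in np.argsort(mean_scores)[::-1] if i not in liked]
print(candidates)  # [3, 1, 2]
```

Video 3 shares video 0's word profile most closely, so it ranks first, while video 2 has no overlap at all and ranks last.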
Finally, I put my recommender inside a Streamlit app and deployed it publicly via Heroku. You can visit here to try out my app!
For more information on this project, including code and presentation slides, please check out my GitHub repository here.
If you would like to share ideas about any of my projects, please do not hesitate to contact me on LinkedIn.