Using word2vec to Analyze News Headlines and Predict Article Success

As part of my efforts to learn in public early in my data science journey, I wrote this article about an end-to-end analysis of a dataset of news headlines. (Apologies, I can no longer find the original dataset, but it came from the UCI ML Repository.)

The article includes:

  • Preprocessing and cleaning the text data with NLTK
  • Using word2vec to create word and title embeddings, then visualizing them as clusters using t-SNE
  • Visualizing the relationship between title sentiment and article popularity
  • Attempting to predict article popularity from the embeddings and other available features, using XGBoost (gradient-boosted trees)
  • Using model stacking (ensembling) to improve the performance of the popularity model (this step was not successful, but was still a valuable experiment!)
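
To give a flavor of the title-embedding step, here is a minimal sketch. It assumes a title embedding is simply the average of the word vectors for the words in the title, which is a common approach but not necessarily the exact one used in the article. The toy vectors, the `tokenize` helper, and the `title_embedding` function are hypothetical stand-ins; in practice the vectors would come from a trained word2vec model, and the tokenization from NLTK.

```python
import re

# Hypothetical 3-dimensional word vectors, standing in for a trained word2vec model.
TOY_VECTORS = {
    "stocks": [0.9, 0.1, 0.0],
    "rally": [0.7, 0.3, 0.1],
    "markets": [0.8, 0.2, 0.0],
    "election": [0.1, 0.9, 0.2],
}


def tokenize(title):
    """Lowercase the title and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", title.lower())


def title_embedding(title, vectors, dim):
    """Average the vectors of all in-vocabulary words in the title.

    Words missing from `vectors` are skipped. If no word is known,
    a zero vector of length `dim` is returned.
    """
    known = [tok for tok in tokenize(title) if tok in vectors]
    if not known:
        return [0.0] * dim
    sums = [0.0] * dim
    for tok in known:
        for i, value in enumerate(vectors[tok]):
            sums[i] += value
    return [total / len(known) for total in sums]


# "as" and "open" are out of vocabulary, so only three words contribute.
embedding = title_embedding("Stocks rally as markets open", TOY_VECTORS, dim=3)
print(embedding)
```

Averaging discards word order, but it yields a fixed-length vector per title, which is what a tabular model such as XGBoost needs as input.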

The full text of the article (with code snippets and a link to the Jupyter Notebook) is here.