As part of my efforts to learn in public early in my data science journey, I wrote this article about an end-to-end analysis of a dataset of news headlines. (Apologies, I can no longer find the original dataset, but I got it from the UCI ML Repository.)
The article includes:
- Preprocessing/cleaning the text data, using NLTK
- Using word2vec to create word and title embeddings, then visualizing them as clusters using t-SNE
- Visualizing the relationship between title sentiment and article popularity
- Attempting to predict article popularity from the embeddings and other available features, using XGBoost (gradient-boosted trees)
- Using model stacking (ensembling) to improve the performance of the popularity model (this step was not successful, but was still a valuable experiment!)
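To give a flavor of the "title embeddings" step, here is a minimal sketch of one common approach: averaging the word vectors of the tokens in a title. This is an assumption for illustration, not necessarily the exact method used in the article. The `TOY_VECTORS` dictionary, the `STOPWORDS` set, and the helper names are all hypothetical stand-ins; in practice the vectors would come from a trained word2vec model and the stopword list from NLTK.

```python
import re

# Hypothetical toy word vectors standing in for a trained word2vec model.
TOY_VECTORS = {
    "stocks": [0.9, 0.1, 0.0],
    "rally": [0.7, 0.3, 0.0],
    "markets": [0.8, 0.2, 0.1],
    "storm": [0.0, 0.2, 0.9],
}

# Hypothetical, tiny stopword list (a real pipeline might use NLTK's list).
STOPWORDS = {"the", "a", "an", "as", "in", "on", "of", "to"}


def tokenize(title):
    """Lowercase the title, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [t for t in tokens if t not in STOPWORDS]


def title_embedding(title, vectors, dim=3):
    """Average the vectors of in-vocabulary tokens.

    Returns a zero vector of length `dim` if no token is in the vocabulary.
    """
    known = [vectors[t] for t in tokenize(title) if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(column) / len(known) for column in zip(*known)]


# "open" is out of vocabulary here, so only three words contribute.
embedding = title_embedding("Stocks rally as the markets open", TOY_VECTORS)
print(embedding)
```

Averaging ignores word order, but it yields a fixed-length vector per title, which is convenient both for t-SNE and as input features for a tree-based model.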
The full text of the article (with code snippets and a link to the Jupyter Notebook) is here.