I’ve got a new Shiny web app, embedded on another site where I’m doing some experimental things, and I wanted to talk generally about how I created it. The web app can be found at the following link; it lets the user search interactively for similar classic games for home consoles, from what is generally known as the third generation (NES, Sega Master System) through the sixth generation (Wii, PS2, Xbox). The similarity of games was determined from the text used to describe those games in their Wikipedia articles. It was not a small project, and I got distracted numerous times along the way, so it actually came together over several months.
Here’s my workflow at a high level:
Gather Game Lists/Links
Wikipedia contributors helpfully offer lists for just about every classic game system (example: Nintendo Entertainment System). Once I decided on a dozen or so systems to include, I easily had thousands of potential games. Unfortunately, there’s very little consistency in how the game tables are constructed. This meant that gathering the links to individual game pages required a custom Beautiful Soup script for each list page.
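As a rough illustration of what one of those per-page scripts might look like, here is a minimal sketch. The HTML fragment, the game names, and the `extract_game_links` helper are all made up for the example; real list pages differ in table class, column order, and markup, which is exactly why each page needed its own script.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for one system's list page.
# Real Wikipedia tables vary a lot from page to page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Title</th><th>Developer</th></tr>
  <tr><td><i><a href="/wiki/Game_A">Game A</a></i></td><td>Studio One</td></tr>
  <tr><td><i>Unlinked Game</i></td><td>Studio Two</td></tr>
  <tr><td><i><a href="/wiki/Game_B">Game B</a></i></td>
      <td><a href="/wiki/Studio_Three">Studio Three</a></td></tr>
</table>
"""


def extract_game_links(html, title_column=0):
    """Return (title, href) pairs from the title column of each wikitable row."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for row in soup.select("table.wikitable tr"):
        cells = row.find_all("td")
        if len(cells) <= title_column:
            continue  # header rows contain only <th> cells
        anchor = cells[title_column].find("a", href=True)
        # Keep only internal article links; games without a page are skipped.
        if anchor is not None and anchor["href"].startswith("/wiki/"):
            links.append((anchor.get_text(strip=True), anchor["href"]))
    return links


if __name__ == "__main__":
    for title, href in extract_game_links(SAMPLE_HTML):
        print(title, href)
```

The `title_column` argument is the kind of knob that would change from one list page to the next.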
Retrieve Game Data
Once I had identified the individual game pages, I needed the text of each page. Luckily, this turned out to be a bit easier. Rather than using Beautiful Soup, I found the Wikipedia API to be a simpler alternative. While using on-page text wasn’t a perfect solution (some games in a series share the same Wikipedia page, while other games point to the movie they are derived from), the vast majority of games do have a standalone page that tells the story of the game, as well as something about its development and history.
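For readers who want to try this step, here is a minimal sketch using only the standard library. It assumes the MediaWiki action API with the `extracts` property for plain-text page content; I am not claiming this is the exact client or set of parameters used for the project, and the helper names and User-Agent string are placeholders.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a query URL asking for the plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "redirects": 1,    # follow redirects to the canonical page
        "format": "json",
        "titles": title,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_extract(payload):
    """Pull the extract text out of a decoded API response, or return ''."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        return page.get("extract", "")
    return ""


def fetch_page_text(title):
    """Fetch one page's text (makes a network request)."""
    request = urllib.request.Request(
        build_extract_url(title),
        headers={"User-Agent": "game-similarity-sketch/0.1 (placeholder)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_extract(json.load(response))
```

Calling `fetch_page_text("Castlevania")` would return whatever text that title resolves to, which is where the shared-page and movie-page caveats above come into play.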
Create Word Embeddings
I used the Python module Gensim to create the word embeddings that I’d later use to rank the game similarity. I’ve been using Gensim for about a year now, but this still required some experimentation. I considered the following possibilities:
- A pretrained Word2Vec model with average word vectors
- A newly trained Word2Vec model with average word vectors
- A newly trained Doc2Vec model
Ultimately, I went with the third option. I have found Doc2Vec models difficult to train, and my searches online suggest I’m not alone in this. However, during the course of this project, I found a paper on Doc2Vec that changed my approach and results. This topic could be an entire blog post, but the TL;DR version is that the authors explain that it is critical to crank up the number of training epochs with Doc2Vec, typically into the hundreds. To understand this better, I suggest giving their paper a read or perusing their repo on this topic. In particular, the hyperparameter settings in their train.py file are gold as a starting point for your own Doc2Vec tuning.
With only a few tweaks to these parameters to parallelize on more cores and take the number of epochs a bit higher (300), and about 12 hours of waiting, I had a model that was giving intuitive results.
Move to R/Shiny
Up to this point, everything I’d done had been in Python. However, I wanted a Shiny app to share my results. To make this work, I saved all of the document vectors I’d created to a CSV. In addition to creating the Shiny app itself, I needed an efficient way to do the cosine similarity calculation. Gensim offers that capability, but I was giving that up by moving to R. I needed to vectorize that calculation, and while I’ve gotten pretty comfortable with those sorts of broadcast calculations in NumPy in recent months, I hadn’t needed to do vectorized array operations in R. One solution turns out to be R’s “sweep” function.
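For comparison, here is a sketch of the same vectorized calculation in NumPy, the broadcast style mentioned above. The `rank_similar` helper and the tiny example matrix are hypothetical. The row-wise division by the norms is the step that, as far as I can tell, `sweep` handles on the R side (something like `sweep(m, 1, norms, "/")`).

```python
import numpy as np


def rank_similar(doc_vectors, query_index, top_n=5):
    """Rank rows of doc_vectors by cosine similarity to the row at query_index."""
    # One norm per row; keepdims gives shape (n, 1) so the division broadcasts.
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / norms
    # For unit vectors, the dot product equals the cosine similarity.
    scores = unit @ unit[query_index]
    order = np.argsort(-scores)
    order = order[order != query_index][:top_n]  # drop the query itself
    return [(int(i), float(scores[i])) for i in order]


if __name__ == "__main__":
    # Made-up 2-D vectors; the real ones were the saved Doc2Vec vectors.
    vectors = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
    print(rank_similar(vectors, query_index=0, top_n=2))
```

Normalizing every row once up front means each query afterwards is a single matrix-vector product, which is what keeps an interactive app responsive.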
Ultimately, the Shiny app itself is fairly simple. I haven’t built one in a while now, and it gave me a chance to use a few features that are more recent additions to Shiny. I built it with the shinydashboard library, which offers a great way to responsively lay out the widgets within your Shiny app so that it feels more “dashboard-like.” I also added a few responsive info boxes, and I was pleased to see that RStudio has integrated a ton of icons to choose from, including a gaming icon that was appropriate for this app.
So if you’re curious, check out the results in the Shiny app. Not every result is completely intuitive, but most top results are, and some are very cool, like when Doc2Vec picks up on the fact that Castlevania and Van Helsing are both vampire hunter games.