Category Archives: Data

Retro Game Retrieval Engine Design

I’ve got a new Shiny web app that I’ve embedded on another site where I’m doing some experimental things, and I wanted to talk generally about how I created it. The app, which can be found at the following link, lets the user interactively search for similar classic games for home consoles, from what are generally known as the third generation (NES, Sega Master System) through the sixth generation (PS2, Xbox, GameCube). Game similarity was determined from the text used to describe those games in Wikipedia articles. It was not a small project to put together, and I got distracted numerous times along the way, so it actually came together over several months.

Here’s my workflow at a high level:

Gather Game Lists/Links

Wikipedia contributors helpfully offer lists for just about every classic game system (example: Nintendo Entertainment System). Once I decided on a dozen or so systems to include, I easily had thousands of potential games. Unfortunately, there’s very little consistency in how the game tables are constructed, so gathering the links to the individual game pages required a custom Beautiful Soup script for each page with a list of games.
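To give a flavor of that scraping step, here’s a minimal sketch of one such script. The table class, column layout, and function name are my assumptions; in practice each console’s list page needed its own variant of this:

```python
from bs4 import BeautifulSoup

def game_links(html):
    """Collect links to individual game pages from a Wikipedia list page.

    Assumes the games live in 'wikitable' tables with the title (and its
    link) in the first column -- the part that varied page to page.
    """
    links = []
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table", class_="wikitable"):
        for row in table.find_all("tr"):
            cell = row.find(["td", "th"])  # first column holds the title
            if cell is None:
                continue
            a = cell.find("a", href=True)
            if a and a["href"].startswith("/wiki/"):
                links.append("https://en.wikipedia.org" + a["href"])
    return links

# Usage (fetching left out; pass in the HTML of a list page):
# import requests
# html = requests.get("https://en.wikipedia.org/wiki/"
#                     "List_of_Nintendo_Entertainment_System_games").text
# print(game_links(html)[:5])
```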

Retrieve Game Data

Once the individual game pages had been determined, I needed the data from each page. Luckily, this turned out to be a bit easier: rather than continuing with Beautiful Soup, I found the Wikipedia API to be a simpler alternative. While relying on page text wasn’t a perfect solution (some games in a series share a single Wikipedia page, while others point to the movie they are derived from), the vast majority of games do have a standalone page that tells the story of the game, as well as something about its development and history.
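For reference, the MediaWiki `extracts` endpoint is one way to pull plain article text; a sketch along those lines, with the function names being my own:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def extract_from_response(data):
    """Pull the plain-text extract out of an API query response.

    The API keys results by internal page id, so take the first page entry.
    """
    pages = data["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def page_text(title):
    """Fetch the plain-text body of one Wikipedia article by title."""
    resp = requests.get(API, params={
        "action": "query",
        "prop": "extracts",     # TextExtracts: article body without markup
        "explaintext": "1",
        "format": "json",
        "titles": title,
    }, headers={"User-Agent": "retro-game-search-sketch"})
    return extract_from_response(resp.json())

# Usage:
# print(page_text("Castlevania (1986 video game)")[:200])
```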

Create Word Embeddings

I used the Python module Gensim to create the word embeddings that I’d later use to rank the game similarity. I’ve been using Gensim for about a year now, but this still required some experimentation. I considered the following possibilities:

  1. A pretrained Word2Vec model with average word vectors
  2. A newly trained Word2Vec model with average word vectors
  3. A newly trained Doc2Vec model

Ultimately, I went with option 3. I have found Doc2Vec models difficult to train; my searches online suggest I’m not alone in this. However, during the course of this project, I found a paper on Doc2Vec that changed my approach and results. This topic could be an entire blog post, but the TLDR version is that the authors explain that it is critical to crank up the number of training epochs with Doc2Vec, typically into the hundreds. To understand this better, I suggest giving their paper a read or perusing their repo on this topic. In particular, this little snippet from their file is gold as a starting point in the search for good Doc2Vec hyperparameters:

With only a few tweaks to these parameters to parallelize on more cores and take the number of epochs a bit higher (300), and about 12 hours of waiting, I had a model that was giving intuitive results.

Move to R/Shiny

Up to this point, everything I’d done had been in Python. However, I wanted a Shiny app to share my results. To make this work, I saved all of the document vectors I’d created to a CSV. In addition to creating the Shiny app itself, I needed an efficient way to do the cosine similarity calculation. Gensim offers that capability, but I was giving it up by moving to R. I needed to vectorize the calculation, and while I’ve gotten pretty comfortable with those sorts of broadcast calculations in NumPy in recent months, I hadn’t had a need for vectorized array operations in R. One solution turns out to be R’s “sweep” function.
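Since the R code isn’t shown here, a quick sketch of the same calculation in NumPy broadcasting terms, which sweep() mirrors on the R side; the function name and toy vectors are mine:

```python
import numpy as np

def rank_similar(vectors, idx):
    """Cosine similarity of one game's vector against all games, vectorized.

    `vectors` is the (n_games, n_dims) array loaded from the saved CSV;
    broadcasting replaces an explicit loop over rows.
    """
    target = vectors[idx]                     # (n_dims,)
    norms = np.linalg.norm(vectors, axis=1)   # (n_games,)
    sims = vectors @ target / (norms * np.linalg.norm(target))
    return np.argsort(-sims)                  # best match first (itself)

# Toy example: game 1 points nearly the same direction as game 0.
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(rank_similar(vecs, 0))  # → [0 1 2]
```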

Ultimately, the Shiny app itself is fairly simple. I hadn’t built one in a while, and it gave me a chance to use a few features that are relatively recent additions to Shiny. I built with the shinydashboard library, which offers a great way to responsively lay out the widgets within your Shiny app so that it feels more “dashboard-like.” I also added a few responsive info boxes, and I was pleased to see that RStudio has integrated a ton of icons to choose from, including a gaming icon that was appropriate for this app.

So if you’re curious, check out the results in the Shiny app. Not every result is completely intuitive, but most top results are, and some are very cool, like when Doc2Vec picks up on the fact that Castlevania and Van Helsing are both vampire hunter games.

Machine Learning Specialization Cut Short by Coursera

After an extremely long wait, today was the day that the fifth course in Coursera’s Machine Learning Specialization was set to begin. I’ve been with this specialization since it launched in the fall of 2015. Students were initially promised an ambitious slate of six courses, including a capstone that would wrap up by early summer of… Continue Reading

Minivan Price Comparison With R

With my family growing once again and my 13-year-old Mazda Protegé on the fritz, I recently decided it was time to go minivan shopping. A frugal shopper, some might say cheap, I quickly set my focus on the used, domestic market and found that there are only two competitors here, the Dodge Grand Caravan and the… Continue Reading

University of Washington Machine Learning Classification Review

I’ve spent the last couple of months working through course three in the University of Washington’s Machine Learning Specialization on Coursera. Course two was regression (review); the topic of the third course is classification. As has been the case with previous courses, this specialization continues to be taught by Carlos Guestrin and Emily Fox. For… Continue Reading

Graphing Calculator Price Dashboard

These interactive plots show the prices on Amazon for popular Texas Instruments calculators such as the TI-Nspire CX (review) and TI-84 Plus CE (review) as well as non-TI models like the Casio Prizm (review) and HP Prime (review). The graphs show the last 7 days, and they update every hour, day and night, so check… Continue Reading

Coursera Review–Machine Learning: Regression

I’ve recently completed the second course in the University of Washington Machine Learning Specialization on Coursera, “Machine Learning: Regression.” This comes on the heels of completing course 1, Machine Learning Foundations: A Case Study Approach. This course debuted right at the end of November and wrapped up 6 weeks later (my impression is that these… Continue Reading

Constructing a Social Graph With Twitter and Plotly

In a couple of earlier posts, I showed an example of a social graph created from Twitter data and Plotly, a graph of relationships between educational technology enthusiasts on Twitter. Those posts were more for the educator audience that I write for, but increasingly, I’m getting feedback on my posts from other data scientists, so… Continue Reading