A New Introduction to Spark 2.1 Dataframes with Python and MLlib

By Lucas | May 3, 2017

A couple of years ago, when I was in the midst of my rookie year as a data scientist, I wrote a blog post and tutorial about using the Python Spark API to build a simple model from housing data with Spark dataframes. Despite the simple nature of the model (a straight train-test split with multivariate linear regression), it was one of the more challenging tutorials I’ve ever written for this blog. At the time, Spark 1.3 was difficult to set up and difficult to use, and there were few good tutorials out there. Probably partly because others ran into the same difficulties, the post has gotten thousands of views over the last couple of years and generated a number of email discussions. I followed up with a minor update to Spark 1.4, but the post has sat there ever since, getting more and more outdated.

I recently decided to update the post to use Spark 2.1 and Databricks Community Edition, a free service from Databricks that I strongly recommend for users new to Spark. Databricks lets you set up a tiny Spark cluster in the cloud, so you can begin learning Spark syntax in any of the core Spark languages (Scala, Java, R, or Python).

As a consequence of Spark maturing so much over the last couple of years and of using Databricks, this tutorial went from being very painful to create to something I could update pretty easily in a couple of hours. In fact, so much syntax was stripped out between the old version and this one that it now seems like a pretty simplistic tutorial. That said, I wanted to create something very close to the original for anyone interested in seeing how much things have changed. There’s also something to be said for having a simple example to work from when you are getting started with a new framework. I have included a few suggestions at the end of this post for next steps for Spark learners. If you’d like to see the GitHub repo associated with this notebook, you can find it here.