A New Introduction to Spark 2.1 Dataframes with Python and MLlib

A couple of years ago, when I was in the midst of my rookie year as a data scientist, I wrote a blog post and tutorial about using the Python Spark API to build a simple model from housing data with Spark dataframes. Despite the simple nature of the model (a straight train-test split with multivariate linear regression), it was one of the more challenging tutorials I’ve ever written for this blog. At the time, Spark 1.3 was difficult to set up, difficult to use, and there were few good tutorials out there. Probably partly as a result of others encountering the same difficulties, the post has gotten thousands of views over the last couple of years and generated a number of email discussions. I followed up with a minor update to Spark 1.4, but it has sat there ever since, getting more and more outdated.

I recently decided to update the post using to Spark 2.1 and Databricks Community Edition, a free service from Databricks, which I strongly recommend for users new to Spark. Databricks will allow you to set up a tiny Spark cluster in the cloud which will allow you to begin to learn Spark syntax in any of the core Spark languages (Scala, Java, R, or Python).

As a consequence of Spark maturing so much over the last couple of years and of using Databricks, this tutorial went from being very painful to create to something I could update pretty easily in a couple of hours. In fact, so much syntax was stripped out from the old version to this one that it now seems like a pretty simplistic tutorial. That said, I wanted to create something very close to the original for anyone interesting in seeing how much things have changed. There’s also something to be said for having a simple example to work from when you are getting started with a new framework. I have included a few suggestions at the end of this post for next steps for Spark learners. If you’d like to see the GitHub repo associated with this notebook, you can find it here.


Related articles:

Lucas Allen

Lucas Allen

For more than a decade, Lucas Allen was a high school math teacher and math team coach in Illinois. His 2012 Morton High School math team won the Illinois state championship. Recently, he made the jump from public education to the corporate world and is now working as a data scientist. He is interested in just about all forms of technology, including the TI-Nspire, Nexus devices, R, MOOCs, and more. You can follow , and if you are nice, he will probably follow you back.

More Posts - Website

Follow Me:
TwitterFacebookGoogle Plus

Sorry, comments are closed for this post.