A couple of weeks ago, I got my first experience with Apache Spark. While I have yet to apply it to any meaningful problems, in my experience with a new tool or technology, just getting one’s feet wet can be crucial to getting a learning snowball rolling. Although Spark is primarily used for “big data” problems on clusters, I have been experimenting with a very “small data” problem: a simple linear regression on California home prices. You can find the data set here. I’ve decided to put the resulting tutorial up on Tech Powered Math. Although there is nothing earth-shattering in this post, I think some people will find it helpful for the following reasons.
- The method I used for working with the data is dataframes. Dataframes are a relatively new paradigm in Spark, available only since Spark 1.3, released in February 2015.
- I am using the Python API. While I suspect that PySpark is going to grow rapidly in popularity, there seem to be more resources for Scala at this time.
- I could find very few tutorials, or even substantial Q&A threads on Stack Overflow, about PySpark syntax and dataframes. That gives me reason to believe that even this simple tutorial, covering reading a CSV into Spark, doing some trivial data wrangling with dataframes, and performing a linear regression, could be helpful to some readers.
The code in the Jupyter Spark notebook below was completed with Spark 1.3.1.
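As a warm-up before the Spark notebook, it may help to recall what a one-variable least-squares fit actually computes. The following plain-Python sketch has nothing to do with Spark itself and uses generic `xs`/`ys` lists rather than the actual housing columns, but it shows the closed-form slope and intercept that any linear regression tool, PySpark included, is ultimately estimating.

```python
def linear_fit(xs, ys):
    """Closed-form ordinary least squares for a single predictor.

    slope = cov(x, y) / var(x)
    intercept = mean(y) - slope * mean(x)
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator and denominator of the slope estimate
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1 recover slope 2 and intercept 1
slope, intercept = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])
```

Spark’s distributed implementation arrives at the same kind of fit; the difference is that it computes these statistics across a cluster instead of in a single Python process.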