Category Archives: Featured

Databricks Review




Not too long ago, I did my first post on Apache Spark, a Spark dataframes tutorial. I’ve continued to experiment with Spark since taking my first tentative steps with it just a few months ago. One of the challenges with Spark is that it has a reputation for being difficult to deploy at scale. Stepping in to try to solve that problem is Databricks. Databricks offers the ability for corporations to deploy an optimized Spark via the cloud with some very nice extra bells and whistles.

I took advantage of the Databricks 1-month trial to see how I liked their product, and I came away impressed in my brief time with it. Databricks was founded by the team that created Spark and contributed 75% of the code base to Spark in the last year, according to their website. Consequently, my primary hope was that it would be simple to get up and running with Spark at peak efficiency against a Hadoop cluster on Databricks, so that I could spend my time manipulating data and not fiddling with Spark configuration.


I certainly found that to be the case. When signing up for a Databricks account, I provided access to my Amazon Web Services (AWS) account. This gives Databricks the ability to have Spark access data in any S3 buckets I have created. After logging into Databricks, it’s a matter of a few clicks to have Databricks create a Hadoop cluster for Spark to work with. This gets billed to your AWS account via EC2. If you are doing short-term work that can risk being interrupted, you can use “spot instances” to keep the costs down. I was able to work on a 9-node cluster with 30 GB of memory and 4 cores per node for a couple of hours for about $1, which is the cost-efficient magic of cloud computing, something I’m still pretty new to.
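To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python. It is only a sketch: treating “a couple of hours” as exactly 2 hours and “about $1” as exactly $1.00 are assumptions, as is the idea that all nine nodes were billed at the same rate.

```python
# Figures taken from the paragraph above.
nodes = 9
memory_gb_per_node = 30
cores_per_node = 4

# Assumptions for illustration: "a couple of hours" = 2 h, "about $1" = $1.00.
hours = 2.0
total_cost_usd = 1.0

# Aggregate cluster resources.
total_memory_gb = nodes * memory_gb_per_node
total_cores = nodes * cores_per_node

# Implied average price per node per hour under the assumptions above.
node_hours = nodes * hours
cost_per_node_hour = total_cost_usd / node_hours

print(f"Cluster: {total_memory_gb} GB memory, {total_cores} cores")
print(f"Implied cost: ${cost_per_node_hour:.3f} per node-hour")
```

Under those assumptions, the cluster totals 270 GB of memory and 36 cores, at an implied price of roughly five and a half cents per node-hour.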


It took about 10 minutes for my cluster to launch. Once your cluster is launched, its status in the Databricks console changes from “Pending” to “Running.” At this point, you can restart or terminate the cluster from the console at any time or head into the notebook to do some analysis.


Databricks does offer its own flavor of scientific computing notebooks for your Spark code. The only notebooks I’ve used prior to working with Databricks notebooks are Jupyter notebooks, primarily for Python coding. The Databricks notebooks have a sleeker appearance by comparison, and just feel a little more refined. I didn’t love the method for moving cells up and down (something I have to do often in notebooks), which required me to open and navigate a menu in the corner of the cell I wanted to move. On the other hand, being able to easily save the output of any cell to a CSV is really nice.


Since I didn’t have any S3 buckets readily full of data of the size that you would normally explore with Spark, I tried a few of the data sets that Databricks makes available for practicing with Spark. Some of these are pretty small (many of the data sets included with R are on there), but there are several that are quite large. I did some simple munging with a history of airline flight data, which had close to 1 billion rows. While I didn’t build any sophisticated models using MLlib, everything I did have Spark do executed flawlessly and quickly: building a dataframe of the data set, persisting the data in memory with the .cache() method, and running queries and frequency counts on subsets of the data. It took a few minutes to cache the data and seconds to run the queries after that.

There are many other features that Databricks offers that I didn’t get into much. You can build visualizations, and in fact, they offer integration with third-party apps like Tableau and Qlik.


I am quite impressed with Databricks and really enjoyed taking it for a spin, but for now I won’t be continuing beyond the trial. Currently, their plans start at about $250 a month, which is completely reasonable by enterprise standards, but too pricey for an individual like myself looking to increase my knowledge of Spark and do hobby projects in my free time. I think Databricks has created an interesting situation, because they have built an incredibly powerful tool for businesses to solve big data problems. However, in the process of creating a BI tool, they have made what could also be a very powerful platform for beginners learning Spark and individuals looking to do research projects. Making Spark so easy to get started with could broaden its audience even further, because several individuals I have spoken with have talked about the challenges of getting over the initial difficulties of setup and configuration. I have communicated via email with some of the Databricks team, and they said they are considering the possibility of a more affordable “developer” tier for individual users with less ambitious Spark goals. Hopefully, those plans will materialize to make Databricks available to an even wider audience.

My First Month With Ubuntu

My journey into data science is taking me all sorts of interesting places that I didn’t originally expect. That’s what I love about it. While I can feel myself accelerating into the learning curve, there’s no shortage of new things to learn and won’t be for years to come. One of the latest has… Continue Reading

TI-84 Plus CE Week


It’s TI-84 Plus CE week here at Tech Powered Math. I’ve had a couple of weeks to interact with the new TI-84 Plus CE. I also recently had a nice informational chat with Texas Instruments’ reps about some TI-84 Plus CE news and how the TI-84 product line is evolving. All that added up to more than I… Continue Reading

Texas Instruments Launches STEM Behind Health


It recently came to my attention that Texas Instruments launched a new initiative called STEM Behind Health. STEM Behind Health is designed to get students excited about health related careers. Activities from the initiative include a TI-Nspire document file as well as student worksheets and teacher notes. They were developed in conjunction with health care… Continue Reading

Favorite Podcasts for Data Scientists

One of my favorite learning methods is via podcasts. They allow me to multitask–exercising, driving, or doing chores–while listening to experts on a particular topic. Some of the podcasts I listen to are purely for entertainment (think Serial or StartUp) but many others are for educational purposes. As I’ve been trying to build up my… Continue Reading

My MOOC Study Strategies


If you’ve looked into MOOCs (Massive Open Online Courses) at all, you have probably wondered how successful students are at completing them compared to traditional courses. The short answer? Not very. I’ve seen various numbers floating around in a variety of studies, citing completion rates as low as 4% and as high as 8%, but… Continue Reading