Modern Data Science Tools: Getting Started with Databricks and Spark
This post gives a simple introduction to popular modern data science technologies. We will explore features of the Databricks platform and distributed computing with Spark using data that is freely available in the Databricks Free Edition.
Some features covered in this post include:
- running interactive code in Notebooks using the Databricks Free Edition platform
- querying the Unity Catalog data lakehouse with PySpark and SQL
- using PySpark to efficiently access and manipulate data
What is Databricks?
Databricks is a popular data science platform that facilitates everything from efficient exploratory data analysis to model building and deployment. Databricks can be hosted by popular cloud service providers, including Amazon (AWS), Microsoft (Azure), and Google (GCP), and allows for seamless connectivity to serverless and server-based compute, cloud storage, and a variety of IDEs (e.g., PyCharm, RStudio, VS Code). The Databricks Free Edition can be accessed simply by creating an account with an email, and it provides free access to many of the platform's features.
Databricks also has a powerful Notebook capability that supports the Python, R, Scala, and SQL languages. Databricks notebooks offer interactivity and an interface similar to Jupyter notebooks, but are hosted on the Databricks platform and therefore have seamless access to connected cloud services. This makes real-time analysis of data and development of complex models extremely fast and easy.
One of the primary benefits of Databricks is that it is built on Spark.
What is Spark?
Spark is a popular framework that facilitates distributed processing for working with big data. Spark is most commonly used from Scala or Python (PySpark) and allows for lazy evaluation and efficient processing of data. One great benefit of Spark is that it makes it easy to process big datasets that do not fit in memory.
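As a minimal sketch of what lazy evaluation looks like in practice (assuming the `spark` session that Databricks notebooks provide automatically):

```python
# A minimal sketch of Spark's lazy evaluation
df = spark.range(1_000_000)                        # transformation: nothing is computed yet
doubled = df.withColumn("doubled", df["id"] * 2)   # still lazy: just builds up a plan
print(doubled.count())                             # action: triggers the distributed computation
```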
Getting Started with Databricks Notebooks
Databricks notebooks are the native solution for working with data in Databricks. Inside a notebook, there are cells that can run R, Python, Scala, SQL, or Markdown code.
Running code is as simple as typing some code in a cell and clicking the Run Cell button (or ⌘+Enter/Ctrl+Enter).
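For example, any valid Python works in a cell:

```python
# A trivial Python cell
numbers = [1, 2, 3, 4, 5]
print(f"mean = {sum(numbers) / len(numbers)}")
```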
Accessing the Unity Catalog
The Unity Catalog (UC) is a data lakehouse where structured data is stored in Delta Tables and can easily be accessed directly in Notebooks by SQL or Spark queries.
Start by navigating to the Unity Catalog:
In the Databricks Free Edition, there are some free tables that can be accessed right away. In UC, tables are organized in a three-level structure: `catalog.schema.table`, for example, `samples.bakehouse.sales_customers`. Alternatively, data can easily be uploaded to a catalog and schema of choice.
UC tables can be queried directly within a notebook using SQL or (Py)Spark queries:
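As a sketch, a SQL cell querying one of the free sample tables might look like this (the table is referenced by its full three-level name):

```sql
-- Query a Unity Catalog table directly from a SQL cell
SELECT *
FROM samples.bakehouse.sales_customers
LIMIT 10;
```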
Basics of PySpark
PySpark is the Python API for Spark. Tables from UC can be accessed by referencing the table and storing it as a Spark DataFrame:
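For example, a minimal sketch using the sample table above:

```python
# Reference a Unity Catalog table as a Spark DataFrame (lazy; no data is read yet)
df = spark.table("samples.bakehouse.sales_customers")

# Inspect the schema and preview a few rows
df.printSchema()
df.show(5)
```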
Once a Spark DataFrame has been created, various PySpark functions can be used to execute SQL code, compute statistics, fit models, etc. For example, we can count the number of customers in each continent using SQL code executed by PySpark via `spark.sql()`:
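A sketch of the `spark.sql()` approach (assuming the customers table includes a `continent` column, as the example implies):

```python
# Run SQL from Python; the result comes back as a Spark DataFrame
continent_counts = spark.sql("""
    SELECT continent, COUNT(*) AS n_customers
    FROM samples.bakehouse.sales_customers
    GROUP BY continent
    ORDER BY n_customers DESC
""")
continent_counts.show()
```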
or using the PySpark API directly:
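The same aggregation expressed with DataFrame methods instead of a SQL string might look like:

```python
from pyspark.sql import functions as F

# Group by continent and count customers using the DataFrame API
continent_counts = (
    spark.table("samples.bakehouse.sales_customers")
    .groupBy("continent")
    .agg(F.count("*").alias("n_customers"))
    .orderBy(F.col("n_customers").desc())
)
continent_counts.show()
```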
Common PySpark Functions
Some common PySpark functions I use on a daily basis to manipulate Spark DataFrames (`df`) include:
- `df.count()`: returns the number of rows in `df`
- `df.show()`: prints `df`
- `df.display()`: displays `df` in an interactive table
- `df.filter()`: filters rows of `df` using SQL or PySpark syntax; although the syntax is different, `df.filter("quantity >= 15 and paymentMethod is not null")` and `df.filter((col("quantity") >= 15) & (col("paymentMethod").isNotNull()))` return equivalent results
- `df.select()`: selects columns to keep in `df`
- `df.drop()`: chooses columns to drop from `df`
- `df.join()`: joins `df` to another DataFrame on one or more columns; allows for various types of joins (e.g., left, right, inner, outer, anti)
- `df.orderBy()`: orders values by one or more columns
- `df.dropDuplicates()`: removes all duplicate rows (or rows with duplicate values of specified columns)
- `df.dropna()`: removes all rows with missing values
- `df.withColumnsRenamed()`: renames columns in `df` using a dictionary following `{'old_name': 'new_name'}` syntax. A version for working with one column at a time, `df.withColumnRenamed()`, also exists.
- `df.withColumns()`: creates/modifies columns of `df` using a dictionary of specified columns and operations, e.g., `df.withColumns({'price_squared': col("totalPrice")**2, 'price_sqroot': sqrt(col("totalPrice"))})`. A version for working with one column at a time, `df.withColumn()`, also exists.
- `df.groupBy()`: groups `df` by one or more columns to perform aggregation (i.e., compute min, max, or other statistics), e.g., `df.groupBy("customerID").agg({"totalPrice": "sum"}).withColumnRenamed('sum(totalPrice)', 'overall_totalPrice')`
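As a rough sketch of how several of these functions chain together (the `sales_transactions` table name is an assumption, and the column names follow the examples above; they may differ in your workspace):

```python
from pyspark.sql import functions as F

# Assumed table name; the bakehouse samples include transaction data
transactions = spark.table("samples.bakehouse.sales_transactions")

high_volume = (
    transactions
    .filter((F.col("quantity") >= 15) & (F.col("paymentMethod").isNotNull()))  # keep large, paid orders
    .select("customerID", "quantity", "totalPrice")                            # keep only needed columns
    .dropna()                                                                  # drop rows with missing values
    .orderBy(F.col("totalPrice").desc())                                       # largest purchases first
)
high_volume.show(5)
```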
Demonstrating PySpark on Bakehouse Data
In the Databricks Free Edition, there are free `bakehouse` tables available in UC that contain customer, transaction, and franchise data. At least in my version of those tables, there are errors in the ID columns; those errors are resolved by downloading the free “Cookies Dataset DAIS 2024” dataset, which can be found by searching the “Marketplace” in Databricks.
In the code below, I access the tables, do some filtering and joins, and perform some aggregation to determine the number of transactions, average quantity per transaction, and total amount spent by a subset of customers.
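A sketch of that workflow (the `sales_transactions` table name and the customer subset are assumptions for illustration; the column names follow the examples earlier in the post):

```python
from pyspark.sql import functions as F

# Access the bakehouse tables from Unity Catalog
# (the transactions table name is assumed; check your Catalog Explorer)
customers = spark.table("samples.bakehouse.sales_customers")
transactions = spark.table("samples.bakehouse.sales_transactions")

# Illustrative subset of customers: one continent only
subset = (
    customers
    .filter(F.col("continent") == "Europe")
    .select("customerID")
    .dropDuplicates()
)

# Join the subset to their transactions and aggregate per customer
per_customer = (
    transactions
    .join(subset, on="customerID", how="inner")
    .groupBy("customerID")
    .agg(
        F.count("*").alias("n_transactions"),      # number of transactions
        F.avg("quantity").alias("avg_quantity"),   # average quantity per transaction
        F.sum("totalPrice").alias("total_spent"),  # total amount spent
    )
    .orderBy(F.col("total_spent").desc())
)
per_customer.display()
```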
Conclusion
As you can see, working in Databricks is straightforward, and using PySpark to work with UC is seamless. The benefits are even larger for big datasets that do not fit in memory. Databricks has even more great built-in features that will be explored in a future blog post.