Unlock SQLite: Easy Data Analysis for Busy Software Engineers
Episode #19: Let me help you navigating the myriad of tools and solutions for data analysis.
If you are a busy software engineer, you can't spend weeks thinking about the best solution for the job and then learning those tools.
Let me help make this choice much more manageable.
In this article, we will discuss SQLite, which I believe is the best tool for data analysis for relatively small datasets.
The article is divided into the following sections:
When is SQLite the best tool for the job?
How to get started with SQLite
Extensions and tools for SQLite
Want to connect?
👉 Follow me on LinkedIn and Twitter.
If you need 1-1 mentoring sessions, please check my Mentorcruise profile.
When is SQLite the best tool for the job?
You are a busy Software engineer who needs to analyse a relatively small dataset (e.g., 30-100MB of CSV or JSON logs) for a new project.
What tools should you use?
Any big data tools like Spark or Hadoop would be an overkill for such a small dataset.
Setting up a Python project with a Pandas data frame is the best option in most cases, however, depending on the complexity of the analysis, that might even be overkill.
As I explained in a previous article called Effortless Python Development with Nix, setting up a Python project comes with some complexity.
You need to set up a virtual environment and install dependencies in a reproducible way. If you are sharing this project with someone else, you need to provide documentation on how to get started.
Using Nix makes onboarding new people, or even automating the setup process, a lot easier, but at the cost of installing Nix on your laptop.
Using command-line tools like awk and grep is definitively more straightforward to set up, but those tools may not provide enough flexibility. You also need to install those tools. Nix can also help you in this case, but you still have to install Nix.
I'm a huge fan of command-line tools; I use them all the time in my daily job.
I found a fascinating book on this topic called data science at the command line.
The book is now free, so I recommend reading it. While I enjoyed reading it, I don't agree with all the tools presented in this book. For example, I am not a fan of Makefile. If you want to learn more about this topic, I wrote an article at Taskfile: a modern alternative to Makefile. Having said that, knowing those tools is highly beneficial when learning about command line tools.
Given that, as most software engineers, you know SQL, you could probably load all your data into a MySql or PostegreSQL database. You could install those databases locally or use managed services. You are trading installation time for the cost of paying for cloud services.
Is there a more straightforward solution that doesn't involve complex setup or cloud costs?
What else can you do?
If only we could come up with a solution that is powerful, flexible, easy to onboard, doesn't require cloud costs, and doesn't feel like an overkill for such a small dataset and simple analysis.
Am I asking too much?
I think I have a solution to this problem.
Discover SQLite:
SQLite is the most common DBMS in the world, shipped with all popular operating systems.
SQLite is serverless.
For developers, SQLite is embedded directly into the app.
For everyone else, there is a convenient database console (REPL), provided as a single file (sqlite3.exe on Windows, sqlite3 on Linux / macOS).
Those quotes are from an excellent article called SQLite is not a toy database.
How to get started with SQLite
The console is a killer SQLite feature for data analysis: more powerful than Excel and simpler thanÂ
pandas
. One can import CSV data with a single command, and the table is created automatically:
> .import --csv city.csv city
> select count(*) from city;
1117
The quote and code are from this excellent article SQLite is not a toy database.
There are plenty of resources on how to get started with SQLite. I don't think I'll provide helpful information by repeating the same information here, so I'll give some good starting points below.
Some other learning resources:
Extensions and tools for SQLite
As we discussed before, SQLite is not a toy database. Sometimes, though, it can feel like it is not as powerful as well-known alternatives like MySQL or PostgreSQL.
In the last few years, many extensions and tools have been developed to extend SQLite's functionality.
To name a few extensions:
sqlite-xsv: fast extension for CSV files
sqlite-url: functions to work with URLs.
sqlean: extension pack with some missing functions like math functions, cryptographic algorithms, regex support and much more.
All those extensions are libraries that must be downloaded locally and then loaded into the database at runtime. For more information on how to load a SQLite extension, see Runtime Loadable Extensions.
Other than extensions, there are also some tools which can improve your experience working with SQLite:
sqlite-utils: library and command-line utility that helps create SQLite databases from an existing collection of data
sqlpkg-cli: manages SQLite extensions, just likeÂ
pip
 does with Python packages orÂbrew
 does with macOS programs. This tool allows you to install those SQLite extensions that we mentioned before (and many more) from a SQLite package registry.
I have created a GitHub repo with some Nix files, Taskfile and a Readme to make the experience of installing those extensions much more pleasant.