When I first started working on the climate analysis program, I knew I would be working with a larger-than-memory dataset, but what I didn’t fully realize was how much extra complexity that would bring…
Keep it Simple!
I downloaded the daily summaries from the GHCN dataset a few weeks ago and saw that the archive was around 6GB compressed. I thought I would keep it simple, stupid, and just extract it… until it filled the remaining space on my computer’s SSD and had been extracting for more than 24 hours. That clearly wasn’t going to work, so I started Googling around.
Digging into the Details
I decided on the HDF5 (H5) file format because it supports compression (unlike the databases I considered), and the PyTables library adds functionality that could come in useful. It took me multiple iterations to figure out how to convert the tar file I downloaded, which contains CSVs of climate data, into an H5 file in a timely fashion; that is how big this dataset is (60GB compressed as an H5 file). I also found Dask, a library for larger-than-memory computing, which is perfect for this. I recently created a GitHub repo for the program in its current state. I am in the very early stages of development, but I do have a function that takes in a tar file and produces an H5 file, which I think is a good start.
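To give a sense of what that function involves, here is a minimal sketch of the tar-to-H5 idea using pandas’ PyTables-backed HDFStore. This is not the code from the repo: the paths, the "observations" key, and the assumption that every CSV shares one schema are all mine, and the real per-station GHCN files are messier than this (which is part of why the conversion took multiple iterations).

```python
import tarfile

import pandas as pd

# Hypothetical paths -- point these at the real archive and output file.
TAR_PATH = "daily-summaries.tar.gz"
H5_PATH = "daily-summaries.h5"


def tar_to_h5(tar_path: str, h5_path: str) -> None:
    """Stream CSV members out of a tar archive into one compressed HDF5 table.

    Nothing is extracted to disk: each member is read straight into a
    DataFrame and appended to a single PyTables table with blosc compression.
    """
    with tarfile.open(tar_path, "r:*") as tar, pd.HDFStore(
        h5_path, mode="w", complevel=9, complib="blosc"
    ) as store:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".csv"):
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            df = pd.read_csv(handle, low_memory=False)
            # format="table" makes the dataset appendable; min_itemsize pads
            # string columns so later, longer values still fit. This assumes
            # every CSV has the same columns, which the real per-station
            # files do not guarantee.
            store.append("observations", df, format="table",
                         min_itemsize={"values": 64})


if __name__ == "__main__":
    tar_to_h5(TAR_PATH, H5_PATH)
```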
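Once the data is in an H5 table, Dask can scan it chunk by chunk without ever holding the whole thing in memory. Again just a sketch, reusing the hypothetical path and key from above:

```python
import dask.dataframe as dd

# Same hypothetical path and key as the conversion sketch above.
ddf = dd.read_hdf("daily-summaries.h5", key="observations", chunksize=1_000_000)

# Nothing is read yet; Dask builds a task graph and streams one
# million-row chunk at a time through memory when it computes.
print(len(ddf))  # total number of observations

# Hypothetical column name -- substitute a real one from the converted table:
# print(ddf["TMAX"].mean().compute())
```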