Blog Post, Climate Analysis, Programming

Starting Out: Climate Change Analysis

When I first started working on the climate analysis program, I knew that I would be working with a larger-than-memory dataset, but what I didn't fully realize was how much extra complexity large datasets bring with them…

Keep it Simple!

I downloaded the daily summaries from the GHCN dataset a few weeks ago and saw that the archive was around 6GB compressed. I thought I would keep it simple, stupid, and just unzip it… until it filled the remaining space on my computer's SSD and the extraction had been running for more than 24 hours. This obviously wasn't even close to optimal, so I started Googling around.

Digging into the Details

I decided on the HDF5 (.h5) file format because it supports compression, unlike the database options I considered, and the PyTables library adds functionality that could come in useful. It took me multiple iterations to figure out how to convert the tar file of climate CSVs that I downloaded into an H5 file in a timely fashion; that is how big this dataset is (60GB even compressed as an H5 file). I also found Dask, a library built for larger-than-memory computing, which is perfect for this. I recently created a GitHub repo for my program in its current state. I am in the very early stages of development at this point, but I do have a function that takes in a tar file and produces an H5 file, which I think is a good start.
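To make the idea concrete, here is a minimal sketch of what a tar-to-HDF5 conversion can look like. This is not the code from my repo: the function name, the archive name, and the "ghcnd" key are placeholders, it assumes every CSV in the archive shares the same columns (real GHCND station files don't), and it relies on pandas plus PyTables being installed.

```python
import tarfile
import pandas as pd

def tar_csvs_to_hdf(tar_path, hdf_path, key="ghcnd"):
    """Stream CSV members out of the tar and append them to one HDF5 table."""
    with tarfile.open(tar_path, "r:gz") as tar, \
         pd.HDFStore(hdf_path, mode="w", complevel=9, complib="blosc") as store:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".csv"):
                continue
            fileobj = tar.extractfile(member)
            if fileobj is None:
                continue
            # NOTE: a real version would first normalise columns/dtypes,
            # since GHCND station CSVs don't all carry the same elements.
            df = pd.read_csv(fileobj, low_memory=False)
            # format="table" makes the dataset appendable; min_itemsize pads
            # string columns so longer values in later files still fit.
            store.append(key, df, format="table", min_itemsize=64)

# Placeholder file names, not the actual paths from my repo.
tar_csvs_to_hdf("daily-summaries-latest.tar.gz", "ghcnd.h5")
```

Once the data lives in a single HDF5 table, Dask can open it lazily with `dask.dataframe.read_hdf` and work through it in chunks instead of pulling everything into memory at once.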

Blog Post, Climate Analysis, Programming

Where is all the Data on Climate Change?

Science and the pursuit of understanding are not reserved for scientists with advanced degrees and the resources that only a research institution can provide. Anyone can perform their own analysis of whether the Earth is warming or not; it is just a matter of knowing where to look.

I thought that it would be an interesting project to try a simple analysis of climate data using publicly available data on the subject (what a time to be alive!). After looking at multiple datasets, I chose the GHCND (Global Historical Climatology Network Daily) dataset, as it has many years of data from all around the globe. This blessing can be a curse as well, though: the downloaded dataset is several gigabytes, and uncompressed it is much, much larger than that, although uncompressing it isn't strictly necessary. If you want to look at other datasets, Erika Wise has a great list to pick from.
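To show why uncompressing the whole archive isn't strictly necessary, here is a small sketch using Python's standard tarfile module to peek at a CSV inside the download without unpacking anything to disk. The archive name below is just a placeholder for whatever the GHCND download is called on your machine.

```python
import tarfile

# Placeholder name for the downloaded GHCND archive.
with tarfile.open("daily-summaries-latest.tar.gz", "r:gz") as tar:
    for member in tar:
        if member.isfile() and member.name.endswith(".csv"):
            with tar.extractfile(member) as f:
                header = f.readline()   # read straight from the archive
            print(member.name, header[:80])
            break  # just peek at the first station file
```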

This will be the first post in a series about a Python program that I am developing to analyze climate data.