The one billion row challenge has taken the world by storm. Today we are going to see what the reason behind this storm is…
Ok, enough with the theatrics. I recently came across the 1 billion row challenge through a video on Primeagen's channel, and I instantly wanted to try it out myself. I really liked the iterative approach taken by the author of the original article.
I have recently started working with Go, and I thought this would be a good way to get my hands dirty.
First things first
We need to create the dataset first. The original repo provides scripts in Python and Java to generate the data. Using the Python script, I was able to generate 1 billion rows in approximately 6 minutes.
I want to approach this in an iterative manner. Starting with the most basic version, I will just read the entire file in one go and process it in a single loop.
Initial attempt
This approach was an utter failure. The code did the following (a rough sketch follows the list):
- Read the entire file into memory
- Split the string by new lines
- Split each line by the semicolon
- Convert the temperature readings to float32
- Store these readings in a slice of structs
- Run a loop over the slice and create a list of aggregated readings
- Sort the aggregated readings by station name
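A rough sketch of what that first version looked like (my own reconstruction, not the author's actual code; the measurements.txt file name and the min/mean/max output format come from the original challenge):

package main

import (
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

type reading struct {
	station string
	temp    float32
}

type stats struct {
	min, max, sum float32
	count         int
}

func main() {
	// Read the entire file into memory in one go.
	data, err := os.ReadFile("measurements.txt")
	if err != nil {
		panic(err)
	}

	// Split the whole string by newlines, then each line by the semicolon,
	// and store every parsed reading in a slice of structs.
	lines := strings.Split(string(data), "\n")
	readings := make([]reading, 0, len(lines))
	for _, line := range lines {
		if line == "" {
			continue
		}
		parts := strings.Split(line, ";")
		temp, _ := strconv.ParseFloat(parts[1], 32)
		readings = append(readings, reading{station: parts[0], temp: float32(temp)})
	}

	// Run a loop over the slice and build the aggregated readings.
	agg := make(map[string]*stats)
	for _, r := range readings {
		s, ok := agg[r.station]
		if !ok {
			agg[r.station] = &stats{min: r.temp, max: r.temp, sum: r.temp, count: 1}
			continue
		}
		if r.temp < s.min {
			s.min = r.temp
		}
		if r.temp > s.max {
			s.max = r.temp
		}
		s.sum += r.temp
		s.count++
	}

	// Sort the aggregated readings by station name and print min/mean/max.
	names := make([]string, 0, len(agg))
	for name := range agg {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		s := agg[name]
		fmt.Printf("%s=%.1f/%.1f/%.1f\n", name, s.min, s.sum/float32(s.count), s.max)
	}
}

Holding the whole file as one giant string plus a billion structs at the same time is presumably what got the process killed.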
Easy peasy. The problem? The process got killed before it could parse all the lines. Reading the entire file took 18 seconds, and some quick logging showed that splitting the content by newlines took 1 minute 24 seconds; after that it managed to parse about 620 million lines before dying a vain death at around the 5-minute mark. I tried running the script on just 10 million rows and that worked, taking 2.27 seconds for the whole thing.
$ go run .
Read in 25.837260834s
Split in 1m24.98397775s
0 million parsed in 1m24.984116542s
10 million parsed in 1m39.661467084s
20 million parsed in 1m42.114975042s
30 million parsed in 1m44.627674834s
40 million parsed in 1m47.506549709s
50 million parsed in 1m50.009006375s
60 million parsed in 1m52.586154834s
70 million parsed in 1m55.26839675s
80 million parsed in 1m57.691080834s
90 million parsed in 2m0.460094709s
100 million parsed in 2m2.953258209s
Next shot
The naive attempt didn't work, so it was time for a slightly less naive attempt. The plan was to read a line and process it right away. The major change from the previous approach is that instead of parsing the one billion lines and storing them in an array, we aggregate them directly as we read, so in essence we save the space required for the array.
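Here is a similarly rough sketch of that second version, assuming the same file layout: stream the file line by line with a bufio.Scanner and fold each reading straight into a per-station aggregate, so the billion-element slice never exists.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

type stats struct {
	min, max, sum float32
	count         int
}

func main() {
	f, err := os.Open("measurements.txt")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	agg := make(map[string]*stats)

	// Read one line at a time and aggregate it immediately, instead of
	// keeping a billion parsed readings in memory.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		sep := strings.IndexByte(line, ';')
		if sep < 0 {
			continue
		}
		station := line[:sep]
		temp64, _ := strconv.ParseFloat(line[sep+1:], 32)
		temp := float32(temp64)

		s, ok := agg[station]
		if !ok {
			agg[station] = &stats{min: temp, max: temp, sum: temp, count: 1}
			continue
		}
		if temp < s.min {
			s.min = temp
		}
		if temp > s.max {
			s.max = temp
		}
		s.sum += temp
		s.count++
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}

	// Sorting and printing by station name is the same as in the first
	// version, so only a summary is printed here.
	fmt.Println("stations aggregated:", len(agg))
}

The default bufio.Scanner buffer is more than enough for these short lines, and keeping pointers in the map lets each aggregate be updated in place without extra copying.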
[contd…]