Building the Air Quality Index Charts
Why Air Quality Index Charts?
Living in a city that is notorious for its air pollution, I certainly care about the environment, especially the air quality. Luckily the air has been so clean these days that it made me wonder what the chances are of seeing a blue sky like this throughout the year.
Before I delved into data acquisition, I already knew that it would be difficult to find adequate or accurate data, since the monitoring and measurement of air quality in Beijing has been controversial for a while. Besides, the relevant government departments only recently started to record AQI (or at least to make it public), so there is not much historical data.
For example, this site offers detailed hourly AQI data publicly, but it doesn’t provide historical data; as described in its documentation, if you want historical data, you have to scrape it on your own. This is a bummer for sure. I also looked at Forecast.io, which provides an API, but unfortunately it only offers weather data, which is not exactly what I was looking for. A bunch of other sites only provide a real-time display, like this one. Fortunately, I found that this site offers historical AQI data. Woohoo!
As I mentioned in earlier posts, in reality, a significant amount of time will be spent on finding and cleaning data before starting to render charts in the browser. This one is no exception.
The first hurdle was that the earliest record is from April 2014, even though the site says you can select dates back to 2013. But it was still better than nothing, right? The second hurdle was: which monitoring site’s data should I use? There are a dozen monitoring sites in Beijing. Should I use them all or just one of them? Given the outcome I had in mind, I decided to use only one site’s data, and honestly, I chose it because it has fewer missing data points. This could be misleading in some cases, but it was enough to get started.
Back to where I started, I wanted to see how many days in Beijing were ‘good days’, and how bad it was on a polluted day.
I thought a calendar view would be a really nice way of showing the ‘good days’ and ‘bad days’, not only because you can get a glimpse of the chances of breathing clean air, but also because I can use colors to indicate how severe the air pollution was on a specific day.
Beyond the daily indicator, I was also curious about the hourly data. I assumed that the air quality at night would be worse, because factories would discharge pollutants around midnight, when there is none of the supervision they face during the day. (It turned out to be true for more than 50% of these days.) This meant that I also needed a line chart/scatterplot to show the hourly data for one day.
The last thing I wanted to know was which substances were in the air. According to this official environmental protection standard, the air quality index is calculated from different pollutants in the air, including SO2, NO2, CO, O3, PM10, and PM2.5. I wanted to know the general status of each substance, especially PM2.5. Rather than using another line chart or bar chart, I decided to use an indicator-like thingy (not sure if there is a name for it) to show the substance data.
Daily Grid Chart
If you look at other calendar view charts, like this alternative calendar view or the Days-Hours Heatmap, their timespan is either fixed or the same as a calendar year. But in my case, the timespan is from 2014/04/14 to 2015/06/09, and since complete data was not available for either year, I wanted all the daily cubes in one place instead of segmented into two parts (2014, 2015). So I had to treat 2014 and 2015 differently so that the rects won’t overlap.
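One way to keep the rects of the two years from overlapping is to count week columns from the first week of the data rather than from the start of each year. The sketch below is not the code from the demo; gridCell and utcDate are hypothetical helpers, and it assumes weeks start on Sunday.

```javascript
var MS_PER_DAY = 24 * 60 * 60 * 1000;

// Build a date at UTC midnight so daylight saving time cannot skew the day count.
function utcDate(year, month, day) {
  return new Date(Date.UTC(year, month - 1, day));
}

// Hypothetical helper: return the grid cell for `date`, counting week columns
// from the week (starting on Sunday) that contains `start`.
function gridCell(date, start) {
  var firstSunday = start.getTime() - start.getUTCDay() * MS_PER_DAY;
  var daysSince = Math.round((date.getTime() - firstSunday) / MS_PER_DAY);
  return {
    column: Math.floor(daysSince / 7), // keeps growing across the year boundary
    row: date.getUTCDay()              // 0 = Sunday ... 6 = Saturday
  };
}

var start = utcDate(2014, 4, 14);
var lastOf2014 = gridCell(utcDate(2014, 12, 31), start);
var firstOf2015 = gridCell(utcDate(2015, 1, 1), start);
var lastDay = gridCell(utcDate(2015, 6, 9), start);
```

Because the column index never resets in January, 2014/12/31 and 2015/01/01 end up as neighbours in the same column. Multiplying column and row by the cell size would give the x and y of each rect.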
Scale is Tricky
In other calendar view demos, each month is outlined by a black line, but in this case the start date is not in January and the end date is not in December, so I thought I’d better not use lines to ‘wrap’ each month. But then how do I tell viewers which area corresponds to which month? Obviously a time scale was not exactly applicable: because the rects are arranged by week number, the layout is not quite ‘linear’. More importantly, even if I could come up with an appropriate time scale, it wouldn’t be accurate. For instance, if May 1st is in the fourth row, does its column count as April or May? Because of this, I adjusted the goal to showing only approximately where each month’s area is. So I tried to modify the interpolate function, although the result is not very accurate.
It is definitely not the best solution, but it worked somehow, and I came to realize how interesting a customized interpolate function can be.
The line chart and scatterplot are nothing new, but three things are worth noting:
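To give an idea of what a customized interpolate function can look like, here is a minimal sketch and not the one used in the demo. It assumes a fixed cell size and snaps every interpolated position to the left edge of a week column; interpolateColumn, cellSize and columns are made-up names and values. The returned factory has the (a, b) -> (t -> value) shape that a D3 scale accepts in .interpolate().

```javascript
// Hypothetical interpolator factory that snaps positions to week columns.
function interpolateColumn(cellSize) {
  return function (a, b) {
    return function (t) {
      var x = a + (b - a) * t;                     // plain linear interpolation
      return Math.floor(x / cellSize) * cellSize;  // left edge of the column
    };
  };
}

var cellSize = 10; // assumed rect size in pixels
var columns = 61;  // assumed number of week columns in the grid
var interpolate = interpolateColumn(cellSize)(0, columns * cellSize);

var labelX = interpolate(0.5); // halfway through the timespan
```

A month label placed this way lands on the column that contains the first day of the month, which is only approximately where the month’s area begins.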
To be honest, I didn’t realize how important updating a selection is, beyond the transition effect, until I used a real-life dataset in this demo. The key takeaway is that real-life data is never perfect: there will be missing data points, or even redundant data points sometimes. Using a key function in selection.data can be very handy for handling the update selection, the enter selection, and the exit selection.
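What the key function changes can be illustrated without D3. The sketch below is a simplified model of a keyed join, not D3’s implementation; joinByKey, byHour and the sample readings are made up. In D3 itself the equivalent call would be selection.data(newData, byHour).

```javascript
// Simplified model of a keyed data join: match old and new data by key
// instead of by array index.
function joinByKey(oldData, newData, key) {
  var oldKeys = {};
  var newKeys = {};
  oldData.forEach(function (d) { oldKeys[key(d)] = true; });
  newData.forEach(function (d) { newKeys[key(d)] = true; });
  return {
    // new data whose key already has an element on screen
    update: newData.filter(function (d) { return oldKeys[key(d)] === true; }),
    // new data that needs a new element
    enter: newData.filter(function (d) { return oldKeys[key(d)] !== true; }),
    // old data whose element should be removed
    exit: oldData.filter(function (d) { return newKeys[key(d)] !== true; })
  };
}

function byHour(d) { return d.hour; }

// Hypothetical hourly readings; hour 1 is missing on the second day.
var dayOne = [{ hour: 0, aqi: 80 }, { hour: 1, aqi: 95 }, { hour: 2, aqi: 120 }];
var dayTwo = [{ hour: 0, aqi: 60 }, { hour: 2, aqi: 70 }, { hour: 3, aqi: 90 }];

var result = joinByKey(dayOne, dayTwo, byHour);
```

Without a key, the join is by index, so the circle for hour 1 would be reused for hour 2 and the missing hour would go unnoticed. With the key, hour 1 ends up in exit and hour 3 in enter.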
Hover on the circle
Another interesting thing is the click event, but I am not going to talk about the event listener itself. Say you’ve bound a mouseenter listener to circles styled with fill:none;stroke:blue;stroke-width:2px, but nothing happens when you hover over a circle unless the mouse points at its edge, the blue line. Because the fill is none, the inside of the circle is not painted, so technically you were not hovering over the circle.
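One common way around this, which may or may not be what the demo ended up doing, is the SVG pointer-events property: with the value all, the interior of a shape receives pointer events even when its fill is none. ringStyle below is a hypothetical helper that only builds the style text.

```javascript
// Hypothetical helper: build the inline style for a hollow circle.
function ringStyle(hoverable) {
  var style = { fill: 'none', stroke: 'blue', 'stroke-width': '2px' };
  if (hoverable) {
    // 'all' lets the unpainted interior receive pointer events too.
    style['pointer-events'] = 'all';
  }
  return Object.keys(style).map(function (name) {
    return name + ':' + style[name];
  }).join(';');
}

var styleText = ringStyle(true);
// In the browser this could be applied with circle.setAttribute('style', styleText).
```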
Color Scale Next to Y Axis
The typical legend for a heatmap is like the one in this chart. In my case, I needed the legend to help explain not only the daily grid chart but also the line chart and the indicator chart. So I thought, why not place the color legend next to the Y axis, with the length of each color corresponding to the y scale? That way, viewers spend less time decoding the meaning of each color in these three charts.
The biggest problem with the indicator chart is reusability. Each substance has its own index standard and therefore its own scale, but the rest of the settings are the same. It would save a lot of time if the rest of the settings were reusable. Generally speaking, reusable components in d3 charts are quite interesting, and there are some tools and libraries for them, such as dc.js or this d3kit from Twitter.
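The idea can be sketched as one rect per color band, with its position and height taken from the same scale as the Y axis. The band boundaries, colors, chart height and helper names below are assumptions for illustration, and linearScale is only a stand-in for a D3 linear scale.

```javascript
// Assumed AQI bands and colors, for illustration only.
var bands = [
  { from: 0, to: 50, color: 'green' },
  { from: 50, to: 100, color: 'yellow' },
  { from: 100, to: 150, color: 'orange' },
  { from: 150, to: 200, color: 'red' },
  { from: 200, to: 300, color: 'purple' },
  { from: 300, to: 500, color: 'maroon' }
];

// Minimal linear scale, standing in for a D3 linear scale.
function linearScale(domain, range) {
  return function (value) {
    var t = (value - domain[0]) / (domain[1] - domain[0]);
    return range[0] + t * (range[1] - range[0]);
  };
}

// One rect per band, drawn just left of the axis line at x = 0.
function legendRects(bands, y, width) {
  return bands.map(function (band) {
    var top = y(band.to);      // larger AQI values sit higher on the screen
    var bottom = y(band.from);
    return { x: -width, y: top, width: width, height: bottom - top, fill: band.color };
  });
}

var y = linearScale([0, 500], [400, 0]); // assumed 400px tall chart
var rects = legendRects(bands, y, 8);
```

Since the rects share the Y scale, a point on the line chart always sits level with the band whose color it would get in the grid.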
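A common way to get this kind of reuse is a closure with getter/setter methods, so that only the scale differs per substance. The sketch below is a hypothetical indicator, not the demo’s code; the domains are illustrative and not the official limits.

```javascript
// Hypothetical reusable indicator: shared settings live in the closure,
// and each instance is configured through chained getter/setters.
function indicator() {
  var domain = [0, 100]; // value range of one substance
  var width = 200;       // bar length in pixels

  // Map a reading to a marker position, clamped to the bar.
  function chart(value) {
    var t = (value - domain[0]) / (domain[1] - domain[0]);
    return Math.max(0, Math.min(1, t)) * width;
  }

  // Call with no argument to read a setting, with one argument to change it.
  chart.domain = function (d) {
    if (!arguments.length) return domain;
    domain = d;
    return chart;
  };
  chart.width = function (w) {
    if (!arguments.length) return width;
    width = w;
    return chart;
  };
  return chart;
}

var pm25 = indicator().domain([0, 500]).width(300);
var so2 = indicator().domain([0, 800]); // keeps the default width

var pm25X = pm25(250); // halfway along a 300px bar
var so2X = so2(1000);  // above the domain, clamped to the end of the bar
```

Each substance gets its own instance with its own domain, while the drawing logic is written once.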
Things to Improve
From the perspective of data utilization, I would love to use data from more monitoring sites, or place them on a map. Furthermore, I think it would be a good idea to run some scripts that automatically collect the relevant data, so the visualization can be more up-to-date.
From the perspective of the code itself, the improvement I want to make the most is reusability, for sure.
To recap: 1) real-world data is never readily available or perfect, so you’ll spend a lot of time on data acquisition and data cleaning; 2) before you actually start to write d3 code, think through what kind of presentation you need; 3) reusability is crucial.
And here is the link for the building block.