Friday, September 29, 2017

Nested gridding, HADCRUT, and Cowtan/Way

Update: I had made an error in coding for the HADCRUT/C&W example - see below. The agreement with C&W is now much improved.

In my previous post, I introduced the idea of hierarchical, or nested, gridding. In earlier posts, eg here and here, I had described using Platonic solids as a basis for grids on the sphere that are reasonably uniform and free of the pole singularities of latitude/longitude grids. I gave data files for icosahedral hexagon meshes of various degrees of resolution, usually proceeding by a factor of two in cell number or length scale. In that previous post, I emphasised the simplicity of a scheme for working out which cell a point belongs to, by finding the nearest cell centre point. I foreshadowed the idea of embedding each such grid in a coarser parent, with grid averaging proceeding downward, and using the progressively accumulated information to supply estimates for empty cells.
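As a minimal sketch of that nearest-centre idea (the to3d function and the toy centres here are purely illustrative, not taken from the hexmin files): convert a point to a 3D unit vector, take its scalar product with each cell centre, and pick the largest.

to3d=function(lat,lon){  # lat/lon in degrees to a unit vector in 3D
 a=pi/180
 c(cos(lat*a)*cos(lon*a), cos(lat*a)*sin(lon*a), sin(lat*a))
}
centres=rbind(to3d(90,0), to3d(0,0), to3d(-90,0))  # three toy cell centres
p=to3d(80,120)            # a test point at 80N, 120E
which.max(centres %*% p)  # 1: the point goes to the cell centred near the north pole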

The following graph from HADCRUT illustrates the problem. It shows July 2017 temperature anomalies on a 5°x5° grid, with colors for cells that have data, and white otherwise. They average the colored area, and omit the rest from the average. As I often argue, as a global estimate this effectively replaces the rest by the average value. HADCRUT is aware of this, because they actually average by hemispheres, which means the infilling is done with the hemisphere average rather than the global one. As they point out, this has an important benefit in earlier years, when the majority of missing cells were in the SH, which was also behaving differently, so the hemisphere average is more appropriate than the global one. On the right, I show the same figure, but this time with my crude coloring in (with Paint) of that hemisphere average. You can assess how appropriate the infill is:



A much-discussed paper by Cowtan and Way (2013) noted that this process led to a bias: the areas thus infilled tended not to behave like the average, but were warming faster, so warming was underestimated, particularly since 2000 because of the Arctic. They described a number of remedies; I'll concentrate on their use of kriging, a fairly elaborate geostatistical interpolation method. When it was applied, trends based on HADCRUT data increased, bringing them more in line with other indices which did some degree of interpolation.

I think the right way to look at this is as getting the infilling right. HADCRUT was on the right track in using hemisphere averages, but the infill should be much more local: every missing cell should be assigned the best estimate based on nearby data. This is in the spirit of spatial averaging. The cells are chosen as regions of proximity to a finite set of measurement points, and are assigned an average of those points because of that proximity. Proximity does not end at an artificial cell boundary.
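As a toy illustration of the point made above (made-up numbers, equal-area cells assumed): averaging only the cells with data gives exactly the same result as infilling every empty cell with that average and then averaging all cells.

anom=c(0.5, 0.9, NA, 1.2, NA)           # toy cell anomalies; NA = no data
a1=mean(anom, na.rm=TRUE)               # average over covered cells only
filled=ifelse(is.na(anom), a1, anom)    # infill empty cells with that same average
a2=mean(filled)                         # average over all cells
c(a1, a2)                               # identical: 0.8667 0.8667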

In the previous post, I set up a grid averaging based on an inventory of about 11000 stations (including GHCN and ERSST), but integrated not temperature but the simple function sin(latitude)^2, whose correct area-weighted average is 1/3. I averaged omitting empty cells, and showed that at coarse resolution the correct value was closely approximated, but that this degraded with refinement, because of the growing number of empty cells. I'll now complete that table using nested integration on the hexagonal grids. At each successive level, if a cell is empty, it is assigned the average value of the smallest cell from a previous integration that includes it. (I have fixed the which.min error here too; it made little difference.)

level  Numcells  Simple average  Nested average
1            32          0.3292          0.3292
2           122          0.3311          0.3311
3           272          0.3275          0.3317
4           482          0.3256          0.3314
5          1082          0.3206          0.3317
6          1922          0.3167          0.3320
7          3632          0.3110          0.3313
8          7682          0.3096          0.3315
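The exact value being approximated here is 1/3, which a quick latitude-band quadrature (independent of any grid) confirms:

lat=seq(-89.95, 89.95, by=0.1)*pi/180  # latitude band centres, in radians
w=cos(lat)                             # relative band areas
sum(sin(lat)^2*w)/sum(w)               # 0.3333...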


The simple average shows that there is an optimum: a grid fine enough to resolve the (small) variation, but coarse enough to have data in most cells. The function is smooth, so there is little penalty for being too coarse, but a larger one for being too fine, since the areas of empty cells coincide with the function's peaks at the poles. The merit of the nested average is that it removes this downside. Further refinement may not help very much, but it does no harm, because a near-local value is always used.

The actual coding for nested averaging is quite simple, and I'll give a more complete example below.
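The core of it is just the inheritance step. Here is a minimal sketch, using the same names (cellnum, cellsum, ave0, gp) as the full code below: any cell with no data of its own takes the average of its parent cell from the previous, coarser level.

empty=which(cellnum==0)    # cells with no data at this level
for(i in empty){ cellnum[i]=1; cellsum[i]=ave0[gp[i]] }  # inherit the parent cell's average
ave=cellsum/cellnum        # cell values, now complete at this level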

HADCRUT and Cowtan/Way

Cowtan and Way thankfully released a very complete data set with their paper, so I'll redo their kriging calculation with nested gridding and compare results. They used HADCRUT 4.1.1.0, with data ending at the end of 2012. Here is a plot of results from 1980, with nested integration of the HADCRUT gridded data at cell centres (but on a hex grid). I'm showing every even step as hier1-4, with hier4 being the highest resolution at 7682 cells. All anomalies are relative to 1961-90.


Update: I had made an error in coding for the HADCRUT/C&W example - see code. I had used which.min instead of which.max. This almost worked, because it placed locations in cells on the opposite side of the globe, consistently. With the fix, the result is now much more consistent with C&W. With refinement, the integrals now approach from below, and also converge much more tightly.

The HADCRUT 4 published monthly average (V4.1.1.0) is given in red, and the Cowtan and Way Version 1 kriging in black. The nested integration makes even more difference than C&W, mainly in the period from 1995 to the early 2000s. Otherwise, as with C&W, it adheres closely to HADCRUT in earlier years, when presumably there isn't much bias associated with the missing data. C&W focussed on the effect on OLS trends, particularly since 1/1997. Here is a table, in °C/Century:

            Trend 1997-2012   Trend 1980-2012
HAD 4             0.462             1.57
Hier1             0.902             1.615
Hier2             0.929             1.635
Hier3             0.967             1.625
Hier4             0.956             1.624
C&W krig          0.97              1.689


Convergence to the C&W trend that I calculate is very good. In their paper, for 1997-2012 C&W give a trend of 1.08 °C/Century (Table III), which does not agree so well with the nested results. C&W used ARMA(1,1) rather than OLS, but the discrepancy seems too large for that. Update: Kevin Cowtan has explained the difference in a comment below.
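For reference, the trends above are ordinary least squares fits to the monthly global averages, converted to °C/century. A sketch of the calculation for one level, assuming the monthave[month, year, level] array produced by the code below (year index 18 corresponds to 1997; the factor 1200 converts a per-month slope to per century):

y=c(monthave[,18:33,8])  # finest level (hier4), monthly averages 1997-2012
t=seq_along(y)           # time in months
coef(lm(y~t))[2]*1200    # OLS trend in degrees C per century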

Method and Code

This is the code for the integration of the monthly sequence. I'll omit the reading of the initial files and the graphics, and assume that we start with the HADCRUT 4.1.1.0 gridded 1980-2012 data reorganised into an array had[72,36,12,33] (lon, lat, month, year). The hexmin[[]] lists are as described and posted previously; the 4 columns of $cells are the cell centres (in 3D Cartesian coordinates) and the cell areas (on the sphere). The first section just makes pointer lists from the anomaly grid points into the hex grids, and from each grid into its parent. If you were doing this regularly, you would store the pointers and re-use them as needed, since they depend only on location data. The result is the gridpointer and datapointer lists. The code takes a few seconds.
monthave=array(NA,c(12,33,8)) #array for monthly averages 
datapointer=gridpointer=cellarea=list(); g=0;
for(i in 1:8){ # making pointer lists for each grid level
 g0=g;  # previous g
 g=as.matrix(hexmin[[i]]$cells); ng=nrow(g);
 cellarea[[i]]=g[,4]
 if(i>1){ # pointers to coarser grid i-1
  gp=rep(0,ng)
  for(j in 1:ng)gp[j]=which.max(g0[,1:3]%*%g[j,1:3])
  gridpointer[[i]]=gp
 }
 y=inv; ny=nrow(y); dp=rep(0,ny) #  y is list of HAD grid centres in 3D cartesian
 for(j in 1:ny) dp[j]=which.max(g[,1:3]%*%y[j,])
 datapointer[[i]]=dp  # datapointers into grid i
}
Update: Note the use of which.max here, which is the key instruction locating points in cells. I had originally used which.min, which actually almost worked, because it places points in cells on the opposite side of the globe, and symmetry nearly makes that OK. But not quite. The idea is to minimise the distance, but that is implemented as maximising the scalar product: for unit vectors a and b, |a-b|^2 = 2 - 2a·b, so the nearest centre is the one with the largest scalar product.

The main data loop just loops over months, counting and adding the data in each cell (using datapointer) to form a cell average. Empty cells then inherit values from the parent grid average vector, using gridpointer to find the matching parent cell, so that at each level ave is complete. There is an assumption that the coarsest level has no empty cells. The cell averages are then combined with area weighting (cellarea, from hexmin) to give the monthly average. Then on to the next month. The result is the array monthave[month, year, level] of global averages.
for(I in 1:33)for(J in 1:12){ # looping over months in data from 1980
 if(J==1)print(Sys.time())
 ave=rep(NA,8)  # initialising
 #tab=data.frame(level=ave,Numcells=ave,average=ave)
 g=0
 for(K in 1:8){ # over resolution levels
  ave0=ave
  integrand=c(had[,,J,I])  # HAD 4 anomalies for this month (had covers 1980-2012, so the year index is I)
  area=cellarea[[K]]; 
  cellsum=cellnum=rep(0,length(area))  # initialising
  dp=datapointer[[K]]
  for(i in 1:length(integrand)){  # loop over HAD grid cells ("stations")
   ii=integrand[i]
   if(is.na(ii))next  # no data in cell
   j=dp[i]
   cellsum[j]=cellsum[j]+ii
   cellnum[j]=cellnum[j]+1
  }
  j=which(cellnum==0) # cells without data
  gp=gridpointer[[K]]
  if(K>1)for(i in j){cellnum[i]=1;cellsum[i]=ave0[gp[i]]}
  ave=cellsum/cellnum # cell averages
  Ave=sum(ave*area)/sum(area) # global average (area-weighted)
  if(is.na(Ave))stop("A cell inherits no data")
  monthave[J,I,K] = round(Ave,4) # weighted average
 }
}# end I,J

Data

Moyhuhexmin has the hex cell data and was given in the earlier post. I have put a new zipped ascii version here.



9 comments:

  1. The Cowtan&Way correction is subtle as I found back in 2013

    http://contextearth.com/2013/11/19/csalt-with-cw-hybrid/

  2. Hi Nick

    The difference between your 0.97 and the 1.08 °C/century trend from the paper is subtle. If I krig the distributed HadCRUT4 data, I do indeed get a trend of about 0.97 °C/century.

    However in CW14, we did an extra step, which was to first rebaseline the data to 1980-2010. That makes the difference. Moving the baseline to be close to the period for which you want the trend reduces the impact of changes in coverage during the trend period. (Coverage bias obeys a product rule such that either a change in coverage or a change in the obs-unobs temperature contrast can introduce a change in bias.)

    This was the most important benefit of moving to the version 2 reconstructions, where we reconstruct the land and ocean temperatures separately and then combine. When we do that the trends are pretty much independent of the choice of baseline.

    (That's the gist of it - I can't guarantee I've remembered it all correctly. The updates on the website are more authoritative.)

    1. Thanks, Kevin. Since I'm just testing integration methods here, I think 0.97 is the right one to compare. Now that I've fixed my silly which.min error, it seems to compare pretty well.

  3. You could run it on satellite or reanalysis data with a HadCRUT4-like mask applied to see how well it recreates known trends.

    1. It should recreate any trend in the data because it is essentially the data (apart from any interpolation, but that does not define the trend).

  4. How well does it recreate known results with partial data? Helps to constrain errors, especially for time periods of poor coverage or datasets with few measurements (radiosondes).

    I guess his sin(latitude)^2 experiment accomplishes a similar task, but more simply.

  5. Have you tried spatial averaging on sparse data in the land-ocean data set, eg from 1850-1900? I think everyone agrees that local infilling works much better on post-1950 data sets, but not much analysis has been done on the sparser data in the early record.

    This is important to constrain global T since pre-industrial baseline (say 1850-1879 as in Haustein et al).

    1. Deep,
      I haven't done much with spatial averaging over that period. Most of my regular integration is post 1900. I did do a systematic test of increasing sparsity on a single month, which I described here. Triangular mesh coverage gives integration fairly robust to major thinning of stations. But that is optimised - how well it would survive the particular sparsity of 1850-1900 is another matter. I'll try it.
