xarray Guide: Combining Data - Combining Along Multiple Dimensions

April 16, 2020 (last modified: August 21, 2021)

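This article is a translation of part of the "Combining data" section of the official xarray documentation.

It describes how to combine data along multiple dimensions with xarray.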

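Note

There are currently three combining functions with similar names: auto_combine(), combine_by_coords(), and combine_nested(). This is because auto_combine is deprecated in favour of the other two, more general functions. If your code currently relies on auto_combine, you can get similar functionality from combine_nested.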

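Combining along multiple dimensions

To combine many objects along multiple dimensions, xarray provides combine_nested() and combine_by_coords(). These functions use a combination of concat and merge across different variables to combine many objects into one.

combine_nested() requires you to specify the order in which the objects should be combined, while combine_by_coords() attempts to infer that order automatically from the coordinates in the data.

combine_nested() is useful when you know the spatial relationship between the objects in advance. The datasets must be provided as a nested list whose structure indicates their relative positions and order. A common task is collecting data from a parallel simulation in which each processor wrote its output to a separate file. A domain decomposed into 4 parts, 2 along each of the x and y axes, requires organising the datasets into a doubly nested list, for example: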

import numpy as np
import xarray as xr

arr = xr.DataArray(
    name='temperature', 
    data=np.random.randint(5, size=(2, 2)), 
    dims=['x', 'y'],
)
arr

<xarray.DataArray 'temperature' (x: 2, y: 2)>
array([[3, 4],
       [0, 3]])
Dimensions without coordinates: x, y

ds_grid = [[arr, arr], [arr, arr]]
xr.combine_nested(ds_grid, concat_dim=['x', 'y'])

<xarray.DataArray 'temperature' (x: 4, y: 4)>
array([[3, 4, 3, 4],
       [0, 3, 0, 3],
       [3, 4, 3, 4],
       [0, 3, 0, 3]])
Dimensions without coordinates: x, y
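
Because the example above reuses the same array four times, the result does not show which tile ends up where. The following is a minimal sketch, assuming a hypothetical helper `tile` that fills each 2x2 block with a distinct constant, so the mapping from list position to grid position is visible: the outer list corresponds to the first entry of concat_dim ('x') and the inner lists to the second ('y').

```python
import numpy as np
import xarray as xr


def tile(value):
    """Hypothetical helper: a 2x2 block filled with a single constant."""
    return xr.DataArray(
        name="temperature",
        data=np.full((2, 2), value),
        dims=["x", "y"],
    )


# grid[i][j] is the tile at block position i along x and j along y.
grid = [[tile(0), tile(1)], [tile(2), tile(3)]]
combined = xr.combine_nested(grid, concat_dim=["x", "y"])
print(combined.values)
```

Tiles 0 and 1 share the first two rows (the same x range) and tiles 2 and 3 the last two, so swapping entries in the nested list moves the blocks in the combined array accordingly.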

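combine_nested() can also be used to explicitly merge datasets that contain different variables. For example, if we have 4 datasets that are split across two time periods and contain two different variables, we can pass None in concat_dim to mark the level of the nested list along which we want merge to be used instead of concat: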

temp = xr.DataArray(
    name='temperature', 
    data=np.random.randn(2), 
    dims=['t'],
)
temp

<xarray.DataArray 'temperature' (t: 2)>
array([ 0.13493113, -0.42346127])
Dimensions without coordinates: t

precip = xr.DataArray(
    name='precipitation', 
    data=np.random.randn(2), 
    dims=['t'],
)
precip

<xarray.DataArray 'precipitation' (t: 2)>
array([ 0.67876484, -1.57851655])
Dimensions without coordinates: t

ds_grid = [[temp, precip], [temp, precip]]
xr.combine_nested(ds_grid, concat_dim=['t', None])

<xarray.Dataset>
Dimensions:        (t: 4)
Dimensions without coordinates: t
Data variables:
    temperature    (t) float64 0.1349 -0.4235 0.1349 -0.4235
    precipitation  (t) float64 0.6788 -1.579 0.6788 -1.579
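
To make the role of None more concrete, the sketch below builds the same kind of result by hand with xr.merge and xr.concat and compares it with combine_nested. The fixed input values are chosen here for illustration only, and the manual steps are just one way to reach an equivalent result, not a description of how xarray works internally.

```python
import numpy as np
import xarray as xr

# Illustrative inputs with fixed values instead of random ones.
temp = xr.DataArray(name="temperature", data=np.array([1.0, 2.0]), dims=["t"])
precip = xr.DataArray(name="precipitation", data=np.array([10.0, 20.0]), dims=["t"])

# By hand: merge the two variables of each time period, then concatenate along t.
periods = [xr.merge([temp, precip]) for _ in range(2)]
manual = xr.concat(periods, dim="t")

# The same layout expressed as a nested list: outer level is 't', inner level is merged.
nested = xr.combine_nested(
    [[temp, precip], [temp, precip]],
    concat_dim=["t", None],
)

print(manual.equals(nested))
print(nested["temperature"].values)
```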

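combine_by_coords() is meant for combining objects that have dimension coordinates which specify their relationship to, and order relative to, one another, for example a linearly increasing "time" dimension coordinate.

Here we combine two datasets using their common dimension coordinate. Note that they are combined in the order given by the dimension coordinate, not by their position in the list passed to combine_by_coords.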

x1 = xr.DataArray(
    name='foo', 
    data=np.random.randn(3), 
    coords=[('x', [0, 1, 2])]
)
x2 = xr.DataArray(
    name='foo', 
    data=np.random.randn(3), 
    coords=[('x', [3, 4, 5])]
)
xr.combine_by_coords([x2, x1])

<xarray.Dataset>
Dimensions:  (x: 6)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5
Data variables:
    foo      (x) float64 -0.3269 -0.9649 0.2854 -1.323 1.458 0.8652

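By contrast, combine_nested concatenates the objects in the order in which they appear in the list, so the x coordinate is not sorted:

xr.combine_nested([x2, x1], concat_dim=["x"])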

<xarray.DataArray 'foo' (x: 6)>
array([-1.3227153 ,  1.4577656 ,  0.86516473, -0.32694684, -0.96494207,
        0.28541309])
Coordinates:
  * x        (x) int64 3 4 5 0 1 2

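open_mfdataset() uses these functions to open many files as a single dataset. The function used is selected by setting the combine argument to 'by_coords' or 'nested'. This is useful when the data is spread across many files in many locations that have some known relationship to one another.

In practice

Load temperature field data from GRIB 2 files for 3 forecast lead times.

Get the list of files.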

import pandas as pd
import xarray as xr
from nwpc_data.data_finder import find_local_file

hours = [pd.Timedelta(hours=h) for h in (0, 3, 6)]
file_paths = [
    find_local_file(
        "grapes_gfs_gmf/grib2/orig",
        start_time="2020031800",
        forecast_time=h
    ) for h in hours
]
for p in file_paths:
    print(p)

/sstorage1/COMMONDATA/OPER/NWPC/GRAPES_GFS_GMF/Prod-grib/2020031721/ORIG/gmf.gra.2020031800000.grb2
/sstorage1/COMMONDATA/OPER/NWPC/GRAPES_GFS_GMF/Prod-grib/2020031721/ORIG/gmf.gra.2020031800003.grb2
/sstorage1/COMMONDATA/OPER/NWPC/GRAPES_GFS_GMF/Prod-grib/2020031721/ORIG/gmf.gra.2020031800006.grb2

Load the files in one call with xr.open_mfdataset, using the nested mode to concatenate along the step dimension.

ds = xr.open_mfdataset(
    file_paths,
    engine="cfgrib",
    backend_kwargs={
        "filter_by_keys": {
            "shortName": "t", 
            "typeOfLevel": "isobaricInhPa"
        }, 
        'indexpath': ''
    },
    combine='nested',
    concat_dim='step',
)
ds

<xarray.Dataset>
Dimensions:        (isobaricInhPa: 36, latitude: 720, longitude: 1440, step: 3)
Coordinates:
    time           datetime64[ns] 2020-03-18
  * isobaricInhPa  (isobaricInhPa) int64 1000 975 950 925 900 850 ... 5 4 3 2 1
  * latitude       (latitude) float64 89.88 89.62 89.38 ... -89.38 -89.62 -89.88
  * longitude      (longitude) float64 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
  * step           (step) timedelta64[ns] 00:00:00 03:00:00 06:00:00
    valid_time     (step) datetime64[ns] 2020-03-18 ... 2020-03-18T06:00:00
Data variables:
    t              (step, isobaricInhPa, latitude, longitude) float32 dask.array<chunksize=(1, 36, 720, 1440), meta=np.ndarray>
Attributes:
    GRIB_edition:            2
    GRIB_centre:             babj
    GRIB_centreDescription:  Beijing 
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             Beijing 
    history:                 2020-04-16T13:47:31 GRIB to CDM+CF via cfgrib-0....

The author has not yet found a parameter configuration that loads multiple lead times for multiple forecast start times.

References
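
One direction that may be worth trying is a doubly nested list with one concat dimension per level. The sketch below is an assumption-laden illustration on synthetic in-memory data, not a tested GRIB workflow: `make_field` is a hypothetical helper that mimics a single-file dataset with scalar time and step coordinates, and the nested list is combined with concat_dim=["time", "step"]. open_mfdataset also accepts a nested list of paths together with combine='nested', but whether this works with cfgrib files has not been verified here.

```python
import numpy as np
import xarray as xr


def make_field(start, step_hours):
    """Hypothetical stand-in for one file: scalar 'time' and 'step' coordinates."""
    return xr.Dataset(
        data_vars={"t": (("latitude",), np.full(3, float(step_hours)))},
        coords={
            "latitude": [10.0, 0.0, -10.0],
            "time": np.datetime64(start, "ns"),
            "step": np.timedelta64(step_hours, "h"),
        },
    )


starts = ["2020-03-18T00", "2020-03-18T12"]
steps = [0, 3, 6]

# Outer list: one entry per start time. Inner lists: one entry per lead time.
grid = [[make_field(s, h) for h in steps] for s in starts]
combined = xr.combine_nested(grid, concat_dim=["time", "step"])
print(dict(combined.sizes))
```

With real files, the analogous call would pass a list of lists of paths as the first argument of xr.open_mfdataset and the same two-element concat_dim; coordinates that depend on both time and step, such as valid_time, may need extra care.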

http://xarray.pydata.org/en/stable/combining.html

https://github.com/nwpc-oper/nwpc-data