geofileops.select_two_layers#

geofileops.select_two_layers(input1_path: Union[str, os.PathLike[Any]], input2_path: Union[str, os.PathLike[Any]], output_path: Union[str, os.PathLike[Any]], sql_stmt: str, input1_layer: Optional[str] = None, input1_columns: Optional[List[str]] = None, input1_columns_prefix: str = 'l1_', input2_layer: Optional[str] = None, input2_columns: Optional[List[str]] = None, input2_columns_prefix: str = 'l2_', output_layer: Optional[str] = None, explodecollections: bool = False, force_output_geometrytype: Optional[GeometryType] = None, gridsize: float = 0.0, where_post: Optional[str] = None, nb_parallel: int = 1, batchsize: int = -1, force: bool = False)#

Execute a SELECT SQL statement on the input files.

The sql_stmt must be in SQLite dialect and can contain placeholders that will be replaced automatically. More details can be found in the notes and examples below.

The result is written to the output file specified.

Parameters:
  • input1_path (PathLike) – the 1st input file.

  • input2_path (PathLike) – the 2nd input file.

  • output_path (PathLike) – the file to write the result to.

  • sql_stmt (str) – the SELECT SQL statement to be executed. Must be in SQLite dialect.

  • input1_layer (str, optional) – input layer name. Optional if the file only contains one layer. Defaults to None.

  • input1_columns (List[str], optional) – list of columns to retain if one of the {layer1_columns_…} placeholders is used in sql_stmt. If None, all standard columns are retained. In addition to standard columns, it is also possible to specify “fid”, a unique index available in all input files. Note that the “fid” will be aliased even if input1_columns_prefix is “”, eg. to “fid_1”. Defaults to None.

  • input1_columns_prefix (str, optional) – prefix to use in the column aliases. Defaults to “l1_”.

  • input2_layer (str, optional) – input layer name. Optional if the file only contains one layer. Defaults to None.

  • input2_columns (List[str], optional) – list of columns to retain if one of the {layer2_columns_…} placeholders is used in sql_stmt. If None is specified, all columns are selected. As explained for input1_columns, it is also possible to specify “fid”. Defaults to None.

  • input2_columns_prefix (str, optional) – prefix to use in the column aliases. Defaults to “l2_”.

  • output_layer (str, optional) – output layer name. If None, the output_path stem is used. Defaults to None.

  • explodecollections (bool, optional) – True to convert all multi-geometries to singular ones after the dissolve. Defaults to False.

  • force_output_geometrytype (GeometryType, optional) – The output geometry type to force. Defaults to None, and then the geometry type of the input1 layer is used.

  • gridsize (float, optional) – the size of the grid the coordinates of the ouput will be rounded to. Eg. 0.001 to keep 3 decimals. Value 0.0 doesn’t change the precision. Defaults to 0.0.

  • where_post (str, optional) – SQL filter to apply after all other processing, including e.g. explodecollections. It should be in sqlite syntax and spatialite reference functions can be used. Defaults to None.

  • nb_parallel (int, optional) – the number of parallel processes to use. If -1, all available cores are used. Defaults to 1. If nb_parallel != 1, make sure your query still returns correct results if it is executed per batch of rows instead of in one go on the entire layer.

  • batchsize (int, optional) – indicative number of rows to process per batch. A smaller batch size, possibly in combination with a smaller nb_parallel, will reduce the memory usage. Defaults to -1: (try to) determine optimal size automatically.

  • force (bool, optional) – overwrite existing output file(s). Defaults to False.

Notes

By convention, the sql_stmt can contain following placeholders that will be automatically replaced for you:

  • {input1_layer}: name of input layer 1

  • {input1_geometrycolumn}: name of input geometry column 1

  • {layer1_columns_prefix_str}: komma seperated columns of layer 1, prefixed with “layer1”

  • {layer1_columns_prefix_alias_str}: komma seperated columns of layer 1, prefixed with “layer1” and with column name aliases

  • {layer1_columns_from_subselect_str}: komma seperated columns of layer 1, prefixed with “sub”

  • {input1_databasename}: the database alias for input 1

  • {input2_layer}: name of input layer 1

  • {input2_geometrycolumn}: name of input geometry column 2

  • {layer2_columns_prefix_str}: komma seperated columns of layer 2, prefixed with “layer2”

  • {layer2_columns_prefix_alias_str}: komma seperated columns of layer 2, prefixed with “layer2” and with column name aliases

  • {layer2_columns_from_subselect_str}: komma seperated columns of layer 2, prefixed with “sub”

  • {layer2_columns_prefix_alias_null_str}: komma seperated columns of layer 2, but with NULL for all values and with column aliases

  • {input2_databasename}: the database alias for input 2

  • {batch_filter}: the filter to be applied per batch when using parallel processing

Example: left outer join all features in input1 layer with all rows in input2 on join_id.

import geofileops as gfo

minimum_area = 100
sql_stmt = f"""
    SELECT layer1.{{input1_geometrycolumn}}
          {{layer1_columns_prefix_alias_str}}
          {{layer2_columns_prefix_alias_str}}
      FROM {{input1_databasename}}."{{input1_layer}}" layer1
      LEFT OUTER JOIN {{input2_databasename}}."{{input2_layer}}" layer2
           ON layer1.join_id = layer2.join_id
     WHERE 1=1
       {{batch_filter}}
       AND ST_Area(layer1.{{input1_geometrycolumn}}) > {minimum_area}
"""
gfo.select_two_layers(
    input1_path=...,
    input2_path=...,
    output_path=...,
    sql_stmt=sql_stmt,
)

Some important remarks:

  • Because some SQL statements won’t give the same result when parallelized (eg. when using a group by statement), nb_parallel is 1 by default. If you do want to use parallel processing, specify nb_parallel + make sure to include the placeholder {batch_filter} in your sql_stmt. This placeholder will be replaced with a filter of the form “AND rowid >= x AND rowid < y”.

  • Table names are best double quoted as in the example, because some characters are otherwise not supported in the table name, eg. “-“.

  • When using supported placeholders, make sure you give the tables you select from the appropriate table aliases (layer1, layer2).

  • Besides the standard sqlite SQL syntacs, you can use the spatialite functions as documented here: spatialite reference

  • It is supported to use attribute tables (= table without geometry column) as input layers and/or not to include the geometry column in the selected columns. Note though that if the column placeholders are used (e.g. {layer1_columns_prefix_str}), they will start with a “,” and if no column precedes it the SQL statement will be invalid.

Examples

An ideal place to get inspiration to write you own advanced queries is in the following source code file: geofileops_sql.py.

Additionally, there are some examples listed here that highlight other features/possibilities.

Join nearest features with filter

To join nearest features, geofileops has a specific join_nearest() function. This provides a fast way to find the nearest feature(s) if there doesn’t need to be a filter on the features to be found.

For a use case where for each element in layer 1, you want to find the nearest features in layer 2 while applying a filter that eliminates many kandidates, the query below will be a better solution.

Note: Using MIN(ST_Distance(layer1.geom, layer2.geom) sometimes seems to round the distances calculated slightly resulting in some nearest features not being found. Using RANK avoids this issue.

sql_stmt = f"""
    WITH join_with_dist AS (
        SELECT layer1.{{input1_geometrycolumn}} AS geom
              {{layer1_columns_prefix_alias_str}}
              {{layer2_columns_prefix_alias_str}}
              ,ST_Distance(
                    layer1.{{input1_geometrycolumn}},
                    layer2.{{input2_geometrycolumn}}
               ) AS distance
              ,RANK() OVER ( PARTITION BY layer1.rowid ORDER BY ST_Distance(
                                layer1.{{input1_geometrycolumn}},
                                layer2.{{input2_geometrycolumn}}
                           )
               ) AS pos
          FROM {{input1_databasename}}."{{input1_layer}}" layer1
          JOIN {{input2_databasename}}."{{input2_layer}}" layer2
            ON layer1.join_id = layer2.join_id
         WHERE 1=1
           {{batch_filter}}
    )
    SELECT *
      FROM join_with_dist jwd
     WHERE pos = 1
"""