geofileops.select#

geofileops.select(input_path: Union[str, os.PathLike[Any]], output_path: Union[str, os.PathLike[Any]], sql_stmt: str, sql_dialect: Optional[Literal['SQLITE', 'OGRSQL']] = 'SQLITE', input_layer: Optional[str] = None, output_layer: Optional[str] = None, columns: Optional[List[str]] = None, explodecollections: bool = False, force_output_geometrytype: Optional[Union[GeometryType, str]] = None, gridsize: float = 0.0, keep_empty_geoms: bool = True, nb_parallel: int = 1, batchsize: int = -1, force: bool = False)#

Execute a SELECT SQL statement on the input file.

By default, the sql_stmt is interpreted in the SQLite dialect (see sql_dialect) and can contain placeholders that will be replaced automatically. More details can be found in the notes and examples below.

The result is written to the output file specified.

Parameters:
  • input_path (PathLike) – the input file

  • output_path (PathLike) – the file to write the result to

  • sql_stmt (str) – the SELECT SQL statement to execute

  • sql_dialect (str, optional) – the SQL dialect to use. If None, the default SQL dialect of the underlying source is used. Defaults to “SQLITE”.

  • input_layer (str, optional) – input layer name. Optional if the input file only contains one layer.

  • output_layer (str, optional) – output layer name. If not specified, a default layer name is used.

  • columns (List[str], optional) – list of columns to retain, if {columns_to_select_str} is used. If None, all standard columns are retained. In addition to standard columns, it is also possible to specify “fid”, a unique index available in all input files. Note that the “fid” will be aliased, e.g. to “fid_1”. Defaults to None. A usage sketch is given after this parameter list.

  • explodecollections (bool, optional) – True to convert all multi-geometries to singular ones. Defaults to False.

  • force_output_geometrytype (GeometryType, optional) – The output geometry type to force. Defaults to None, and then the geometry type of the input is used.

  • gridsize (float, optional) – the size of the grid the coordinates of the output will be rounded to. E.g. 0.001 to keep 3 decimals. Value 0.0 doesn’t change the precision. Defaults to 0.0.

  • keep_empty_geoms (bool, optional) – True to keep rows with empty/null geometries in the output. Defaults to True.

  • nb_parallel (int, optional) – the number of parallel processes to use. If -1, all available cores are used. Defaults to 1. If nb_parallel != 1, make sure your query still returns correct results if it is executed per batch of rows instead of in one go on the entire layer.

  • batchsize (int, optional) – indicative number of rows to process per batch. A smaller batch size, possibly in combination with a smaller nb_parallel, will reduce the memory usage. If batchsize != -1, make sure your query still returns correct results if it is executed per batch of rows instead of in one go on the entire layer. Defaults to -1: (try to) determine optimal size automatically.

  • force (bool, optional) – overwrite existing output file(s). Defaults to False.
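
A minimal sketch of the columns parameter in combination with the {columns_to_select_str} placeholder follows. The file paths and the attribute column “name” are assumptions for illustration only; because the statement is not an f-string, the placeholders are written with single braces.

import geofileops as gfo

# Keep only "fid" and the (assumed) attribute column "name" in the output.
# {columns_to_select_str} expands to the columns listed in `columns`;
# the "fid" column will be aliased in the output, e.g. to "fid_1".
sql_stmt = """
    SELECT {geometrycolumn}
          {columns_to_select_str}
      FROM "{input_layer}" layer
"""
gfo.select(
    input_path="input.gpkg",    # assumed path
    output_path="output.gpkg",  # assumed path
    sql_stmt=sql_stmt,
    columns=["fid", "name"],    # "name" is an assumed column
)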

Notes

By convention, the SQL statement can contain the following placeholders that will be automatically replaced for you:

  • {geometrycolumn}: the column where the primary geometry is stored.

  • {columns_to_select_str}: if columns is not None, those columns, otherwise all columns of the layer.

  • {input_layer}: the layer name of the input layer.

  • {batch_filter}: the filter used to process in parallel per batch.

Hint: often you will want to use f-string formatting on the SQL statement to fill out some parameters of your own as well. You can easily escape the placeholders above by doubling the “{” and “}”, e.g. use {{geometrycolumn}} for {geometrycolumn}. Also check out the example below.

Example: buffer all rows with at least a certain minimum area and write them to the output file.

import geofileops as gfo

minimum_area = 100
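# The doubled braces ({{...}}) escape the geofileops placeholders inside this
# f-string: only {minimum_area} is filled in here, the placeholders themselves
# are replaced later by gfo.select.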
sql_stmt = f"""
    SELECT ST_Buffer({{geometrycolumn}}, 1) AS {{geometrycolumn}}
          {{columns_to_select_str}}
      FROM "{{input_layer}}" layer
     WHERE 1=1
       {{batch_filter}}
       AND ST_Area({{geometrycolumn}}) > {minimum_area}
"""
gfo.select(
    input_path=...,
    output_path=...,
    sql_stmt=sql_stmt,
)

Some important remarks:

  • The name of the geometry column depends on the file format of the input file. E.g. for .shp files the column will be called “geometry”, for .gpkg files the default name is “geom”. If you use the {geometrycolumn} placeholder, geofileops will replace it with the correct column name for the input file.

  • If you apply (spatialite) functions on the geometry column, always alias the result back to the original column name, e.g. with “AS {geometrycolumn}”.

  • Some SQL statements won’t give correct results when parallelized/run in multiple batches, e.g. when using a group by statement. This is why the default value for nb_parallel is 1. If you want to parallelize or run the query in multiple batches (by specifying nb_parallel != 1 or batchsize > 0), you should make sure your query will give correct results if it is executed per batch of rows instead of once on the entire layer. Additionally, if you do so, make sure to include the placeholder {batch_filter} in your sql_stmt. This placeholder will be replaced with a filter of the form “AND rowid >= x AND rowid < y” and will ensure every row is only treated once. A sketch of such a batch-safe query is given after this list.

  • Table names are best double-quoted, as in the example, because some characters (e.g. “-”) are otherwise not supported in table names.

  • It is recommended to give the table you select from the alias “layer”. If you use the {batch_filter} placeholder, this is even mandatory.

  • When using the (default) “SQLITE” SQL dialect, you can also use the spatialite functions as documented in the spatialite reference.

  • It is supported to use an attribute table (a table without a geometry column) as input layer and/or to not include the geometry column in the selected columns. Note though that if the {columns_to_select_str} placeholder is used, it will start with a “,”, so if no column precedes it the SQL statement will be invalid.
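
As a concrete illustration of the remark on parallel processing, the following is a minimal sketch of a query that stays correct when run in batches: the buffer is computed row by row, so splitting the rows with {batch_filter} does not change the result. The file paths and the buffer distance of 10 are assumptions.

import geofileops as gfo

# Row-wise operation, so the result is the same whether it runs in one go
# or per batch. {batch_filter} is replaced per batch with a filter of the
# form "AND rowid >= x AND rowid < y"; the alias "layer" is mandatory here.
sql_stmt = """
    SELECT ST_Buffer({geometrycolumn}, 10) AS {geometrycolumn}
          {columns_to_select_str}
      FROM "{input_layer}" layer
     WHERE 1=1
       {batch_filter}
"""
gfo.select(
    input_path="parcels.gpkg",    # assumed path
    output_path="buffered.gpkg",  # assumed path
    sql_stmt=sql_stmt,
    nb_parallel=-1,               # use all available cores
)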