Invalid Geometries

Every GIS tech has run into these annoying demons.  They can be puzzling because the geospatial world uses some geometries that normal people may not know about.  They are often called geoms, too, just to be extra confusing.  All of them are constructed from a list of one or more points like a connect-the-dots drawing.  This concept is called vector graphics because the word vector means "carrier" in Latin — the shape is "carried" from point to point.  How this string of points is stored is what matters.  The order and position of points are what cause problems.

This post will rely on the Open Geospatial Consortium (OGC) for its definition of validity but, to understand what's really going on, let's keep it simple.  Although geometries are normally stored and processed as huge blobs of 1s and 0s, they can also be expressed as text.  KML, GeoJSON and WKT are common formats for doing that.  The WKT syntax (Well-Known Text) is very readable and should help clarify the situation.  Here's how it works:

Each point or vertex in a 2-D geometry is a Cartesian coordinate expressed as an X and Y value.  The X value always comes first.  (Geographers tend to think "lat-long" but that's backwards here: X is the longitude and Y is the latitude.)  There is a space between the X and Y values and commas delimit the individual vertices.  Each segment of the geometry is enclosed in parentheses and the entire thing is preceded by the name of the geometry type.  The primitive types are pretty easy:

POINT (30 10)
LINESTRING (30 10, 10 30, 40 40)
POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))

A LINESTRING is just a line made from a string of points.  A POLYGON is enclosed in double parentheses and its final point is the same as its first.  This repeated point closes the polygon.  The double parentheses might seem redundant until you encounter a polygon with a hole in it.  That looks like this:

POLYGON ((35 10, 45 45, 15 40, 10 20, 35 10), (20 30, 35 35, 30 20, 20 30))
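To see why the double parentheses matter, here is a minimal Python sketch that splits these simple 2-D WKT strings into plain coordinate lists.  The function name parse_wkt is made up for this post, it only understands POINT, LINESTRING and POLYGON, and a real project would be better served by an established WKT library.

```python
import re

def parse_wkt(wkt):
    """Parse a 2-D POINT, LINESTRING or POLYGON into (type, coordinates).

    Hypothetical helper for illustration only; not a full WKT parser.
    """
    match = re.fullmatch(r"\s*([A-Za-z]+)\s*\((.*)\)\s*", wkt, re.DOTALL)
    if match is None:
        raise ValueError(f"not a recognisable WKT string: {wkt!r}")
    geom_type, body = match.group(1).upper(), match.group(2)

    def parse_points(text):
        # Commas separate vertices; whitespace separates X from Y.
        return [tuple(float(n) for n in vertex.split())
                for vertex in text.split(",")]

    if geom_type == "POINT":
        return geom_type, parse_points(body)[0]
    if geom_type == "LINESTRING":
        return geom_type, parse_points(body)
    if geom_type == "POLYGON":
        # Each ring sits in its own inner pair of parentheses.
        rings = re.findall(r"\(([^()]*)\)", body)
        return geom_type, [parse_points(ring) for ring in rings]
    raise ValueError(f"unsupported geometry type: {geom_type}")


kind, rings = parse_wkt(
    "POLYGON ((35 10, 45 45, 15 40, 10 20, 35 10), (20 30, 35 35, 30 20, 20 30))"
)
print(kind, len(rings))  # POLYGON 2
```

The first list that comes back is the outside boundary and any lists after it are holes.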

Unlike points and lines, polygons have an interior.  They are always bounded by an outside perimeter and can have zero to many internal voids.  The lists of points (or vertices) that define the external boundary and any internal boundaries are called rings.  This is where the order of the points becomes important.  The external ring should be in counter-clockwise order.  Any internal rings should be in clockwise order.  This preserves the convention that the inside of the polygon is always on the left as each ring is drawn.  This clarifies where the polygon should be filled and where it should not.
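Ring direction is easy to check with the shoelace formula, which gives a positive area for a counter-clockwise ring and a negative area for a clockwise one.  The sketch below uses the two rings from the polygon with a hole; the function names are my own, and it assumes closed rings drawn on ordinary axes where Y increases upward.

```python
def signed_area(ring):
    """Shoelace formula: positive for counter-clockwise, negative for clockwise.

    Assumes the ring is closed (first point repeated as the last point).
    """
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_counter_clockwise(ring):
    return signed_area(ring) > 0


exterior = [(35, 10), (45, 45), (15, 40), (10, 20), (35, 10)]
hole = [(20, 30), (35, 35), (30, 20), (20, 30)]

print(signed_area(exterior))  # 775.0, counter-clockwise
print(signed_area(hole))      # -100.0, clockwise
```

Reversing the list of points flips the sign, which is all it takes to correct a ring that was drawn the wrong way around.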

Multi-part geometries are also common and they look like this:

MULTIPOINT ((10 40), (40 30), (20 20), (30 10))
MULTILINESTRING ((10 10, 20 20, 10 40), (40 40, 30 30, 40 20, 30 10))
MULTIPOLYGON (((30 20, 45 40, 10 40, 30 20)), ((15 5, 40 10, 10 20, 5 10, 15 5)))

These are all single geometry objects with multiple parts.  The same format and rules that apply to primitives apply to each part of a multi-geometry.  Polygons are actually the only geometry type that can truly be invalid.  Here are the rules set by the OGC standard:

  • all rings must close
  • interior rings must be entirely enclosed by the exterior ring
  • rings may not cross over each other nor themselves
  • rings may only touch each other at a single point
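The first two rules can be roughed out in a few lines of Python.  This is only a sketch with made-up function names: the containment test uses ray casting on the hole's vertices, which is a necessary check but not a sufficient one (edges could still cross between vertices), and points lying exactly on the boundary are not handled carefully.

```python
def is_closed(ring):
    """A ring closes when it has at least four points and ends where it starts."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def point_in_ring(point, ring):
    """Ray casting: count the ring edges crossed by a ray heading right from the point."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside


def hole_vertices_inside(exterior, hole):
    """True when every vertex of the hole falls inside the exterior ring."""
    return all(point_in_ring(p, exterior) for p in hole)


exterior = [(35, 10), (45, 45), (15, 40), (10, 20), (35, 10)]
hole = [(20, 30), (35, 35), (30, 20), (20, 30)]

print(is_closed(exterior), is_closed(hole))  # True True
print(hole_vertices_inside(exterior, hole))  # True
```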

A ring that crosses over itself is self-intersecting.  This is another case of points being out of order.  The example below shows a self-intersecting polygon: the edge from (25 15) to (35 35) crosses the closing edge from (20 30) to (35 10).  Was this supposed to have been an indentation in the exterior ring, or a separate interior ring with another vertex at the intersection?

POLYGON ((35 10, 45 45, 15 40, 10 20, 25 15, 35 35, 20 30, 35 10))
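A brute-force way to find a crossing like this is to test every pair of non-adjacent edges in the ring.  The Python sketch below does that with the usual cross-product orientation test; the function names are my own, it only reports edges that cleanly cross, and it ignores touches and collinear overlaps, so it is not a full validity check.

```python
def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left turn, -1 right turn, 0 collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def segments_cross(a, b, c, d):
    """True when segments ab and cd cross at a single interior point."""
    return (orientation(a, b, c) * orientation(a, b, d) < 0
            and orientation(c, d, a) * orientation(c, d, b) < 0)


def find_self_intersections(ring):
    """Return (i, j) index pairs of ring edges that cross each other."""
    edges = list(zip(ring, ring[1:]))
    crossings = []
    for i in range(len(edges)):
        for j in range(i + 2, len(edges)):  # neighbouring edges only share a vertex
            if segments_cross(*edges[i], *edges[j]):
                crossings.append((i, j))
    return crossings


ring = [(35, 10), (45, 45), (15, 40), (10, 20),
        (25, 15), (35, 35), (20, 30), (35, 10)]

print(find_self_intersections(ring))  # [(4, 6)]
```

Counting from zero, edge 4 runs from (25 15) to (35 35) and edge 6 is the closing edge from (20 30) back to (35 10).  Comparing every pair gets slow on rings with many vertices, so real tools use smarter algorithms.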

If you think of a polygon as a slice of Swiss cheese, these rules make perfect sense.  All of the edges, inside and out, have to be complete.  The holes can't overlap or be twisted.  And a hole that overlaps the outside edge isn't really a hole, it's a part of the edge.  A polygon that tries to do otherwise is obviously invalid!

I should mention that WKT supports several additional geometry types including curves, circles and regular polygons in which control parameters define the shape rather than individual vertices — but that is beside the point (so to speak).  WKT is only being used here to illustrate the vector concept.

Lines can also suffer from self-intersections.  Although this is allowed by the standard, it may not be desirable.  Spatial functions and calculations are fairly demanding and malformed data can cause inaccurate results, hinder performance, bloat log files and hang or crash applications.  Such things are also very difficult to replicate and troubleshoot.

An invalid geometry error is actually a good thing.  It's a clear signal that something needs to be fixed and the remedy can be fairly painless.  QGIS is a great tool for the task.  It's a free, open-source GIS editor that supports most common geospatial databases and file formats.  It includes a geometry checker that can find self-intersections and duplicate vertices, and it will show you exactly which ones need attention.  QGIS also has a topology checker which can identify issues with the relationships between different geometries and different layers including duplicates, gaps, overlaps and other misalignments.  The editing tools are quite nice, too.

If you are working with a database layer, you have some other options as well.  PostGIS has a wonderful ST_MakeValid() function that attempts to create a valid representation from an invalid geometry without losing any vertices.  PostGIS can also provide preventative protection with table constraints to enforce consistent type, projection and dimensionality.

Regardless of your toolset or technology, maintaining a valid and well-formed dataset can be critical to your success.  Geospatial datatypes are no different in that regard, just cheesier.


Posted in Database, Open Source GIS, Techniques on Jun 08, 2022.