GeoIP Geographic Reverse Lookup Optimization

Ruohang Feng (Pigsty Founder, @Vonng)

Efficient implementation of IP geolocation lookups

In application development, a "very common" requirement is GeoIP conversion: mapping the source IP address of a request to geographic coordinates or an administrative division (country-state-city-county-town-village). This has many uses, such as analyzing the geographic sources of website traffic, or doing some shadier things. With PostgreSQL, this requirement can be met elegantly and efficiently, with high performance at low cost.


0x00 Approach and Methods

Usually, IP geographic databases on the internet are in the format: start_ip, stop_ip, longitude, latitude, with some additional attribute fields like country codes, city codes, postal codes, etc. It looks roughly like this:

Column        Type
------------  ----
start_ip      text
stop_ip       text
longitude     text
latitude      text
country_code  text
……            text

Essentially, the core is mapping from IP address ranges to geographic coordinate points.

A typical query provides an IP address and returns the record whose range contains that address. Expressed in SQL, the logic looks roughly like this:

SELECT longitude, latitude FROM geoip 
WHERE start_ip <= target_ip AND target_ip <= stop_ip;

However, to provide direct service, several issues need to be resolved:

  • First issue: Although an IPv4 address is really a uint32, we are thoroughly used to its dotted text form, e.g. 123.123.123.123, and that text form does not compare in numeric order.
  • Second issue: The IP range here is represented by two IP boundary fields, so is this range an open or closed interval? Do we need an additional field to represent this?
  • Third issue: For efficient querying, how should indexes on two fields be established?
  • Fourth issue: We want all IP segments to not overlap with each other, but a simple unique constraint on (start_ip, stop_ip) cannot guarantee this - what should we do?

Fortunately, for PostgreSQL, these are not problems. The four issues above can be easily solved using PostgreSQL features.

  • Network data types: High-performance, compact, flexible network address representation.
  • Range types: Good abstraction for intervals, good support for interval queries and operations.
  • GiST indexes: Can be applied to both IP address ranges and geographic location points.
  • Exclude constraints: Generalized advanced UNIQUE constraints that fundamentally ensure data integrity.

0x01 Network Address Types

PostgreSQL provides data types for storing IPv4, IPv6, and MAC addresses, including cidr, inet, and macaddr, along with many common operation functions, eliminating the need to implement tedious repetitive functionality in programs.

The most common network address is the IPv4 address, which corresponds to PostgreSQL's built-in inet type. The inet type can store an IPv4 or IPv6 address, optionally with a subnet mask. The detailed operations are covered in the documentation and won't be repeated here.

One point to note: although we know an IPv4 address is essentially an unsigned integer, storing it as INTEGER in the database doesn't actually work, because the SQL standard has no unsigned types, so half of all IP addresses would be interpreted as negative numbers and produce surprising results in comparisons. If you really want to store it numerically, use BIGINT. Even then, staring at a pile of long integers is a headache, so inet is the best choice.
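To see the overflow concretely: 192.168.1.1 corresponds to 3232235777, which exceeds the signed 32-bit maximum of 2147483647 (the address values here are just illustrative):

-- the upper half of the IPv4 space does not fit in a signed INTEGER
SELECT 3232235777::INTEGER;          -- ERROR:  integer out of range
SELECT 3232235777::BIGINT;           -- fine
SELECT inet '0.0.0.0' + 3232235777;  -- 192.168.1.1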

If you need to convert between IP addresses (inet type) and corresponding integers, just perform addition and subtraction with 0.0.0.0; you can also use the following functions and create a type conversion to directly convert between inet and bigint:

-- inet to bigint
CREATE FUNCTION inet2int(inet) RETURNS bigint AS $$
SELECT $1 - inet '0.0.0.0';
$$ LANGUAGE SQL IMMUTABLE RETURNS NULL ON NULL INPUT;

-- bigint to inet
CREATE FUNCTION int2inet(bigint) RETURNS inet AS $$
SELECT inet '0.0.0.0' + $1;
$$ LANGUAGE SQL IMMUTABLE RETURNS NULL ON NULL INPUT;

-- create type conversions
CREATE CAST (inet AS bigint) WITH FUNCTION inet2int(inet);
CREATE CAST (bigint AS inet) WITH FUNCTION int2inet(bigint);

-- test
SELECT 123456::BIGINT::INET;
SELECT '1.2.3.4'::INET::BIGINT;

-- Generate random IP addresses
SELECT (random() * 4294967295)::BIGINT::INET;

Comparing inet values is also straightforward: the standard comparison operators work directly, and the comparison follows the underlying integer value. This solves the first problem.
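A quick illustration of how text comparison and inet comparison disagree (addresses are arbitrary examples):

SELECT '9.9.9.9' < '10.0.0.1';            -- f: compared as text, '9' > '1'
SELECT inet '9.9.9.9' < inet '10.0.0.1';  -- t: compared as addresses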


0x02 Range Types

PostgreSQL’s Range types are a very practical feature. Like arrays, they belong to a generic type. Any data type that can be B-tree indexed (can be compared for size) can serve as the base type for range types. They’re particularly suitable for representing intervals: integer intervals, time intervals, IP address ranges, etc. They have relatively detailed consideration for open intervals, closed intervals, and interval indexing issues.

PostgreSQL has built-in predefined int4range, int8range, numrange, tsrange, tstzrange, daterange that are ready to use out of the box. But it doesn’t provide range types corresponding to network addresses, though creating one yourself is very simple:

CREATE TYPE inetrange AS RANGE(SUBTYPE = inet);

Of course, to efficiently support GiST index queries, you also need to implement a distance metric that tells the index how to calculate the distance between two inet values:

-- Define distance metric between basic types
CREATE FUNCTION inet_diff(x INET, y INET) RETURNS FLOAT AS $$
  SELECT (x - y) :: FLOAT;
$$ LANGUAGE SQL IMMUTABLE STRICT;

-- Recreate the inetrange type using the newly defined distance metric
-- (the previously created type must be dropped first)
DROP TYPE IF EXISTS inetrange;
CREATE TYPE inetrange AS RANGE(
  SUBTYPE = inet,
  SUBTYPE_DIFF = inet_diff
);

Fortunately, the distance between two network addresses has a natural and very simple definition: just subtract one from the other.

This newly defined type is also simple to use, with constructor functions automatically generated:

geo=# select misc.inetrange('64.60.116.156','64.60.116.161','[)');
inetrange | [64.60.116.156,64.60.116.161)

geo=# select '[64.60.116.156,64.60.116.161]'::inetrange;
inetrange | [64.60.116.156,64.60.116.161]

Square brackets and round brackets represent closed and open intervals respectively, consistent with mathematical notation.

Also, detecting whether an IP address falls within a given IP range is quite straightforward:

geo=# select '[64.60.116.156,64.60.116.161]'::inetrange @> '64.60.116.160'::inet as res;
res | t

With range types, we can start building our data table.


0x03 Range Indexes

Incidentally, finding the IP-to-geography data took me over an hour; implementing the lookup itself took only a few minutes.

Assuming we already have such data:

create table geoips
(
  ips          inetrange,
  geo          geometry(Point),
  country_code text,
  region_code  text,
  city_name    text,
  ad_code      text,
  postal_code  text
);

The data inside looks roughly like this:

SELECT ips, ST_AsText(geo) AS geo, country_code FROM geoips;

 [64.60.116.156,64.60.116.161] | POINT(-117.853 33.7878) | US
 [64.60.116.139,64.60.116.154] | POINT(-117.853 33.7878) | US
 [64.60.116.138,64.60.116.138] | POINT(-117.76 33.7081)  | US
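Raw GeoIP dumps usually arrive as CSV with separate start/stop address columns, so one way to populate geoips is through a staging table. This is a sketch; the staging table name and its columns are assumptions about what your source data looks like:

-- Hypothetical staging table mirroring a raw CSV layout
CREATE TABLE geoips_raw (
  start_ip     text,
  stop_ip      text,
  longitude    float8,
  latitude     float8,
  country_code text
);

-- Build closed inetranges and PostGIS points from the raw columns
INSERT INTO geoips (ips, geo, country_code)
SELECT inetrange(start_ip::inet, stop_ip::inet, '[]'),
       ST_MakePoint(longitude, latitude),
       country_code
FROM geoips_raw;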

Then querying records containing a certain IP address can be written as:

SELECT * FROM geoips WHERE ips @> inet '67.185.41.77';

For 6 million records in a roughly 600MB table, a brute-force sequential scan on the author's machine averages about 900ms per lookup, i.e. roughly 1.1 QPS per core; even a 48-core production machine would only manage thirty to forty QPS. Clearly unusable. So add a GiST index on the range column:

CREATE INDEX ON geoips USING GiST(ips);

Query time drops from around one second to 340 microseconds, a roughly 3000x improvement.

-- pgbench
\set ip random(0,4294967295)
SELECT * FROM geoips WHERE ips @> :ip::BIGINT::INET;

-- result
latency average = 0.342 ms
tps = 2925.100036 (including connections establishing)
tps = 2926.151762 (excluding connections establishing)

Converted to production throughput, that is roughly 100,000 QPS: absolutely delightful.

If you need to convert geographic coordinates to administrative divisions, you can refer to the previous article: Using PostGIS to efficiently solve administrative division geocoding problems.

A single geocoding call also takes only about 100 microseconds, so the end-to-end conversion from IP to province-city-district-county easily sustains tens of thousands of QPS on a single machine (running at full load all day would amount to seven to eight billion calls; you simply can't max it out).
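Putting the two lookups together, an IP-to-administrative-division query could be sketched like this. The adm_divisions table and its columns are hypothetical, standing in for the fence table from that article:

-- One round trip from IP address to administrative division (sketch)
SELECT a.province, a.city, a.district
FROM geoips g
JOIN adm_divisions a ON ST_Contains(a.fence, g.geo)
WHERE g.ips @> inet '67.185.41.77';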


0x04 EXCLUDE Constraints

The problem has been basically solved at this point, but there’s still one issue. How to avoid the embarrassing situation of one IP returning two records?

Data integrity is extremely important, but data integrity guaranteed by applications isn’t always reliable: people make mistakes, programs have bugs. If data integrity can be enforced through database constraints, that would be ideal.

However, some constraints are quite complex: for example, ensuring that the IP ranges in a table don't overlap, or that the boundaries of cities in an administrative-division table don't overlap. Such guarantees were traditionally hard to implement: a UNIQUE constraint cannot express this semantics, and while CHECK constraints, stored procedures, or triggers can implement such checks, they are quite tricky to get right. PostgreSQL's EXCLUDE constraint solves this problem elegantly. Let's modify our geoips table:

create table geoips
(
  ips          inetrange,
  geo          geometry(Point),
  country_code text,
  region_code  text,
  city_name    text,
  ad_code      text,
  postal_code  text,
  EXCLUDE USING gist (ips WITH &&) DEFERRABLE INITIALLY DEFERRED 
);

Here EXCLUDE USING gist (ips WITH &&) means that no overlapping ranges are allowed in the ips column: a newly inserted range must not overlap (&&) with any existing one. DEFERRABLE INITIALLY DEFERRED means the constraint check is deferred until the end of the transaction instead of being performed row by row. Creating this constraint automatically builds a GiST index on the ips field, so there is no need to create one manually.
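A quick way to see the constraint at work (the range values here are arbitrary):

-- the first range goes in cleanly
INSERT INTO geoips (ips) VALUES ('[10.0.0.1,10.0.0.100]'::inetrange);
-- this one overlaps the row above, so the deferred check fails at commit:
INSERT INTO geoips (ips) VALUES ('[10.0.0.50,10.0.0.200]'::inetrange);
-- ERROR:  conflicting key value violates exclusion constraint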


0x05 Summary

This article introduced how to use PostgreSQL features to efficiently and elegantly solve the IP geolocation lookup problem. Performance is excellent - 0.3ms to locate among 6 million records; complexity is ridiculously low - just one table DDL solves this problem without even explicitly creating indexes; data integrity is fully guaranteed - problems that would take hundreds of lines of code to solve now only require adding constraints, fundamentally ensuring data integrity.

PostgreSQL is so awesome, quickly learn and use it! What? You ask me where to find the data? Search for MaxMind for the truth - you can find free GeoIP data in hidden little corners.
