Importing GeoIP2 and GeoLite2 databases to MySQL
On Wednesday, November 27, 2024 we updated this tutorial to address an issue in which some IPv4 and IPv6 addresses could overlap. If you used a version of this tutorial prior to that date, please consult the updated instructions for querying the data in the following sections below:
This guide will show you how to import GeoIP2 or GeoLite2 databases into MySQL so that they can be easily queried and manipulated on your server.
Importing a CSV database consists of downloading the database, extracting it, creating tables to hold the data, and indexing those tables for faster querying.
Download and extract the database
Begin by making sure you have downloaded the latest version of the GeoIP2 or GeoLite2 database that you want to import. You can download databases through your account portal. CSV format databases are shipped as a single zip file. See our CSV database documentation for details about the zip structure and contents of the archive. For the purposes of this tutorial, we will be using the GeoIP2 City CSV file, but you can get information about any of our CSV format databases and adapt the following instructions accordingly.
Keep your database updated
If you are importing databases for use in an ongoing manner, you will want to automate the process of downloading and extracting the CSV files so that you can be sure that your database is always up to date.
Extract the CSV files
Once you have downloaded the database, unpack the zip file in your desired directory. If you are importing the GeoIP2 City database, you will have a number of files. In this tutorial we will be working with the following files:
GeoIP2-City-Blocks-IPv4.csv
GeoIP2-City-Blocks-IPv6.csv
GeoIP2-City-Locations-en.csv
If you are working in a language other than English, you can choose the
appropriate Locations
file from the zip archive. For example, if you want to
load Chinese location names into MySQL, you can use
GeoIP2-City-Locations-zh-CN.csv
instead of GeoIP2-City-Locations-en.csv
. Our
documentation on CSV format databases includes a
list of all location files currently included in GeoIP2 and GeoLite2 databases.
Create a table for network data
First we create a table to hold the network information contained in
GeoIP2-City-Blocks-IPv4.csv
and GeoIP2-City-Blocks-IPv6.csv
.
Converting the network field
You can find the full description of the schema for these files in
the section on Blocks
files for GeoIP2 and GeoLite2 CSV databases.
If you are working with a different database, you can find the Blocks
file
schema for the appropriate database, and adapt the
table to meet that structure.
The network
field of the database uses CIDR notation. Unfortunately MySQL
doesn’t offer any capabilities to work with data in that format, so we’ll first
have to convert the networks into something else that we can easily query later
on. We chose to represent networks as a pair of IP addresses that are the first
and last address in the network, respectively. We will convert this field so
that both of these IP addresses will be represented as hexadecimal numbers.
We can use the database conversion tool to convert this field to hexadecimal numbers. Once you have downloaded the program and installed it to the same directory where you have extracted your CSV files, you can run it:
1$ ./geoip2-csv-converter -block-file GeoIP2-City-Blocks-IPv4.csv -include-hex-range -output-file GeoIP2-City-Blocks-IPv4-Hex.csv
2$ ./geoip2-csv-converter -block-file GeoIP2-City-Blocks-IPv6.csv -include-hex-range -output-file GeoIP2-City-Blocks-IPv6-Hex.csv
Schema
We can now create a table called geoip2_network
to hold the data we just
converted. We’ll represent the IP addresses using the type varbinary(16)
,
which will be large enough to represent 128-bit (16 byte) IPv6 addresses.
1create table geoip2_network (
2 network_start varbinary(16) not null,
3 network_end varbinary(16) not null,
4 geoname_id int,
5 registered_country_geoname_id int,
6 represented_country_geoname_id int,
7 is_anonymous_proxy bool,
8 is_satellite_provider bool,
9 postal_code text,
10 latitude float,
11 longitude float,
12 accuracy_radius int,
13 is_anycast bool,
14 index(network_start),
15 index(network_end)
16);
Note that we’re adding two separate indexes for network_start
and
network_end
. If we used a composite index over both columns, we wouldn’t be
able to speed up the queries we’ll be using later on.
Load data into the network table
We can now import the contents of GeoIP2-City-Blocks-IPv4.csv
and
GeoIP2-City-Blocks-IPv6.csv
into the table we just created.
We’ll start by loading the converted IPv6 data:
1load data infile '/var/lib/mysql-files/GeoIP2-City-Blocks-IPv6-Hex.csv'
2into table geoip2_network
3fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 rows
4(@network_start, @network_end, @geoname_id, @registered_country_geoname_id, @represented_country_geoname_id,
5 @is_anonymous_proxy, @is_satellite_provider, @postal_code, @latitude, @longitude, @accuracy_radius, @is_anycast)
6set network_start = unhex(@network_start),
7 network_end = unhex(@network_end),
8 geoname_id = nullif(@geoname_id, ''),
9 registered_country_geoname_id = nullif(@registered_country_geoname_id, ''),
10 represented_country_geoname_id = nullif(@represented_country_geoname_id, ''),
11 is_anonymous_proxy = nullif(@is_anonymous_proxy, ''),
12 is_satellite_provider = nullif(@is_satellite_provider, ''),
13 postal_code = nullif(@postal_code, ''),
14 latitude = nullif(@latitude, ''),
15 longitude = nullif(@longitude, ''),
16 accuracy_radius = nullif(@accuracy_radius, ''),
17 is_anycast = nullif(@is_anycast, '');
We can load the converted IPv4 data in the same way:
1load data infile '/var/lib/mysql-files/GeoIP2-City-Blocks-IPv4-Hex.csv'
2into table geoip2_network
3fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 rows
4(@network_start, @network_end, @geoname_id, @registered_country_geoname_id, @represented_country_geoname_id,
5 @is_anonymous_proxy, @is_satellite_provider, @postal_code, @latitude, @longitude, @accuracy_radius, @is_anycast)
6set network_start = unhex(@network_start),
7 network_end = unhex(@network_end),
8 geoname_id = nullif(@geoname_id, ''),
9 registered_country_geoname_id = nullif(@registered_country_geoname_id, ''),
10 represented_country_geoname_id = nullif(@represented_country_geoname_id, ''),
11 is_anonymous_proxy = nullif(@is_anonymous_proxy, ''),
12 is_satellite_provider = nullif(@is_satellite_provider, ''),
13 postal_code = nullif(@postal_code, ''),
14 latitude = nullif(@latitude, ''),
15 longitude = nullif(@longitude, ''),
16 accuracy_radius = nullif(@accuracy_radius, ''),
17 is_anycast = nullif(@is_anycast, '');
Note, the data will be imported even if the CSV file has extra columns that are missing in the table.
Test our table by querying it
With everything loaded we’re now ready to look up an IP address in the database.
As we’re representing IP addresses as varbinary(16)
, we’ll first have to
convert the textual representation of the IP address we’re interested in to that
same type using MySQL’s built-in inet6_aton
function.
1select geoname_id, registered_country_geoname_id, represented_country_geoname_id,
2 postal_code, latitude, longitude, accuracy_radius
3from geoip2_network
4where
5 length(inet6_aton('214.0.0.0')) = length(network_start)
6 and inet6_aton('214.0.0.0') between network_start and network_end
7limit 1;
1+------------+-------------------------------+--------------------------------+-------------+----------+-----------+-----------------+
2| geoname_id | registered_country_geoname_id | represented_country_geoname_id | postal_code | latitude | longitude | accuracy_radius |
3+------------+-------------------------------+--------------------------------+-------------+----------+-----------+-----------------+
4| 6252001 | 6252001 | NULL | NULL | 37.751 | -97.822 | 1000 |
5+------------+-------------------------------+--------------------------------+-------------+----------+-----------+-----------------+
61 row in set (0.03 sec)
While that yields the correct results, we notice that the query performance could be better. Let’s improve it.
Order the table for faster searching
One way of speeding up the query is the addition of order by network_end
like
so:
1select geoname_id, registered_country_geoname_id, represented_country_geoname_id,
2 postal_code, latitude, longitude, accuracy_radius
3from geoip2_network
4where
5 length(inet6_aton('214.0.0.0')) = length(network_start)
6 and inet6_aton('214.0.0.0') between network_start and network_end
7order by network_end
8limit 1;
1+------------+-------------------------------+--------------------------------+-------------+----------+-----------+-----------------+
2| geoname_id | registered_country_geoname_id | represented_country_geoname_id | postal_code | latitude | longitude | accuracy_radius |
3+------------+-------------------------------+--------------------------------+-------------+----------+-----------+-----------------+
4| 6252001 | 6252001 | NULL | NULL | 37.751 | -97.822 | 1000 |
5+------------+-------------------------------+--------------------------------+-------------+----------+-----------+-----------------+
61 row in set (0.00 sec)
While that addresses the performance concern we’ve had with the previous query, this query will still perform poorly for addresses not contained in our GeoIP2 database:
1select geoname_id, registered_country_geoname_id, represented_country_geoname_id,
2 postal_code, latitude, longitude, accuracy_radius
3from geoip2_network
4where
5 length(inet6_aton('127.0.0.1')) = length(network_start)
6 and inet6_aton('127.0.0.1') between network_start and network_end
7order by network_end
8limit 1;
1Empty set (4.45 sec)
Break up our query for faster searching
We can work around this problem by breaking our query up into two parts such that MySQL will be able to use the indexes we’ve created more efficiently:
1select geoname_id, registered_country_geoname_id, represented_country_geoname_id,
2 postal_code, latitude, longitude, accuracy_radius
3from (
4 select *
5 from geoip2_network
6 where
7 length(inet6_aton('214.0.0.0')) = length(network_start)
8 and inet6_aton('214.0.0.0') >= network_start
9 order by network_start desc
10 limit 1
11) net
12where inet6_aton('214.0.0.0') <= network_end;
1+------------+-------------------------------+--------------------------------+-------------+----------+-----------+-----------------+
2| geoname_id | registered_country_geoname_id | represented_country_geoname_id | postal_code | latitude | longitude | accuracy_radius |
3+------------+-------------------------------+--------------------------------+-------------+----------+-----------+-----------------+
4| 6252001 | 6252001 | NULL | NULL | 37.751 | -97.822 | 1000 |
5+------------+-------------------------------+--------------------------------+-------------+----------+-----------+-----------------+
61 row in set (0.00 sec)
1select geoname_id, registered_country_geoname_id, represented_country_geoname_id,
2 postal_code, latitude, longitude, accuracy_radius
3from (
4 select *
5 from geoip2_network
6 where
7 length(inet6_aton('127.0.0.1')) = length(network_start)
8 and inet6_aton('127.0.0.1') >= network_start
9 order by network_start desc
10 limit 1
11) net
12where inet6_aton('127.0.0.1') <= network_end;
1Empty set (0.00 sec)
Using that construct gives us good query performance for all addresses, regardless of whether or not the GeoIP2 database contains any information about them. Depending on your application, you might want to consider encapsulating this complexity/verbosity in a function. Alternatively, MySQL also offers spatial data types that can be used to achieve similar performance while allowing for queries to be expressed more naturally.
Optional: Create a table for locations data
If postal_code
, latitude
, longitude
, and accuracy_radius
are everything
we’re interested in we’d be done at this point and our application would be
easily able to query what it needs. However, GeoIP2 databases provide additional
location information. Note the geoname_id
field in our network table. This
field can be used to look up additional information about the geolocation from
the Locations
files we downloaded before. We’ll load this data into MySQL
next.
Locations table schema
We start by creating a table as before. As with the Blocks
files, the schema
for GeoIP2 and GeoLite2 City locations files can be found in the
CSV section of the database documentation.
We’ll name this table geoip2_location
:
1create table geoip2_location (
2 geoname_id int not null,
3 locale_code text not null,
4 continent_code text,
5 continent_name text,
6 country_iso_code text,
7 country_name text,
8 subdivision_1_iso_code text,
9 subdivision_1_name text,
10 subdivision_2_iso_code text,
11 subdivision_2_name text,
12 city_name text,
13 metro_code int,
14 time_zone text,
15 is_in_european_union bool,
16 primary key (geoname_id, locale_code(5))
17);
Load data into the locations table
We then populate our geoip2_location
table from a Locations
CSV file. In
this case we will be populating the table from GeoIP2-City-Locations-en.csv
.
Using the file with the -en
suffix will give us the names of geolocations in
English:
1load data infile '/var/lib/mysql-files/GeoIP2-City-Locations-en.csv'
2into table geoip2_location
3fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 rows (
4 geoname_id, locale_code, continent_code, continent_name,
5 @country_iso_code, @country_name, @subdivision_1_iso_code, @subdivision_1_name,
6 @subdivision_2_iso_code, @subdivision_2_name, @city_name, @metro_code, @time_zone,
7 is_in_european_union
8)
9set country_iso_code = nullif(@country_iso_code, ''),
10 country_name = nullif(@country_name, ''),
11 subdivision_1_iso_code = nullif(@subdivision_1_iso_code, ''),
12 subdivision_1_name = nullif(@subdivision_1_name, ''),
13 subdivision_2_iso_code = nullif(@subdivision_2_iso_code, ''),
14 subdivision_2_name = nullif(@subdivision_2_name, ''),
15 city_name = nullif(@city_name, ''),
16 metro_code = nullif(@metro_code, ''),
17 time_zone = nullif(@time_zone, '');
Note that there’s a number of different Locations
files available. The other
files with different language suffixes contain localized versions of the -en
data in different languages for some of the geoname_id
s. Depending on your
application’s needs you may decide to import additional Locations
files into
localized tables. For example, you could load GeoIP2-City-Locations-en.csv
into a table called geoip2_location-en
, and load
GeoIP2-City-Locations-zh-CN.csv
into a table called geoip2_location-zh
. You
could then query whichever locations table you needed for English or Chinese
respectively.
Query our tables
We can now use our geoip2_location
table to resolve the geoname_id
s provided
by our geoip2_network
table. For example:
1select latitude, longitude, accuracy_radius, continent_name, country_name, subdivision_1_name, city_name
2from (
3 select *
4 from geoip2_network
5 where
6 length(inet6_aton('214.0.0.0')) = length(network_start)
7 and inet6_aton('214.0.0.0') >= network_start
8 order by network_start desc
9 limit 1
10) net
11left join geoip2_location location on (
12 net.geoname_id = location.geoname_id and location.locale_code = 'en'
13)
14where inet6_aton('214.0.0.0') <= network_end;
1+----------+-----------+-----------------+----------------+---------------+--------------------+-----------+
2| latitude | longitude | accuracy_radius | continent_name | country_name | subdivision_1_name | city_name |
3+----------+-----------+-----------------+----------------+---------------+--------------------+-----------+
4| 37.751 | -97.822 | 1000 | North America | United States | NULL | NULL |
5+----------+-----------+-----------------+----------------+---------------+--------------------+-----------+
61 row in set (0.00 sec)
Here we were only interested in English results, but we can adjust our join condition if we were interested in different or additional languages.
Note how a left outer join is used. This is because additional location
information might not be available for any given row of our geoip2_network
table. For example, some IP addresses cannot be resolved to a city or
subdivision. With the left join we’d still receive latitude
, longitude
, and
accuracy_radius
as a result of our query if available, while an inner join
would result in zero rows if no additional location information was available.
In addition to the geoname_id
column that provides location information for a
network, there’s also registered_country_geoname_id
and
represented_country_geoname_id
, which provide location information about the
country in which the ISP has registered the network, and the country which is
represented by users of the IP address, respectively. The location data for both
can be included by additional joins:
1select latitude, longitude, accuracy_radius,
2 location.continent_name as location_continent_name,
3 location.country_name as location_country_name,
4 location.subdivision_1_name as location_subdivision_1_name,
5 location.city_name as location_city_name,
6 registered_country.continent_name as registered_country_continent_name,
7 registered_country.country_name as registered_country_country_name,
8 represented_country.continent_name as represented_country_continent_name,
9 represented_country.country_name as represented_country_country_name
10from (
11 select *
12 from geoip2_network
13 where
14 length(inet6_aton('214.0.0.0')) = length(network_start)
15 and inet6_aton('214.0.0.0') >= network_start
16 order by network_start desc
17 limit 1
18) net
19left join geoip2_location location on (
20 net.geoname_id = location.geoname_id and location.locale_code = 'en'
21)
22left join geoip2_location registered_country on (
23 net.registered_country_geoname_id = registered_country.geoname_id
24 and registered_country.locale_code = 'en'
25)
26left join geoip2_location represented_country on (
27 net.represented_country_geoname_id = represented_country.geoname_id
28 and represented_country.locale_code = 'en'
29)
30where inet6_aton('214.0.0.0') <= network_end;
1+----------+-----------+-----------------+-------------------------+-----------------------+-----------------------------+--------------------+-----------------------------------+---------------------------------+------------------------------------+----------------------------------+
2| latitude | longitude | accuracy_radius | location_continent_name | location_country_name | location_subdivision_1_name | location_city_name | registered_country_continent_name | registered_country_country_name | represented_country_continent_name | represented_country_country_name |
3+----------+-----------+-----------------+-------------------------+-----------------------+-----------------------------+--------------------+-----------------------------------+---------------------------------+------------------------------------+----------------------------------+
4| 37.751 | -97.822 | 1000 | North America | United States | NULL | NULL | North America | United States | NULL | NULL |
5+----------+-----------+-----------------+-------------------------+-----------------------+-----------------------------+--------------------+-----------------------------------+---------------------------------+------------------------------------+----------------------------------+
61 row in set (0.00 sec)