How to Do a "Simple" Data Migration

Imagine you have a bookshelf full of books. At some point, you need to move those books to a new shelf. The old bookshelf and the new one have different structures, and you need to organise the books so they fit properly in the new space.

In that scenario, you will have to select what actually goes to the new shelf, respecting its limits and the final organisation of the books according to the new structure.

This analogy illustrates the steps required to migrate data between different databases.

The topics presented here are the result of a data migration project between databases of different technologies, different structures, different business rules, and a tight development and testing deadline. The goal is to show a “simple” workflow without relying on any specific technology beyond SQL itself. I don’t consider these steps the best possible strategy — just what I was able to do with what I had available.

Considerations #

Things to observe before starting the migration:

Understand the DBMS Limitations (Source and Target) #

Understanding how the DBMS and its tools work will help anticipate problems that may occur during the migration. Some scripts, depending on how the database and the business are structured, can contain more than 1 million records, and not every tool or DBMS will be able to execute them. Some clients perform better than others, and some may perform better depending on the target DBMS. Other clients may have issues with the script’s locale configuration, potentially skipping some statements.

Beyond the tooling, the DBMS itself can be a limiting factor. One common problem is how it processes scripts, to the point of locking up the server.

Identify the Structural Peculiarities of Each Database #

Each table may have one or more dependencies (foreign keys or FKs). These dependencies affect the order in which data must be imported. In addition, tables containing enums and code data should be listed and handled before the main data. It is also necessary to check for duplicate or similar records across multiple sources to avoid redundancy.

Indexes on source databases can speed up reading and generating intermediate files. On target databases, indexes may limit the number of records inserted or updated per second. In both cases, it is necessary to evaluate the need and impact of creating or removing them.

Avoid Subselects with Inline Arguments #

Subselects will sometimes be necessary. The issue arises when you need to use a column value to perform additional lookups — this significantly increases execution time. The solution is to transform that lookup into a dataset to be used as an argument in an IN clause. For example:

-- Avoid:
SELECT *
FROM TABLE T
WHERE id <> (SELECT id FROM TABLE2 T2 WHERE T.value = T2.value)

-- Prefer:
SELECT *
FROM TABLE T
WHERE id IN (SELECT id FROM TABLE2 T2 WHERE T2.value = argument)

-- Even better:
SELECT *
FROM TABLE T
WHERE EXISTS (SELECT id FROM TABLE2 T2 WHERE T2.value = id)

Process #

Create a “Composer” Script #

A typical SQL script has the following structure:

SELECT column1, column2
FROM table
WHERE column3 = argument

The Composer is a script structure that extracts data from the source database and formats the output in the pattern expected by the target database. For example:

SELECT 'INSERT INTO table (column1, column2) VALUES (''' + source_column1 + ''', ''' + source_column2 + ''');' AS "-- TABLE"
FROM table
WHERE columnX = argumentY

The result is a file ready to execute on the target database:

-- TABLE
INSERT INTO table (column1, column2) VALUES ('Value1', 'Value2');
INSERT INTO table (column1, column2) VALUES ('Value3', 'Value4');

Use Functions/Procedures for Critical Data #

Some critical data may have duplicates, incomplete records, or require updates due to business rules. Functions and procedures allow you to control this data by creating auxiliary tables:

SUCCESS_TABLE: records successfully saved data.
FAILURE_TABLE: identifies inconsistent or duplicate records.

SELECT 'SELECT function(''' + source_column1 + ''', ''' + source_column2 + ''');' AS "-- FUNCTION_TABLE"
FROM table
WHERE columnX = argumentY

Split Results into Separate Files #

To avoid a single file containing all INSERT statements — which is hard to manage or update — it is recommended to split records by table or business rule. Some of those separate files still reached 300MB in size.

Verify That All Records Were Processed #

There were cases where some records were not inserted due to encoding conflicts between the tool and the executed script, causing some records to be silently skipped. Verifying the record count after script execution will prevent unwanted surprises.

Final Thoughts #

Point-in-time extractions with few records, from non-conventional data sources (CSV, Excel, etc.), with simple business models, can benefit from the workflow presented here.

Some situations may allow the use of tools like Pentaho PDI or Airflow. In others, corporate policies will restrict access to target systems, requiring that all changes be made through scripts executed by credentialed personnel.