How to Implement ETL Processes with TypeScript

By squashlabs, Last Updated: October 13, 2023

The Purpose of TypeScript in ETL

TypeScript is a superset of JavaScript that adds static typing to the language. It compiles down to plain JavaScript, making it compatible with existing JavaScript libraries and frameworks. TypeScript gives developers the ability to catch potential errors and bugs during the development phase, leading to more robust and maintainable code.

When it comes to implementing ETL (Extract, Transform, Load) processes, TypeScript can be a valuable tool. ETL processes involve extracting data from various sources, transforming it into a desired format, and loading it into a target system or data warehouse. These processes often deal with large volumes of data and require careful handling to ensure accurate and efficient data processing.

The purpose of TypeScript in ETL is to provide developers with a type-safe and efficient way to implement these processes. By using TypeScript, developers can leverage the benefits of static typing to catch errors early on, improve code quality, and enhance the overall reliability of their ETL workflows.

Example 1: Basic TypeScript ETL Workflow

To illustrate the purpose of TypeScript in ETL, let’s consider a basic ETL workflow that involves extracting data from a CSV file, transforming it, and loading it into a PostgreSQL database. We’ll use the popular library typeorm for interacting with the database.

First, let’s install the necessary dependencies:

npm install typescript typeorm csv-parser pg

Next, we’ll create a TypeScript file, etl.ts, and implement the ETL workflow:

import * as fs from 'fs';
import csvParser from 'csv-parser';
import { createConnection } from 'typeorm';

interface UserData {
  name: string;
  age: number;
  email: string;
}

async function extractDataFromCSV(filePath: string): Promise<UserData[]> {
  return new Promise((resolve, reject) => {
    const data: UserData[] = [];

    fs.createReadStream(filePath)
      .pipe(csvParser())
      .on('data', (row: any) => {
        data.push({
          name: row.name,
          age: parseInt(row.age, 10), // CSV values arrive as strings
          email: row.email,
        });
      })
      .on('end', () => {
        resolve(data);
      })
      .on('error', (error: Error) => {
        reject(error);
      });
  });
}

async function transformData(userData: UserData[]): Promise<UserData[]> {
  return userData.map((user) => ({
    ...user,
    name: user.name.toUpperCase(),
  }));
}

async function loadDataToDatabase(userData: UserData[]): Promise<void> {
  // createConnection() with no arguments reads its options from ormconfig
  const connection = await createConnection();

  // The UserData interface does not exist at runtime, so we target the
  // table by name ('users' here) instead of passing the interface to insert()
  await connection
    .createQueryBuilder()
    .insert()
    .into('users')
    .values(userData)
    .execute();

  await connection.close();
}

async function runETL(filePath: string): Promise<void> {
  try {
    const extractedData = await extractDataFromCSV(filePath);
    const transformedData = await transformData(extractedData);
    await loadDataToDatabase(transformedData);
    console.log('ETL process completed successfully.');
  } catch (error) {
    console.error('An error occurred during the ETL process:', error);
  }
}

runETL('data.csv');

In this example, we define an interface UserData to represent the structure of the data extracted from the CSV file. We then implement three functions: extractDataFromCSV, transformData, and loadDataToDatabase. These functions handle the extraction, transformation, and loading steps of the ETL process, respectively.
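To run the workflow, compile and execute the file with Node.js. One convenient option is ts-node, which is not part of the install command above, so add it first:

npm install --save-dev ts-node
npx ts-node etl.ts

Note that createConnection() expects the PostgreSQL connection options to be available, typically via an ormconfig.json file in the project root.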

Example 2: TypeScript ETL Pipeline with Type Checking

Let’s consider a more complex ETL scenario where we have to extract data from multiple sources, perform a series of transformations, and load the transformed data into multiple target systems. We’ll use the axios library for making HTTP requests to an external API.

First, let’s install the necessary dependencies:

npm install typescript axios

Next, we’ll create a TypeScript file, etl.ts, and implement the ETL pipeline:

import axios from 'axios';

interface UserData {
  name: string;
  age: number;
  email: string;
}

interface APIResponse {
  results: {
    name: {
      first: string;
      last: string;
    };
    dob: {
      age: number;
    };
    email: string;
  }[];
}

async function extractDataFromAPI(): Promise<UserData[]> {
  // Typing the GET call lets the compiler check every field access below
  const response = await axios.get<APIResponse>('https://randomuser.me/api/?results=10');
  const { results } = response.data;

  return results.map((result) => ({
    name: `${result.name.first} ${result.name.last}`,
    age: result.dob.age,
    email: result.email,
  }));
}

async function extractDataFromCSV(filePath: string): Promise<UserData[]> {
  // implementation omitted for brevity (see Example 1)
  return [];
}

async function transformData(userData: UserData[]): Promise<UserData[]> {
  return userData.map((user) => ({
    ...user,
    name: user.name.toUpperCase(),
  }));
}

async function loadDataToDatabase(userData: UserData[]): Promise<void> {
  // implementation omitted for brevity (see Example 1)
}

async function runETL(): Promise<void> {
  try {
    const apiData = await extractDataFromAPI();
    const csvData = await extractDataFromCSV('data.csv');
    const combinedData = [...apiData, ...csvData];
    const transformedData = await transformData(combinedData);
    await loadDataToDatabase(transformedData);
    console.log('ETL pipeline completed successfully.');
  } catch (error) {
    console.error('An error occurred during the ETL pipeline:', error);
  }
}

runETL();

In this example, we define an interface APIResponse to represent the structure of the response from the external API and pass it as the type argument to axios.get, so the compiler can check every field access on the response. The extractDataFromAPI function then maps the raw results into the UserData format.

Type Safety in ETL Processes with TypeScript

When implementing ETL processes, data integrity and accuracy are of utmost importance. Type safety plays a crucial role in ensuring that the data is correctly formatted and compatible with the target system.

TypeScript provides static typing, which allows developers to define the types of variables, function parameters, and return values. This helps catch potential errors and bugs at compile-time, rather than at runtime, leading to more robust and reliable ETL processes.

In the context of ETL, type safety can prevent common issues such as the ones below; a small validation sketch follows the list.

– Mismatched data types: Type safety helps ensure that the data extracted from various sources is compatible with the expected types in the transformation and loading steps. For example, if the raw CSV rows are given an explicit type whose age field is a string, TypeScript forces the developer to convert the value (for example with parseInt) before it can be assigned to a numeric field.

– Missing or invalid data: Static types describe code, not runtime data, but they pair naturally with runtime guards. By funneling every extracted record through a typed validation function, missing or invalid fields are detected at the boundary, and the compiler guarantees that every caller handles the failure case.

– Inconsistent data structures: ETL processes often involve combining data from multiple sources, such as APIs, databases, or files. Type safety can ensure that the data structures are consistent across different sources, reducing the risk of data corruption or loss.
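
Here is a minimal sketch of such a typed validation step; the field names and the null-based error handling are illustrative assumptions, not part of any particular library:

interface UserData {
  name: string;
  age: number;
  email: string;
}

// Runtime guard: narrows an untyped parsed row to UserData, or returns null.
// The return type forces every caller to handle the failure case.
function parseUserRow(row: Record<string, string>): UserData | null {
  const age = parseInt(row.age, 10);

  if (!row.name || !row.email || Number.isNaN(age)) {
    return null; // missing or invalid data is caught at the boundary
  }

  return { name: row.name, age, email: row.email };
}

// Usage: rows that fail validation are filtered out before transformation
const rows: Record<string, string>[] = [
  { name: 'Ada', age: '36', email: 'ada@example.com' },
  { name: '', age: 'n/a', email: 'broken@example.com' },
];

const users: UserData[] = rows
  .map(parseUserRow)
  .filter((user): user is UserData => user !== null);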

Example 1: Type Safety in Data Transformation

Let’s consider an example where we need to transform a CSV file containing user data into a JSON format suitable for loading into a NoSQL database. We’ll use the csv-parser library for parsing the CSV file and the jsonfile library for writing the transformed data to a JSON file.

First, let’s install the necessary dependencies:

npm install typescript csv-parser jsonfile

Next, we’ll create a TypeScript file, transform.ts, and implement the data transformation:

import * as fs from 'fs';
import csvParser from 'csv-parser';
import * as jsonfile from 'jsonfile';

interface UserData {
  name: string;
  age: number;
  email: string;
}

function transformCSVToJSON(csvFilePath: string, jsonFilePath: string): void {
  const transformedData: UserData[] = [];

  fs.createReadStream(csvFilePath)
    .pipe(csvParser())
    .on('data', (row: any) => {
      transformedData.push({
        name: row.name,
        age: parseInt(row.age, 10),
        email: row.email,
      });
    })
    .on('end', () => {
      jsonfile.writeFile(jsonFilePath, transformedData, (error: Error | null) => {
        if (error) {
          console.error('An error occurred while writing the JSON file:', error);
        } else {
          console.log('Data transformation completed successfully.');
        }
      });
    })
    .on('error', (error: Error) => {
      console.error('An error occurred during data transformation:', error);
    });
}

transformCSVToJSON('data.csv', 'transformedData.json');

In this example, we define an interface UserData to represent the structure of the transformed data. By specifying the types of the name, age, and email fields, we ensure that the transformed data adheres to the expected structure.

Example 2: Type Safety in Data Loading

Let’s consider an example where we need to load data from a CSV file into a PostgreSQL database. We’ll use the typeorm library for interacting with the database.

First, let’s install the necessary dependencies:

npm install typescript typeorm csv-parser pg

Next, we’ll create a TypeScript file, load.ts, and implement the data loading process:

import * as fs from 'fs';
import csvParser from 'csv-parser';
import { createConnection } from 'typeorm';

interface UserData {
  name: string;
  age: number;
  email: string;
}

async function loadDataToDatabase(filePath: string): Promise<void> {
  const connection = await createConnection();
  const rows: UserData[] = [];

  fs.createReadStream(filePath)
    .pipe(csvParser())
    .on('data', (row: any) => {
      // Collect rows first; awaiting inside a 'data' handler does not
      // pause the stream, so inserts could still be pending at 'end'
      rows.push({
        name: row.name,
        age: parseInt(row.age, 10),
        email: row.email,
      });
    })
    .on('end', async () => {
      // Insert by table name ('users' here), since the UserData
      // interface does not exist at runtime
      await connection
        .createQueryBuilder()
        .insert()
        .into('users')
        .values(rows)
        .execute();

      await connection.close();
      console.log('Data loading completed successfully.');
    })
    .on('error', (error: Error) => {
      console.error('An error occurred during data loading:', error);
    });
}

loadDataToDatabase('data.csv');

In this example, we define an interface UserData to represent the structure of the data to be loaded into the database. By specifying the types of the name, age, and email fields, we ensure that the data being inserted into the database matches the expected structure.

Benefits of Using TypeScript for ETL

Using TypeScript for ETL processes offers several benefits that can greatly enhance the development and maintenance of these workflows.

1. Enhanced Code Quality and Maintainability

TypeScript’s static typing allows developers to catch potential errors and bugs early on, leading to more robust and reliable code. By defining the types of variables, function parameters, and return values, developers can ensure that the data is correctly formatted and compatible with the target system. This helps reduce the occurrence of runtime errors and improves the overall quality of the codebase.

Additionally, the use of TypeScript’s type system improves code maintainability. By providing explicit type annotations and enforcing type checking, the code becomes more self-documenting and easier to understand. As a result, developers can quickly grasp the intent and behavior of the code, making it easier to maintain and extend over time.

2. Early Error Detection

TypeScript’s static type checking allows developers to catch potential errors during the development phase, before the code is even executed. This early error detection helps reduce the time spent on debugging and testing, as many common errors are caught at compile-time. By addressing these errors early on, developers can ensure that the ETL processes are implemented correctly and produce accurate results.
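
As a small illustration, consider an object literal that omits a required field; the interface and values here are hypothetical:

interface UserData {
  name: string;
  age: number;
  email: string;
}

// Error TS2741: Property 'email' is missing in type '{ name: string; age: number; }'
// but required in type 'UserData'.
const user: UserData = { name: 'Ada', age: 36 };

The mistake surfaces the moment the file is compiled, long before any data flows through the pipeline.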

3. Improved Collaboration

When working on ETL processes, collaboration between developers and stakeholders is essential. TypeScript’s static typing improves collaboration by providing a common language for discussing and understanding the code. With clearly defined types and interfaces, developers can communicate their intentions and expectations more effectively. This reduces the likelihood of misunderstandings and helps ensure that everyone involved in the ETL processes is on the same page.

4. Tooling and Ecosystem Support

TypeScript has a rich ecosystem of tools and libraries that can greatly simplify the development of ETL processes. From IDEs with intelligent autocompletion and refactoring capabilities to build tools and testing frameworks, TypeScript provides a robust foundation for building and maintaining ETL workflows. Additionally, many popular libraries and frameworks have first-class support for TypeScript, allowing developers to leverage their full potential while implementing ETL processes.

Example 1: Enhanced Code Quality and Maintainability

Consider again the workflow from Example 1 at the beginning of this article: extracting data from a CSV file, transforming it, and loading it into a PostgreSQL database with the typeorm library. The code is identical to that example, so it is not repeated here.

In that workflow, TypeScript lets us define the types of the function parameters and return values, ensuring that the data is correctly formatted at each step. This helps catch potential errors early on and allows for better code understanding and maintenance.

For example, if we mistakenly pass a number instead of a file path string to the runETL function, TypeScript will throw a compile-time error, alerting us to the issue and stopping compilation. (Whether the file actually exists is a runtime concern that the type system cannot check.) This early error detection improves code quality and reduces the likelihood of runtime errors.
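
Concretely, a call like the following is rejected before the program can run:

// Error TS2345: Argument of type 'number' is not assignable to parameter of type 'string'.
runETL(42);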

Example 2: Early Error Detection

Let’s consider an example where we need to transform data extracted from a CSV file into a specific format. We’ll use the csv-parser library for parsing the CSV file.

First, let’s install the necessary dependencies:

npm install typescript csv-parser

Next, we’ll create a TypeScript file, transform.ts, and implement the data transformation:

import * as fs from 'fs';
import csvParser from 'csv-parser';

interface UserData {
  name: string;
  age: number;
  email: string;
}

// The raw CSV row: every field arrives as a string
interface CSVRow {
  name: string;
  age: string;
  email: string;
}

function transformCSVData(csvFilePath: string): Promise<UserData[]> {
  // The stream is asynchronous, so the result must be delivered via a
  // Promise; returning the array directly would yield it before 'end' fires
  return new Promise((resolve, reject) => {
    const transformedData: UserData[] = [];

    fs.createReadStream(csvFilePath)
      .pipe(csvParser())
      .on('data', (row: CSVRow) => {
        transformedData.push({
          name: row.name,
          age: parseInt(row.age, 10), // explicit string-to-number conversion
          email: row.email,
        });
      })
      .on('end', () => {
        console.log('Data transformation completed successfully.');
        resolve(transformedData);
      })
      .on('error', (error: Error) => {
        reject(error);
      });
  });
}

transformCSVData('data.csv').then((transformedData) => {
  console.log(transformedData);
});

In this example, the raw CSV row is typed as CSVRow, whose age field is a string, because CSV parsers deliver every field as text. The explicit parseInt call converts it to the number that UserData requires; if we forgot that conversion and assigned row.age directly, TypeScript would throw a compile-time error, alerting us to the issue before the code could run.

This early error detection helps catch potential issues early on and allows developers to address them before they cause problems in downstream processes.
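
For contrast, here is the broken variant the compiler rejects; it is intentionally non-compiling and shown for illustration only:

interface CSVRow {
  name: string;
  age: string; // CSV parsers yield every field as a string
  email: string;
}

interface UserData {
  name: string;
  age: number;
  email: string;
}

function toUser(row: CSVRow): UserData {
  // Error TS2322: Type 'string' is not assignable to type 'number'.
  return { name: row.name, age: row.age, email: row.email };
}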

Using TypeScript for Data Extraction in ETL

Data extraction is the first step in the ETL process, where data is fetched or retrieved from various sources such as databases, APIs, files, or external systems. TypeScript can be used effectively for data extraction in ETL, providing type safety and ensuring the integrity of the extracted data.

1. Extracting Data from Databases

When extracting data from databases, TypeScript can help ensure the integrity and compatibility of the extracted data. By leveraging TypeScript’s static typing, developers can define interfaces or types that match the structure of the database tables or query results. This allows for better code understanding and helps catch potential errors or mismatches between the extracted data and the expected structure.

Here’s an example of extracting data from a PostgreSQL database using TypeScript:

import 'reflect-metadata'; // required once for TypeORM's decorators
import { createConnection, Entity, PrimaryGeneratedColumn, Column } from 'typeorm';

// An entity class (rather than a plain interface), because getRepository()
// needs a value that exists at runtime
@Entity('users')
class User {
  @PrimaryGeneratedColumn()
  id: number;

  @Column()
  name: string;

  @Column()
  age: number;

  @Column()
  email: string;
}

async function extractDataFromDatabase(): Promise<User[]> {
  const connection = await createConnection();

  const userRepository = connection.getRepository(User);
  const users = await userRepository.find();

  await connection.close();

  return users;
}

// Top-level await is not available in CommonJS modules, so we use .then()
extractDataFromDatabase().then((extractedData) => {
  console.log(extractedData);
});

In this example, we define an entity class User (rather than a plain interface, since TypeORM needs a value that exists at runtime) to represent the structure of the data to be extracted from the database. By specifying the types of the id, name, age, and email fields, we ensure that the extracted data adheres to the expected structure.

2. Extracting Data from APIs

When extracting data from APIs, TypeScript can provide type safety and help ensure that the extracted data matches the expected structure. By defining interfaces or types that represent the structure of the API response, developers can catch potential errors or inconsistencies early on.

Here’s an example of extracting data from a REST API using TypeScript:

import axios from 'axios';

interface User {
  id: number;
  name: string;
  age: number;
  email: string;
}

async function extractDataFromAPI(): Promise<User[]> {
  // The type argument makes response.data a User[] for the compiler
  const response = await axios.get<User[]>('https://api.example.com/users');

  return response.data;
}

// Top-level await is not available in CommonJS modules, so we use .then()
extractDataFromAPI().then((extractedData) => {
  console.log(extractedData);
});

In this example, we define an interface User that represents the structure of the data to be extracted from the API. By specifying the types of the id, name, age, and email fields, we ensure that the extracted data adheres to the expected structure.

The Role of Type Checking in TypeScript ETL

TypeScript’s type checking plays a crucial role in ensuring the integrity and correctness of ETL processes. By enforcing type safety and catching potential errors at compile-time, type checking helps reduce the likelihood of runtime errors and improves the overall reliability of the ETL workflows.

1. Catching Errors Early

TypeScript’s type checking allows developers to catch potential errors and bugs early on, before the code is even executed. By defining the types of variables, function parameters, and return values, developers can ensure that the data is correctly formatted and compatible with the target system.

For example, if we mistakenly pass a string instead of a number to a function expecting a numeric parameter, TypeScript will throw a compile-time error, alerting us to the issue and preventing the code from being executed. This early error detection improves code quality and reduces the likelihood of runtime errors during the ETL process.

2. Improving Code Readability and Maintainability

TypeScript’s type system improves code readability and maintainability by providing explicit type annotations and enforcing type checking. By specifying the types of variables, function parameters, and return values, the code becomes more self-documenting and easier to understand.

For example, when working with complex data structures in ETL processes, type annotations make it clear what types of data each variable or function is expecting. This improves code understanding and reduces the likelihood of errors due to incorrect assumptions or misunderstandings.
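
For instance, a fully annotated signature documents an entire transformation step at a glance; the order-shaped types below are illustrative assumptions:

interface OrderRow {
  id: number;
  total: string;    // raw CSV value
  placedAt: string; // ISO date string
}

interface Order {
  id: number;
  total: number;
  placedAt: Date;
}

// The signature alone tells a reader what the step consumes and produces,
// without tracing the implementation.
function normalizeOrders(rows: OrderRow[]): Order[] {
  return rows.map((row) => ({
    id: row.id,
    total: parseFloat(row.total),
    placedAt: new Date(row.placedAt),
  }));
}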

Example 1: Catching Errors Early

Recall the CSV transformation from Example 2: Early Error Detection in the previous section, which parses data.csv with the csv-parser library and converts each typed CSVRow into the UserData format. The same code illustrates the role of type checking here, so it is not repeated.

As in that example, typing the raw CSV row as CSVRow means an age value left as a string cannot flow into the numeric age field of UserData: forgetting the parseInt conversion produces a compile-time error rather than corrupt data in the target system. This early error detection helps catch potential issues early on and allows developers to address them before they cause problems in downstream processes.