SQL Automation for Data Cleaning: A Case Study on U.S. Household Income
- Kwnstantinos Lambrou
- Jun 29
- 2 min read
Updated: Jul 15

Project Overview
In this SQL automation project, we design a full cleaning pipeline to prepare U.S. household income data for analytics. We automate duplicate removal, perform data standardization, and ensure consistent formatting for key fields such as state names, place types, and country names. The entire process is encapsulated in a stored procedure, which can be scheduled or triggered for continuous data hygiene.
Step-by-Step Breakdown
This procedure:
Creates the cleaned table (if it doesn't already exist),
Copies over the raw data,
Deduplicates rows based on id + TimeStamp,
Applies several standardization rules.
DELIMITER $$
DROP PROCEDURE IF EXISTS Copy_and_Clean_Data;
CREATE PROCEDURE Copy_and_Clean_Data()
BEGIN
-- Create the cleaned table on the first run
CREATE TABLE IF NOT EXISTS `us_house_hold_income_Cleaned` (
`row_id` INT, `id` INT, `State_Code` INT, `State_Name` TEXT,
`State_ab` TEXT, `Country` VARCHAR(255), `City` TEXT, `Place` TEXT,
`Type` TEXT, `Primary` TEXT, `Zip_Code` INT, `Area_Code` INT,
`ALand` INT, `AWater` INT, `Lat` DOUBLE, `Lon` DOUBLE,
`TimeStamp` TIMESTAMP
);
-- Copy the raw data across and stamp each row with the load time
INSERT INTO us_house_hold_income_Cleaned
SELECT *, CURRENT_TIMESTAMP
FROM ushouseholdincome;
-- Remove duplicates, keeping one row per id + TimeStamp combination
DELETE FROM us_house_hold_income_Cleaned
WHERE row_id IN (
    SELECT row_id FROM (
        SELECT row_id,
               ROW_NUMBER() OVER(PARTITION BY id, `TimeStamp` ORDER BY id, `TimeStamp`) AS row_num
        FROM us_house_hold_income_Cleaned
    ) AS duplicates
    WHERE row_num > 1
);
-- Standardization: fix known misspellings and enforce consistent casing
UPDATE us_house_hold_income_Cleaned
SET State_Name = 'Georgia'
WHERE State_Name = 'georia';
UPDATE us_house_hold_income_Cleaned SET Country = UPPER(Country);
UPDATE us_house_hold_income_Cleaned SET City = UPPER(City);
UPDATE us_house_hold_income_Cleaned SET Place = UPPER(Place);
UPDATE us_house_hold_income_Cleaned SET State_Name = UPPER(State_Name);
-- Correct mistyped or inconsistent place types
UPDATE us_house_hold_income_Cleaned SET `Type` = 'CDP' WHERE `Type` = 'CPD';
UPDATE us_house_hold_income_Cleaned SET `Type` = 'Borough' WHERE `Type` = 'Boroughs';
END $$
DELIMITER ;
-- Run the procedure
CALL Copy_and_Clean_Data();
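Once the procedure has run, a couple of quick checks (hypothetical queries written against the column names defined above) confirm that no id + TimeStamp duplicates remain and show what the standardized Type values look like:
-- Should return zero rows if deduplication worked
SELECT id, `TimeStamp`, COUNT(*) AS cnt
FROM us_house_hold_income_Cleaned
GROUP BY id, `TimeStamp`
HAVING COUNT(*) > 1;
-- Review the Type values after standardization
SELECT `Type`, COUNT(*) AS cnt
FROM us_house_hold_income_Cleaned
GROUP BY `Type`
ORDER BY cnt DESC;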
Automate with a Scheduled Event
DROP EVENT IF EXISTS run_data_cleaning;
CREATE EVENT run_data_cleaning
ON SCHEDULE EVERY 30 DAY
DO
CALL Copy_and_Clean_Data();
This event re-applies the cleaning every 30 days, which is useful for pipelines where new data arrives regularly.
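Depending on the server configuration, MySQL's event scheduler may need to be switched on before the event will fire (a server-level setting, not part of the script above):
SET GLOBAL event_scheduler = ON;
-- Confirm the event is registered and enabled
SHOW EVENTS;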
Optional: Real-Time Trigger
Instead of waiting for the schedule, a trigger on the raw table can kick off the cleaning whenever new rows are inserted. Keep in mind that MySQL does not allow a trigger (or a procedure it calls) to run statements that cause an implicit commit, such as CREATE TABLE, so the scheduled event above is usually the more dependable option.
DELIMITER $$
CREATE TRIGGER transfer_clean_data
AFTER INSERT ON ushouseholdincome
FOR EACH ROW
BEGIN
CALL Copy_and_Clean_Data();
END $$
DELIMITER ;
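To confirm the trigger was registered, or to remove it if you decide to rely on the scheduled event alone, the standard statements are:
SHOW TRIGGERS;
DROP TRIGGER IF EXISTS transfer_clean_data;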
Outcome
By automating the cleaning and standardization of this dataset, we ensure reliable, consistent, and analysis-ready income data. This modular SQL pipeline can be adapted to other datasets that require regular hygiene and structure enforcement.