A free, open-source Python tool that automatically detects and fixes the most common data quality issues in CSV files — so you can spend less time cleaning data and more time analyzing it.
Every data analyst knows the pain: you get a CSV export from a CRM, ERP, or spreadsheet, and before you can do anything useful with it, you spend hours fixing the same issues over and over — duplicate rows, blank fields, inconsistent date formats, text in number columns, trailing spaces, and mixed-case entries.
A commonly cited estimate is that data professionals spend up to 80% of their time on data preparation. Much of that time goes to the same repetitive cleaning tasks, which can be automated.
This tool handles those common issues automatically. Run it on any CSV file and get a clean, validated output — plus a detailed report of everything that was found and fixed.
Identifies and removes exact duplicate rows. Also flags near-duplicates based on key columns you specify — like matching names with different phone numbers.
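As a rough illustration, exact and near-duplicate detection could be sketched with the standard library as follows. The function names `drop_exact_duplicates` and `flag_near_duplicates` and the sample columns are hypothetical and not taken from the tool itself. Near-duplicates are assumed here to mean rows whose key columns match after trimming and lowercasing.

```python
from collections import defaultdict


def drop_exact_duplicates(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the whole row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def flag_near_duplicates(rows, key_columns):
    """Group row indexes that share the same normalized key-column values."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        key = tuple(row.get(col, "").strip().lower() for col in key_columns)
        groups[key].append(index)
    # Only groups with more than one row are potential near-duplicates.
    return [indexes for indexes in groups.values() if len(indexes) > 1]


# Hypothetical sample data for illustration only.
rows = [
    {"name": "Ann Lee", "phone": "555-0100"},
    {"name": "Ann Lee", "phone": "555-0100"},   # exact duplicate
    {"name": "ann lee ", "phone": "555-0199"},  # same name, different phone
    {"name": "Bob Roy", "phone": "555-0101"},
]
unique_rows = drop_exact_duplicates(rows)
near_duplicates = flag_near_duplicates(unique_rows, ["name"])
```

In this sketch near-duplicates are only flagged, not removed, since deciding which record to keep usually needs a human.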
Detects blank cells and empty strings. Offers configurable strategies: drop rows, fill with defaults, fill with column mean/median, or forward-fill from previous rows.
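The fill strategies can be sketched for a single column of string values as shown below. The strategy name `fill_median` matches the example config, while `fill_default` and `forward_fill` are assumed names. Dropping rows is left out because it works on whole rows rather than one column.

```python
from statistics import median


def is_missing(value):
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def fill_missing(values, strategy="fill_median", default=""):
    """Return a new list with missing entries handled per the chosen strategy."""
    if strategy == "fill_default":
        return [default if is_missing(v) else v for v in values]
    if strategy == "fill_median":
        # Assumes every non-missing value in the column parses as a number.
        present = [float(v) for v in values if not is_missing(v)]
        fill = median(present) if present else default
        return [fill if is_missing(v) else float(v) for v in values]
    if strategy == "forward_fill":
        filled, last = [], default  # leading gaps fall back to the default
        for v in values:
            if not is_missing(v):
                last = v
            filled.append(last)
        return filled
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, `fill_missing(["10", "", "30", " ", "20"])` replaces both blanks with the median of the three present values.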
Checks that numbers are actually numbers, dates are valid dates, and emails match expected formats. Flags mismatched types with row and column references.
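A minimal sketch of this kind of type check is shown below. It assumes the expected type of each column is known up front, and the names (`find_type_errors`, `VALIDATORS`) are illustrative. The email pattern is deliberately loose and the date check assumes `YYYY-MM-DD`, so the tool's real rules may differ.

```python
import re
from datetime import datetime

# Deliberately simple pattern: one "@", no spaces, and a dot in the domain.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def is_date(text, fmt="%Y-%m-%d"):
    try:
        datetime.strptime(text, fmt)  # also rejects impossible dates like Feb 30
        return True
    except ValueError:
        return False


def is_email(text):
    return bool(EMAIL_PATTERN.match(text))


VALIDATORS = {"number": is_number, "date": is_date, "email": is_email}


def find_type_errors(rows, expected_types):
    """Return (row_number, column, value) for each non-empty cell that fails its check."""
    errors = []
    for row_number, row in enumerate(rows, start=1):
        for column, type_name in expected_types.items():
            value = row.get(column, "")
            if value != "" and not VALIDATORS[type_name](value):
                errors.append((row_number, column, value))
    return errors
```

Empty cells are skipped here on the assumption that missing values are handled by a separate step.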
Strips leading/trailing whitespace, standardizes case (upper, lower, title), removes non-printable characters, and normalizes Unicode — so "New York", "new york", and " NEW YORK " all become the same value.
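The cleanup steps described above could look roughly like this with Python's built-in `unicodedata` module. The function name `standardize_text` and the choice of NFKC normalization are assumptions made for this sketch.

```python
import unicodedata


def standardize_text(value, case="title"):
    """Normalize Unicode, drop non-printable characters, trim, and apply a case style."""
    # NFKC folds compatibility characters, e.g. a non-breaking space becomes a plain space.
    text = unicodedata.normalize("NFKC", value)
    # Remove non-printable characters (control codes, zero-width spaces) but keep whitespace.
    text = "".join(ch for ch in text if ch.isprintable() or ch.isspace())
    text = text.strip()
    if case == "upper":
        return text.upper()
    if case == "lower":
        return text.lower()
    if case == "title":
        return text.title()
    return text


variants = ["New York", "new york", " NEW YORK "]
cleaned = {standardize_text(v) for v in variants}
```

Note that `str.title()` capitalizes after apostrophes as well, so a value such as "they're" becomes "They'Re". A production tool would likely need a more careful title-casing rule.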
Detects mixed date formats (MM/DD/YYYY vs. DD-MM-YYYY) and standardizes them. Fixes number formatting issues like commas as decimal separators or currency symbols in numeric columns.
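One way to approach both fixes is sketched below, under stated assumptions. The list of candidate date formats, the rule that a lone comma is a decimal separator, and the function names are all choices made for this example and are not confirmed behavior of the tool.

```python
import re
from datetime import datetime

# Candidate input formats, tried in order. These three never overlap; adding
# "%d/%m/%Y" would make values like 03/04/2024 ambiguous, and the first match would win.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")


def standardize_date(text, output_format="%Y-%m-%d"):
    """Return the date in the output format, or None if no candidate format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime(output_format)
        except ValueError:
            continue
    return None  # unrecognized: leave for manual review


def parse_number(text):
    """Parse strings like '$1,234.50' or '1.234,50 €' into a float."""
    cleaned = re.sub(r"[^\d,.\-]", "", text)  # drop currency symbols and spaces
    if "," in cleaned and "." in cleaned:
        # The separator that appears last is taken to be the decimal mark.
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif "," in cleaned:
        # Assumption: a lone comma is a decimal separator (e.g. '3,14').
        cleaned = cleaned.replace(",", ".")
    return float(cleaned)  # raises ValueError if the value is still not numeric
```

A value such as "1,234" is genuinely ambiguous under the lone-comma rule, which is the kind of case a report would flag for manual review.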
Generates a detailed report showing every issue found, what was fixed, and what needs manual review — with row numbers and column names for easy reference.
Run the script from the command line or import it into your own Python code. Pass in your CSV file path and an optional config file to customize which checks to run.
python csv_cleaner.py sales_data.csv --output cleaned_sales.csv
The tool scans every column and row, profiling data types, checking for patterns, and applying fixes based on your configuration. Default settings handle the most common issues out of the box.
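Column profiling of this kind can be approximated by labeling each cell and treating the most common label as the column's type. The sketch below is one possible approach, with hypothetical names (`classify`, `profile_column`) and a single assumed date format.

```python
from collections import Counter
from datetime import datetime


def classify(value):
    """Label one cell as 'empty', 'number', 'date', or 'text'."""
    text = value.strip()
    if not text:
        return "empty"
    try:
        float(text)
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def profile_column(values):
    """Return the dominant non-empty type and the row numbers that deviate from it."""
    labels = [classify(v) for v in values]
    counts = Counter(label for label in labels if label != "empty")
    if not counts:
        return "empty", []
    dominant = counts.most_common(1)[0][0]
    outliers = [
        row_number
        for row_number, label in enumerate(labels, start=1)
        if label not in ("empty", dominant)
    ]
    return dominant, outliers
```

Calling `profile_column(["10", "12.5", "", "abc", "7"])` treats the column as numeric and reports row 4 as the outlier.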
# Example config options
{
  "remove_duplicates": true,
  "handle_missing": "fill_median",
  "standardize_dates": "YYYY-MM-DD",
  "strip_whitespace": true,
  "validate_emails": true,
  "case_format": "title"
}
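A config like this could be merged over built-in defaults along the following lines. The default values simply mirror the example above, and `load_config` is a hypothetical helper, not the tool's documented API.

```python
import json

# Defaults mirror the example config; the values the tool really ships with are an assumption.
DEFAULT_CONFIG = {
    "remove_duplicates": True,
    "handle_missing": "fill_median",
    "standardize_dates": "YYYY-MM-DD",
    "strip_whitespace": True,
    "validate_emails": True,
    "case_format": "title",
}


def load_config(text=None):
    """Merge user-supplied JSON over the defaults, rejecting unknown keys."""
    config = dict(DEFAULT_CONFIG)
    if text:
        overrides = json.loads(text)
        unknown = set(overrides) - set(DEFAULT_CONFIG)
        if unknown:
            raise KeyError(f"unknown config options: {sorted(unknown)}")
        config.update(overrides)
    return config


config = load_config('{"case_format": "lower", "validate_emails": false}')
```

Standard JSON does not allow comments, so a line such as `# Example config options` has to be left out of the actual file before it is parsed this way.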
You get two files: a cleaned CSV ready for analysis, and a validation report summarizing every issue that was found and how it was handled.
--- Validation Report ---
Rows processed: 12,847
Duplicates removed: 342
Missing values fixed: 1,203
Type errors flagged: 89
Text standardized: 4,521
Date formats fixed: 2,104
Output saved to: cleaned_sales.csv
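A summary in this layout could be produced from a dictionary of counters, as in the sketch below. The function name `format_report` and the idea that the counts live in a plain dictionary are assumptions for illustration.

```python
def format_report(counts, output_path):
    """Render issue counts in the report layout, with thousands separators."""
    lines = ["--- Validation Report ---"]
    for label, value in counts.items():
        lines.append(f"{label}: {value:,}")
    lines.append(f"Output saved to: {output_path}")
    return "\n".join(lines)


# Figures taken from the sample report above.
report = format_report(
    {"Rows processed": 12847, "Duplicates removed": 342},
    "cleaned_sales.csv",
)
```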
Clean messy CSV exports from CRMs, ERPs, and databases before loading into Excel, Google Sheets, or BI tools.
Fix customer lists, product catalogs, and transaction exports without writing any Python yourself or hiring a developer.
Use as a preprocessing step in ETL pipelines to validate and clean incoming CSV data before loading into databases.
Quickly clean survey data, research datasets, and public data downloads for academic projects and analysis.