My QA Projects

QA projects I was involved in.

1. Load the data

import pandas as pd
df = pd.read_csv("data/2015.csv")

2. Initial Exploration

print(df.head())  # View the first few rows
print(df.columns)  # Get column names
df.info()  # Summary of data types and missing values (prints directly, returns None)
print(df.describe())  # Summary statistics

3. Handle Missing Values

print(df.isnull().sum())  # Count missing values per column

Choose a strategy:

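# Run only one of these; once the first has run, the others have nothing left to fill.
df.dropna(inplace=True)  # Option A: drop rows that contain any missing value
df.fillna(0, inplace=True)  # Option B: replace missing values with 0
df.ffill(inplace=True)  # Option C: forward-fill (fillna(method='ffill') is deprecated)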
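
As a rough illustration of how the three strategies differ, the sketch below applies each one to its own copy of a small made-up frame. The column names `city` and `score` are hypothetical and are not taken from 2015.csv.

```python
import numpy as np
import pandas as pd

# Toy data; the column names are hypothetical
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "score": [1.0, np.nan, 3.0, np.nan],
})

dropped = toy.dropna()        # keeps only rows without missing values
zero_filled = toy.fillna(0)   # missing scores become 0
forward_filled = toy.ffill()  # missing scores copy the previous row's value

print(len(dropped))
print(zero_filled["score"].tolist())
print(forward_filled["score"].tolist())
```

Dropping loses rows, filling with 0 can distort averages, and forward-filling only makes sense when row order is meaningful, so the right choice depends on the column.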

4. Data Type Conversion

# 'date' to datetime
df['date'] = pd.to_datetime(df['date'])

# 'price' to numeric; values that cannot be parsed become NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')
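
To see what `errors='coerce'` does in practice, here is a small sketch on made-up values (the strings below are hypothetical, not from the real file). Unparseable entries turn into NaN or NaT instead of raising, so it is worth counting them afterwards.

```python
import pandas as pd

# Hypothetical raw strings as they might come out of a CSV
raw_prices = pd.Series(["10.5", "7", "n/a", "abc"])
raw_dates = pd.Series(["2015-01-31", "not a date"])

prices = pd.to_numeric(raw_prices, errors="coerce")  # bad values -> NaN
dates = pd.to_datetime(raw_dates, errors="coerce")   # bad values -> NaT

bad_prices = int(prices.isna().sum())
bad_dates = int(dates.isna().sum())
print(bad_prices, bad_dates)
```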

5. Duplicates

print(df.duplicated().sum())  # Count fully duplicated rows
df.drop_duplicates(inplace=True)  # Keep the first occurrence of each duplicated row

6. Outlier Handling

(I should probably use Jupyter here; I am still experimenting)
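
One possible starting point, not necessarily the approach this project will end up with, is the interquartile range (IQR) rule: flag values more than 1.5 × IQR outside the quartiles. The sketch assumes a numeric `price` column and uses made-up values.

```python
import pandas as pd

# Made-up values; 'price' is a hypothetical column
values = pd.DataFrame({"price": [10, 12, 11, 13, 12, 95]})

q1 = values["price"].quantile(0.25)
q3 = values["price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# True for rows that fall outside the [lower, upper] fences
is_outlier = ~values["price"].between(lower, upper)
filtered = values[~is_outlier]

print(values[is_outlier])
```

Flagging outliers first and deciding later whether to drop, cap, or keep them tends to be safer than deleting them straight away.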

7. Standardization/Normalization
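
This step has no code yet. A minimal sketch in plain pandas, assuming a numeric `price` column (hypothetical values), could look like this: z-score standardization rescales to mean 0 and standard deviation 1, while min-max normalization rescales to the range 0 to 1.

```python
import pandas as pd

# Made-up values; 'price' is a hypothetical column
data = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0]})
col = data["price"]

# Standardization (z-score), using the population standard deviation
data["price_z"] = (col - col.mean()) / col.std(ddof=0)

# Normalization (min-max) to the range [0, 1]
data["price_minmax"] = (col - col.min()) / (col.max() - col.min())

print(data)
```

Min-max scaling divides by zero when every value in the column is identical, so a constant column would need special handling.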

8. Encoding Categorical Features
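
This step is also still empty. As a sketch under the assumption that the data has a text column such as `region` (a hypothetical name), two common options are one-hot encoding with `pd.get_dummies` and integer codes via the `category` dtype.

```python
import pandas as pd

# Made-up values; 'region' is a hypothetical column
data = pd.DataFrame({"region": ["north", "south", "north", "east"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(data, columns=["region"], prefix="region")

# Integer encoding: categories are numbered in sorted order
data["region_code"] = data["region"].astype("category").cat.codes

print(one_hot.columns.tolist())
print(data)
```

Integer codes imply an ordering that may not exist, so one-hot encoding is usually the safer default for unordered categories.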

9. Data Validation
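
No checks are written for this step yet. A simple sketch of what validation could cover is shown below; the `date` and `price` columns and the rule that prices must not be negative are assumptions for illustration.

```python
import pandas as pd

# Made-up cleaned data; column names and rules are hypothetical
cleaned = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-01", "2015-01-02"]),
    "price": [9.99, 12.50],
})

problems = []
if cleaned.isnull().any().any():
    problems.append("missing values remain")
if cleaned.duplicated().any():
    problems.append("duplicate rows remain")
if not pd.api.types.is_datetime64_any_dtype(cleaned["date"]):
    problems.append("'date' is not a datetime column")
if (cleaned["price"] < 0).any():
    problems.append("negative prices found")

print(problems or "all checks passed")
```

Collecting the problems in a list instead of stopping at the first one gives a full picture of what is still wrong before the file is saved.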

10. Save the cleaned data

# The cleaned_data/ directory must already exist; to_csv does not create it
df.to_csv("cleaned_data/2015_cleaned.csv", index=False)