Open-Source Data Quality Test Framework for PySpark by Tomer Gabay (Aug, 2024)

SeniorTechInfo

Measure and report your data quality with ease

Every data scientist knows the classic saying “garbage in, garbage out”. Therefore, it is essential to measure the quality of your data.

Woonstad Rotterdam, a Dutch social housing association, uses PySpark in Databricks for its ETL process. Data from external software suppliers is loaded into its data lake through APIs. Not all suppliers test their data quality, however, and faulty data can have severe consequences in the social housing sector: tenants may be unable to apply for allowances, or rents may be set at prices that violate the Affordable Rent Act. To address this, Woonstad Rotterdam developed a data quality test framework for PySpark DataFrames that reports data quality to both suppliers and data users.
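
To make the idea concrete, here is a minimal sketch of what a single data quality check on a PySpark DataFrame might look like. It is not the API of Woonstad Rotterdam's framework: the function names, the `rent` column, the `rent__valid` flag column, and the `max_rent` threshold of 1000.0 are all hypothetical and chosen only for illustration. The rule is written once in plain Python, so it can be unit-tested without Spark, and once as a Spark column expression.

```python
from typing import Optional


def rent_is_valid(rent: Optional[float], max_rent: float) -> bool:
    """Row-level rule: rent must be present, positive, and at most max_rent."""
    return rent is not None and 0 < rent <= max_rent


def add_rent_check(df, column: str = "rent", max_rent: float = 1000.0):
    """Append a boolean `<column>__valid` column to a PySpark DataFrame.

    Same rule as rent_is_valid, written as a Spark column expression.
    The default max_rent is a made-up illustrative value, not a legal limit.
    """
    # Imported here so the plain-Python rule above works without Spark.
    from pyspark.sql import functions as F

    value = F.col(column)
    # For a null rent, isNotNull() is false, so the whole conjunction is false.
    is_valid = value.isNotNull() & (value > 0) & (value <= max_rent)
    return df.withColumn(f"{column}__valid", is_valid)


def valid_share(df, flag_column: str = "rent__valid") -> Optional[float]:
    """Fraction of rows passing the check (None for an empty DataFrame)."""
    from pyspark.sql import functions as F

    return df.agg(F.avg(F.col(flag_column).cast("int"))).first()[0]
```

Keeping the result as an extra boolean column, rather than filtering out bad rows, means the failing records stay available and can be passed back to the supplier together with a summary figure such as `valid_share(add_rent_check(df))`.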
