Created: February 12, 2020 / Updated: February 15, 2020 / Status: finished / 2 min read (~275 words)
I want my clients to be able to share confidential data with me without revealing the exact values, so that I can train machine learning models on it.
I wrote a simple Python package that uses pandas and scikit-learn to apply a few transforms to the data. Some of these transforms change the distribution of the data, and therefore its statistical properties, while others preserve the distribution and simply rescale the domain.
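To make the distinction between the two kinds of transforms concrete, here is a minimal sketch. It is not the package's actual code: the data, the column names, and the choice of `MinMaxScaler` and `QuantileTransformer` are assumptions picked for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer

# Hypothetical confidential data; column names and values are made up.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.lognormal(mean=10, sigma=1, size=1000),
    "headcount": rng.integers(1, 500, size=1000).astype(float),
})

# Distribution-preserving: a linear rescale to [0, 1] hides units and
# magnitudes but keeps the shape of each column (e.g. its skewness).
rescaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Distribution-changing: map each column onto uniform quantiles, which
# keeps the ordering of the values but discards the original shape.
quantile = QuantileTransformer(
    n_quantiles=100, output_distribution="uniform", random_state=0
)
uniformized = pd.DataFrame(quantile.fit_transform(df), columns=df.columns)

# Optionally replace the column names with neutral ones before sharing.
shared = rescaled.copy()
shared.columns = [f"feature_{i}" for i in range(shared.shape[1])]

print(df["revenue"].skew(), rescaled["revenue"].skew(), uniformized["revenue"].skew())
```

The first two printed skewness values should match, since a linear rescale does not change the shape of the distribution, while the third should be close to zero because the quantile transform flattens the column into a roughly uniform distribution.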
Given a dataset anonymized with this tool, it is possible to run a preliminary data audit and possibly train machine learning models on it. This gives clients a quick idea of whether their data looks promising without revealing the true numbers (unless they choose to).
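As a rough sketch of what such an audit could look like on a rescaled dataset, the example below checks summary statistics and correlations, then cross-validates a simple model. The synthetic data, the neutral column names (`f0`, `f1`, `y`), and the use of `MinMaxScaler` and `LinearRegression` are all assumptions made for the example, not the tool's actual workflow.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw data that would stay on the client's side.
rng = np.random.default_rng(1)
raw = pd.DataFrame({
    "x1": rng.normal(50, 10, 500),
    "x2": rng.normal(1000, 200, 500),
})
raw["target"] = 3 * raw["x1"] - 0.5 * raw["x2"] + rng.normal(0, 5, 500)

# What I would receive: values rescaled to [0, 1] and columns renamed.
anonymized = pd.DataFrame(
    MinMaxScaler().fit_transform(raw), columns=["f0", "f1", "y"]
)

# Preliminary audit: summary statistics and pairwise correlations.
print(anonymized.describe())
print(anonymized.corr())

# Quick signal check: cross-validated R^2 of a simple linear model.
scores = cross_val_score(
    LinearRegression(), anonymized[["f0", "f1"]], anonymized["y"],
    cv=5, scoring="r2",
)
print(scores.mean())
```

Pearson correlations are unchanged by a positive linear rescale, so the correlation matrix of the anonymized data matches the one the client would compute on the raw values. That is the property that makes this kind of audit meaningful without access to the true numbers.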
The main concern with this approach is that most clients are not technical, so having them anonymize their data themselves is difficult, if not impossible. As a result, the tool is currently not applicable in the context it was designed for.