Welcome to datafuzz’s documentation!¶
Motivation¶
The goal of datafuzz is to give you the ability to test your data science code and models with BAD data. Why? Because sometimes your code will see bad data, especially if you are running it in production. datafuzz is motivated by the idea that testing data pipelines, data science code and production-facing models should involve some elements of fuzzing, that is, introducing bad and random data to determine possible security and crash risks.
Features¶
- Transform your data by adding noise to a subset of your rows
- Duplicate data to test your duplication handling
- Generate synthetic data for use in your testing suite
- Insert random “dumb” fuzzing strategies to see how your tools cope with bad data
- Seamlessly handle common input and output types including CSV, JSON, SQL, NumPy and pandas
Installation¶
Stable release¶
To install datafuzz, run this command in your terminal:
$ pip install datafuzz
This is the preferred method to install datafuzz, as it will always install the most recent stable release.
If you don’t have pip installed, the Python installation guide can walk you through the process.
From sources¶
The sources for datafuzz can be downloaded from the GitHub repo.
You can either clone the public repository:
$ git clone https://github.com/kjam/datafuzz
Or download the tarball:
$ curl -OL https://github.com/kjam/datafuzz/tarball/master
Once you have a copy of the source, you can install it with:
$ python setup.py install
Quickstart¶
Want to get started right away? Here is a tutorial on using datafuzz that takes five minutes or less.
Defining YAML Strategies¶
One of the easiest ways to get started using datafuzz is to define strategies in YAML format given a dataset you already have. Let’s take a look at an example YAML file:
data:
  input: 'file://datafuzz/examples/data/sales_data.csv'
  output: 'file://datafuzz/examples/data/sales_data_with_dupes.csv'
strategies:
  - type: 'duplication'
    percentage: 10
    add_noise: 1
In this file, you set up the data input and output, which are both CSVs. Then you list the strategies to apply, which is only one item in this case. It calls for duplicating 10% of the rows and adding random noise to those duplicate rows.
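To make the effect of this strategy concrete, here is a minimal standard-library sketch of "duplicate 10% of rows and add noise to the copies". It is an illustration only, not datafuzz's implementation: the function name duplicate_with_noise, the column names, the noise range of plus or minus 5% and the sample rows are all assumptions made for this example.

```python
import random


def duplicate_with_noise(rows, percentage, noise_col, seed=0):
    """Return rows plus noisy copies of `percentage` percent of them.

    Illustration only: this is not how datafuzz is implemented.
    """
    rng = random.Random(seed)
    n_dupes = len(rows) * percentage // 100
    dupes = []
    for row in rng.sample(rows, n_dupes):
        noisy = dict(row)  # copy, so the original row is left untouched
        # Perturb one numeric column by up to +/-5% (an arbitrary choice).
        noisy[noise_col] = round(noisy[noise_col] * rng.uniform(0.95, 1.05), 2)
        dupes.append(noisy)
    return rows + dupes


# Hypothetical rows standing in for the contents of a sales CSV.
sales = [{"order_id": i, "amount": 10.0 + i} for i in range(2000)]
fuzzed = duplicate_with_noise(sales, percentage=10, noise_col="amount")
print(len(sales), len(fuzzed))  # prints: 2000 2200
```

With 2,000 input rows and a percentage of 10, the result holds the original rows followed by 200 near-duplicates, which is the kind of data a deduplication step should be able to cope with.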
We can now run these transformations from the command line.
$ datafuzz run datafuzz/examples/read_csv_and_dupe.yaml
When complete, we can check the difference in number of lines for the files.
$ wc -l datafuzz/examples/data/sales_data*
2001 datafuzz/examples/data/sales_data.csv
2201 datafuzz/examples/data/sales_data_with_dupes.csv
4202 total
That’s it! For more information on all available strategies, check out Strategies.
Generation and Noise in Jupyter¶
Are you using Jupyter notebooks for your development? datafuzz can easily integrate with your workflow.
To get started, take a look at the example notebooks in the repository.
Generating Synthetic Data¶
Generating synthetic data to use is easy as Py with datafuzz. A schema definition can be declared using simple YAML:
num_rows: 200
output: 'file:///tmp/iot.csv'
timeseries:
  start_time: 2015-01-01T00:00:00
schema:
  username: faker.user_name
  temperature: range(5,30)
  heartrate: range(60,90)
  build: faker.uuid4
  latest: [0, 0, 1]
  note: ['interval', 'sleep', 'wake', 'update', 'user', 'test', 'n/a']
This file declares the settings used for generation: the number of rows to generate (num_rows), timeseries information (which is only required if you want a timeseries to be generated) and the schema for each row. You can use ranges, aranges, lists or faker providers (see the faker provider documentation).
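As a rough mental model of how a schema like this turns into rows, the sketch below draws one value per column for each row using only the standard library. The sampling rules are assumptions for illustration (a uniform pick from each range or list), and the username and build generators are simple stand-ins for the faker providers; none of this is taken from datafuzz's source.

```python
import random
import uuid

rng = random.Random(42)

# Each column maps to a zero-argument callable that produces one value.
# The sampling rules below are assumptions made for this sketch.
schema = {
    # Stand-in for faker.user_name
    "username": lambda: "user{:04d}".format(rng.randrange(10000)),
    "temperature": lambda: rng.choice(range(5, 30)),
    "heartrate": lambda: rng.choice(range(60, 90)),
    # Stand-in for faker.uuid4
    "build": lambda: str(uuid.uuid4()),
    # Repeating 0 in the list makes it twice as likely as 1.
    "latest": lambda: rng.choice([0, 0, 1]),
    "note": lambda: rng.choice(
        ["interval", "sleep", "wake", "update", "user", "test", "n/a"]
    ),
}


def generate_rows(schema, num_rows):
    """Build num_rows dicts by calling every column generator once per row."""
    return [{col: make() for col, make in schema.items()} for _ in range(num_rows)]


rows = generate_rows(schema, num_rows=200)
print(len(rows), sorted(rows[0]))
```

Listing a value more than once, as in the latest column, is a simple way to weight a choice without a separate probability setting.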
To generate the data, run the following from the command line:
$ datafuzz generate datafuzz/examples/yaml_files/iot_schema.yaml
To see our generated data, we can peek at the output file:
$ head -n 5 /tmp/iot.csv
build,temperature,username,heartrate,latest,note
59803106-7fa4-5fe3-2ad8-0e962c4e5666,13,rramirez,86,0,n/a
d865fbc7-d43a-e001-ea67-d1892c26aa41,26,kristi42,72,0,n/a
535f4f08-ca2b-c418-081b-bc8e572087e9,7,jacksonterri,88,1,n/a
69e2796f-f2a2-f139-1b06-cbc500cb387b,6,eerickson,75,0,wake
Frequently Asked Questions¶
Why would I want to mess up my dataset?
datafuzz is not to be used for every data science problem, but there are several where adding noise, nulls, fuzz or duplication can help you test and determine the resiliency of your model, pipeline or data processing script. It is built with these use cases in mind, so you can break your code before someone else does (intentionally or otherwise).
Why not use Hypothesis?
Why use just one? Hypothesis is a great tool, which I recommend to all data scientists for property-based testing. But Hypothesis is not a tool for adding noise to an already compiled (or synthetically compiled) dataset. Hypothesis can be used to generate a series of property-defined examples for your pipeline or use case; however, if you want to test for unexpected types or for realistic-looking noise using your already defined dataset, it becomes difficult and cumbersome. This is one of the reasons I originally built datafuzz, and it is why I think it is useful to have more than one tool for your data science testing needs.
Why doesn’t datafuzz have X feature?
It’s likely I didn’t think it should be included in the initial scope. That said, I am all for determining good future features and welcome well-described and simple requests (as well as pull requests!). Head on over to the GitHub Issues to see if the feature is already in the works or open a new Issue to start the conversation. For more details on contributing, see the contributing guide.
What is fuzzing? Why use it in data science?
Fuzz testing feeds bad or malicious inputs to a program and determines whether it crashes or raises errors. It is often used in the security community to investigate potential risks like buffer or stack overflows.
Why use fuzzing for data science? Like software, data science code is often exposed to user input or outside APIs. Because of this, it is vulnerable to some of the same issues and attacks seen in web services. Even if the bad data is not malicious and is simply created by a bug in an internal system, we should test how the data science application, code or model behaves when given corrupt, noisy or duplicate data. Even if the expected behavior is a crash, we should know and test that in advance. This also helps determine whether your extraction (ETL) or processing code (pipelines and workflows) addresses the issues or raises warnings when it sees unexpected values.
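A minimal sketch of this idea in plain Python follows. The pipeline step parse_temperature, the list of fuzz inputs and the fuzz helper are all hypothetical names invented for this example; the point is only to show the pattern of deciding in advance which failures are acceptable and flagging everything else.

```python
import math


def parse_temperature(raw):
    """Hypothetical pipeline step: convert a raw reading to a finite float."""
    value = float(raw)  # raises TypeError or ValueError on junk input
    if not math.isfinite(value):
        raise ValueError("non-finite temperature: {!r}".format(raw))
    return value


# A few "dumb" fuzz inputs: wrong types, empty values and odd strings.
FUZZ_INPUTS = ["", "  ", None, "NaN", "inf", "12,5", [], "1e400"]


def fuzz(func, inputs, allowed=(ValueError, TypeError)):
    """Call func on each input and collect exceptions outside `allowed`."""
    unexpected = []
    for raw in inputs:
        try:
            func(raw)
        except allowed:
            pass  # a documented, expected failure
        except Exception as exc:
            unexpected.append((raw, exc))
    return unexpected


print(fuzz(parse_temperature, FUZZ_INPUTS))  # prints: []
```

An empty result means every bad input failed in one of the expected ways. Any entry in the list points to an input that the code handles in a way nobody planned for.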