Welcome to datafuzz’s documentation!

Motivation

The goal of datafuzz is to let you test your data science code and models with BAD data. Why? Because your code will sometimes see bad data, especially if you are running it in production. datafuzz is motivated by the idea that testing data pipelines, data science code and production-facing models should involve some fuzzing: introducing bad and random data to uncover possible security and crash risks.

Features

Transform your data by adding noise to a subset of your rows
Duplicate data to test your duplication handling
Generate synthetic data for use in your testing suite
Insert random “dumb” fuzzing strategies to see how your tools cope with bad data
Seamlessly handle normal input and output types including CSVs, JSON, SQL, numpy and pandas

Installation

Stable release

To install datafuzz, run this command in your terminal:

$ pip install datafuzz

This is the preferred method to install datafuzz, as it will always install the most recent stable release.

If you don’t have pip installed, this Python installation guide can walk you through the process.

From sources

The sources for datafuzz can be downloaded from the GitHub repo.

You can either clone the public repository:

$ git clone https://github.com/kjam/datafuzz.git

Or download the tarball:

$ curl -OL https://github.com/kjam/datafuzz/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install

Quickstart

Want to get started right away? Here is a tutorial on using datafuzz that takes five minutes or less.

Defining YAML Strategies

One of the easiest ways to get started using datafuzz is to define strategies in YAML format given a dataset you already have. Let’s take a look at an example YAML file:

data:
    input: 'file://datafuzz/examples/data/sales_data.csv'
    output: 'file://datafuzz/examples/data/sales_data_with_dupes.csv'
strategies:
    - type: 'duplication'
      percentage: 10
      add_noise: 1

In this file, you set up the data input and output, both of which are CSV files. You then list the strategies to apply; there is only one in this case. It duplicates 10% of the rows and adds random noise to the duplicated rows.
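
To make the idea concrete, here is a minimal standard-library sketch of what "duplicate 10% of rows and add noise" could look like. It is an illustration only, not datafuzz's implementation: the function name `duplicate_with_noise`, the choice to jitter only float fields, and the plus or minus 5% noise range are all assumptions made for this example.

```python
import random


def duplicate_with_noise(rows, percentage=10, add_noise=True, seed=None):
    """Return the rows plus noisy copies of roughly `percentage` percent of them.

    Hypothetical helper for illustration; not part of datafuzz.
    """
    rng = random.Random(seed)
    num_dupes = max(1, len(rows) * percentage // 100) if rows else 0
    dupes = []
    for row in rng.sample(rows, num_dupes):
        copy = dict(row)  # copy so the original record stays untouched
        if add_noise:
            for key, value in copy.items():
                if isinstance(value, float):
                    # Assumed noise model: jitter floats by up to +/-5%.
                    copy[key] = value * (1 + rng.uniform(-0.05, 0.05))
        dupes.append(copy)
    return rows + dupes


sales = [{"id": i, "amount": 100.0 + i} for i in range(50)]
fuzzed = duplicate_with_noise(sales, percentage=10, seed=42)
print(len(sales), len(fuzzed))  # 50 55
```

Because the duplicates carry slightly different amounts, an exact-match deduplication step would miss them, which is the kind of weakness this strategy is meant to expose.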

We can now run these transformations from the command line.

$ datafuzz run datafuzz/examples/read_csv_and_dupe.yaml

When complete, we can compare the number of lines in the two files.

$ wc -l datafuzz/examples/data/sales_data*

datafuzz/examples/data/sales_data.csv
datafuzz/examples/data/sales_data_with_dupes.csv
total
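
If you would rather check the result from Python, the sketch below counts CSV records (not raw lines, so quoted fields containing newlines are handled) and reports the growth as a percentage. The helper names `count_data_rows` and `percent_added` are hypothetical, and the demo writes two throwaway files that stand in for the real input and output.

```python
import csv
import os
import tempfile


def count_data_rows(path):
    """Count CSV records in `path`, excluding the header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header
        return sum(1 for _ in reader)


def percent_added(original_path, fuzzed_path):
    """Return how many percent more records the fuzzed file holds."""
    original = count_data_rows(original_path)
    fuzzed = count_data_rows(fuzzed_path)
    return 100.0 * (fuzzed - original) / original


# Demo with two temporary files standing in for the real input and output.
with tempfile.TemporaryDirectory() as tmp:
    original_path = os.path.join(tmp, "original.csv")
    fuzzed_path = os.path.join(tmp, "fuzzed.csv")
    with open(original_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "amount"])
        writer.writerows([i, i * 10] for i in range(20))
    with open(fuzzed_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "amount"])
        writer.writerows([i, i * 10] for i in range(20))
        writer.writerows([i, i * 10] for i in range(2))  # two duplicates
    growth = percent_added(original_path, fuzzed_path)

print(growth)  # 10.0
```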

That’s it! For more information on all available strategies, check out Strategies.

Generation and Noise in Jupyter

Are you using Jupyter notebooks for your development? datafuzz can easily integrate with your workflow.

To get started, take a look at the example notebooks in the repository.

Generating Synthetic Data

Generating synthetic data is easy as Py with datafuzz. You can declare a schema in simple YAML:

num_rows: 200
output: 'file:///tmp/iot.csv'
timeseries:
    start_time: 2015-01-01T00:00:00
schema:
    username: faker.user_name
    temperature: range(5,30)
    heartrate: range(60,90)
    build: faker.uuid4
    latest: [0, 0, 1]
    note: ['interval', 'sleep', 'wake', 'update', 'user', 'test', 'n/a']

This file declares the number of rows to generate (num_rows), the timeseries settings (only required if you want a timeseries to be generated) and the schema for each row. In the schema, you can use ranges, aranges, lists or faker providers (see the faker provider documentation).
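
As a rough mental model of how such a schema turns into rows, here is a standard-library sketch. It is not how datafuzz generates data: the `generate_rows` function, the `timestamp` column name and the one-minute step are assumptions, the faker providers are replaced with simple stand-ins, and treating a list as "pick one element uniformly" (so `[0, 0, 1]` yields 0 about twice as often as 1) is an interpretation rather than documented behavior.

```python
import random
import uuid
from datetime import datetime, timedelta

# Callables stand in for faker providers; sequences are sampled uniformly.
SCHEMA = {
    "username": lambda rng: "user{}".format(rng.randint(1, 9999)),
    "temperature": range(5, 30),
    "heartrate": range(60, 90),
    "build": lambda rng: str(uuid.UUID(int=rng.getrandbits(128), version=4)),
    "latest": [0, 0, 1],
    "note": ["interval", "sleep", "wake", "update", "user", "test", "n/a"],
}


def generate_rows(schema, num_rows, start_time, seed=None):
    """Build `num_rows` dicts, one value per schema column (hypothetical helper)."""
    rng = random.Random(seed)
    rows = []
    for i in range(num_rows):
        # Assumed timeseries step: one row per minute from start_time.
        row = {"timestamp": (start_time + timedelta(minutes=i)).isoformat()}
        for column, spec in schema.items():
            row[column] = spec(rng) if callable(spec) else rng.choice(spec)
        rows.append(row)
    return rows


rows = generate_rows(SCHEMA, 200, datetime(2015, 1, 1), seed=1)
print(rows[0])
```

Note that in Python `range(5, 30)` excludes 30, so this sketch never produces a temperature of 30.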

To generate the data, run the following from the command line:

$ datafuzz generate datafuzz/examples/yaml_files/iot_schema.yaml

To see our generated data, we can peek at the output file:

$ head -n 5 /tmp/iot.csv

build,temperature,username,heartrate,latest,note
59803106-7fa4-5fe3-2ad8-0e962c4e5666,13,rramirez,86,0,n/a
d865fbc7-d43a-e001-ea67-d1892c26aa41,26,kristi42,72,0,n/a
535f4f08-ca2b-c418-081b-bc8e572087e9,7,jacksonterri,88,1,n/a
69e2796f-f2a2-f139-1b06-cbc500cb387b,6,eerickson,75,0,wake
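
A generated file like this is a convenient fixture for a schema check. The sketch below parses the four sample rows shown above with the standard `csv` module and verifies them against the ranges and choices from the YAML schema. The `find_problems` helper is hypothetical and written for this example only.

```python
import csv
import io

# The four sample rows shown above, copied verbatim.
SAMPLE = """\
build,temperature,username,heartrate,latest,note
59803106-7fa4-5fe3-2ad8-0e962c4e5666,13,rramirez,86,0,n/a
d865fbc7-d43a-e001-ea67-d1892c26aa41,26,kristi42,72,0,n/a
535f4f08-ca2b-c418-081b-bc8e572087e9,7,jacksonterri,88,1,n/a
69e2796f-f2a2-f139-1b06-cbc500cb387b,6,eerickson,75,0,wake
"""

ALLOWED_NOTES = {"interval", "sleep", "wake", "update", "user", "test", "n/a"}


def find_problems(row):
    """Return a list of human-readable schema violations for one row."""
    problems = []
    if int(row["temperature"]) not in range(5, 30):
        problems.append("temperature out of range")
    if int(row["heartrate"]) not in range(60, 90):
        problems.append("heartrate out of range")
    if row["latest"] not in {"0", "1"}:
        problems.append("latest is not 0 or 1")
    if row["note"] not in ALLOWED_NOTES:
        problems.append("unexpected note")
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
violations = {row["username"]: find_problems(row) for row in rows}
print(violations)
```

Every sample row passes, so each username maps to an empty list. Running the same check on a fuzzed copy of the file is one way to see whether your validation catches noisy values, or whether a call such as `int()` simply raises on them.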

Frequently Asked Questions

Why would I want to mess up my dataset?

datafuzz is not meant for every data science problem, but there are several where adding noise, nulls, fuzz or duplication can help you test and determine the resiliency of your model, pipeline or data processing script. It is built with these use cases in mind, so you can break your code before someone else does (intentionally or otherwise).

Why not use Hypothesis?

Why use just one? Hypothesis is a great tool, which I recommend to all data scientists for property-based testing. But Hypothesis is not a tool for adding noise to an already compiled (or synthetically compiled) dataset. Hypothesis can generate a series of property-defined examples for your pipeline or use case; however, if you want to test for unexpected types or for realistic-looking noise using your already defined dataset, it becomes difficult and cumbersome. This is one of the reasons I originally built datafuzz, and why I think it is useful to have more than one tool for your data science testing needs.

Why doesn’t datafuzz have X feature?

It’s likely I didn’t think it should be included in the initial scope. That said, I am all for determining good future features and welcome well-described and simple requests (as well as pull requests!). Head on over to the GitHub Issues to see if the feature is already in the works or open a new Issue to start the conversation. For more details on contributing, see the contributing guide.

What is fuzzing? Why use it in data science?

Fuzz testing feeds a program bad or malicious inputs and checks whether it crashes or raises errors. It is often used in the security community to investigate potential risks like buffer or stack overflows. Why use fuzzing for data science? Like software, data science code is often exposed to user input or outside APIs. Because of this, it is vulnerable to some of the same issues and attacks seen in web services. Even if the bad data is not malicious, or is created by a bug in an internal system, we should test how the data science application, code or model behaves when given corrupt, noisy or duplicate data. Even if the expected behavior is a crash, we should know and test that in advance. This also helps determine whether your extraction (ETL) or processing code (pipelines and workflows) addresses the issues or raises warnings when it sees unexpected values.
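
As a small illustration of the idea, the sketch below runs a toy pipeline step against a handful of bad inputs and records what happens. Everything here is hypothetical and independent of datafuzz: `mean_temperature` stands in for your own code, and `BAD_INPUTS` and `fuzz` are names invented for this example.

```python
def mean_temperature(readings):
    """Toy pipeline step: average a list of numeric readings."""
    return sum(readings) / len(readings)


# A few hand-picked bad inputs; a real fuzzer would generate many more.
BAD_INPUTS = {
    "empty": [],
    "contains None": [21.5, None, 19.0],
    "strings": ["21.5", "19.0"],
    "nan": [float("nan"), 20.0],
    "duplicates": [20.0, 20.0, 20.0],
}


def fuzz(func, cases):
    """Run func on each case and record 'ok' or the exception type name."""
    results = {}
    for name, value in cases.items():
        try:
            func(value)
            results[name] = "ok"
        except Exception as exc:
            results[name] = type(exc).__name__
    return results


report = fuzz(mean_temperature, BAD_INPUTS)
for name, outcome in report.items():
    print(name, "->", outcome)
```

The crashes (`ZeroDivisionError` for the empty list, `TypeError` for `None` and strings) are easy to spot. The NaN case is the more worrying one: it reports "ok" even though the computed mean is NaN, which is the kind of silent failure that testing with noisy data is meant to surface.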