Quickstart
Want to get started right away? Here is a tutorial on using datafuzz that takes five minutes or less.
Defining YAML Strategies
One of the easiest ways to get started with datafuzz is to define strategies in YAML format for a dataset you already have. Let’s take a look at an example YAML file:
    data:
      input: 'file://datafuzz/examples/data/sales_data.csv'
      output: 'file://datafuzz/examples/data/sales_data_with_dupes.csv'
    strategies:
      - type: 'duplication'
        percentage: 10
        add_noise: 1
In this file, you set up the data input and output, which are both CSV files. Then you list the strategies to apply; in this case there is only one. It calls for duplicating 10% of the rows and adding random noise to the duplicated rows.
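To make the effect of such a strategy concrete, here is a minimal standard-library sketch of "duplicate a percentage of rows and perturb the copies". It is not datafuzz's implementation: the function `duplicate_with_noise`, the `noise_scale` parameter, and the `order_id` and `amount` columns are all hypothetical and chosen only for illustration.

```python
import random


def duplicate_with_noise(rows, percentage, noise_columns, noise_scale=0.05, seed=None):
    """Return the original rows followed by noisy copies of a random sample.

    Hypothetical helper for illustration only; not part of datafuzz.
    """
    rng = random.Random(seed)
    n_dupes = round(len(rows) * percentage / 100)
    dupes = []
    for row in rng.sample(rows, n_dupes):
        copy = dict(row)  # copy so the original row is left untouched
        for col in noise_columns:
            # Scale each numeric value by a random factor within +/- noise_scale.
            copy[col] = copy[col] * (1 + rng.uniform(-noise_scale, noise_scale))
        dupes.append(copy)
    return rows + dupes


# Made-up stand-in for a sales dataset with 2000 data rows.
rows = [{"order_id": i, "amount": 100.0 + i} for i in range(2000)]
fuzzed = duplicate_with_noise(rows, percentage=10, noise_columns=["amount"], seed=0)
print(len(rows), len(fuzzed))  # 2000 2200
```

With 2000 input rows and a percentage of 10, the sketch appends 200 perturbed copies, so the result has 2200 rows.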
We can now run these transformations from the command line.
$ datafuzz run datafuzz/examples/read_csv_and_dupe.yaml
When complete, we can check the difference in number of lines for the files.
$ wc -l datafuzz/examples/data/sales_data*
2001 datafuzz/examples/data/sales_data.csv
2201 datafuzz/examples/data/sales_data_with_dupes.csv
4202 total
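The line counts above can be checked against the configured percentage with a little arithmetic. The sketch below assumes each CSV has exactly one header line, which the `wc` output alone does not confirm; `data_rows` is a hypothetical helper.

```python
def data_rows(line_count, header_lines=1):
    """Number of data rows in a file, assuming `header_lines` header lines."""
    return line_count - header_lines


original = data_rows(2001)  # sales_data.csv
fuzzed = data_rows(2201)    # sales_data_with_dupes.csv
added = fuzzed - original
print(added, 100 * added / original)  # 200 10.0
```

Under that assumption, 200 rows were added to 2000 original rows, which matches the 10% configured in the YAML file.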
That’s it! For more information on all available strategies, check out Strategies.
Generation and Noise in Jupyter
Are you using Jupyter notebooks for your development? datafuzz can easily integrate with your workflow.
To get started, take a look at the example notebooks in the repository.
Generating Synthetic Data
Generating synthetic data is as easy as Py with datafuzz. A schema definition can be declared using simple YAML:
    num_rows: 200
    output: 'file:///tmp/iot.csv'
    timeseries:
      start_time: 2015-01-01T00:00:00
    schema:
      username: faker.user_name
      temperature: range(5,30)
      heartrate: range(60,90)
      build: faker.uuid4
      latest: [0, 0, 1]
      note: ['interval', 'sleep', 'wake', 'update', 'user', 'test', 'n/a']
This file declares the number of rows to generate (num_rows), where to write the output, timeseries information (only required if you want a timeseries to be generated) and the schema for each row. In the schema you can use ranges, aranges, lists or faker providers (see the faker provider documentation).
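As a rough mental model of a schema like this, each column can be thought of as a small value generator that is called once per row. The sketch below uses only the standard library and is not how datafuzz itself parses or evaluates the YAML. It makes several assumptions: `generate_rows` is a hypothetical helper, the `username` generator is a crude stand-in for `faker.user_name`, ranges follow Python's `range` semantics (upper bound excluded), and lists are sampled uniformly.

```python
import random
import uuid

rng = random.Random(42)  # seeded so the sketch is repeatable

# Hypothetical Python equivalent of the YAML schema: column name -> generator.
schema = {
    "username": lambda: "user{}".format(rng.randrange(1000)),  # stand-in for faker.user_name
    "temperature": lambda: rng.choice(range(5, 30)),
    "heartrate": lambda: rng.choice(range(60, 90)),
    "build": lambda: str(uuid.UUID(int=rng.getrandbits(128), version=4)),
    # Under uniform sampling, repeating a value makes it more likely.
    "latest": lambda: rng.choice([0, 0, 1]),
    "note": lambda: rng.choice(
        ["interval", "sleep", "wake", "update", "user", "test", "n/a"]
    ),
}


def generate_rows(schema, num_rows):
    """Build `num_rows` dicts by calling every column generator once per row."""
    return [{col: make() for col, make in schema.items()} for _ in range(num_rows)]


rows = generate_rows(schema, 200)
print(len(rows), sorted(rows[0]))
```

The point of the sketch is only the shape of the idea: a declarative mapping from column names to value sources, expanded into as many rows as `num_rows` requests.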
To generate the data, run the following from the command line:
$ datafuzz generate datafuzz/examples/yaml_files/iot_schema.yaml
To see our generated data, we can peek at the output file:
$ head -n 5 /tmp/iot.csv
build,temperature,username,heartrate,latest,note
59803106-7fa4-5fe3-2ad8-0e962c4e5666,13,rramirez,86,0,n/a
d865fbc7-d43a-e001-ea67-d1892c26aa41,26,kristi42,72,0,n/a
535f4f08-ca2b-c418-081b-bc8e572087e9,7,jacksonterri,88,1,n/a
69e2796f-f2a2-f139-1b06-cbc500cb387b,6,eerickson,75,0,wake