Usage

To use datafuzz in a project, you have multiple options: the command line interface (CLI), or the Python API from a regular Python script or a Jupyter notebook.

CLI

The easiest and fastest way to get started using datafuzz is via the command line interface, or CLI. There are a few ways to do so. First, you should determine if you need to generate data or simply modify data you have.

Generate command

If you need to generate data, use the CLI generate command. It can be driven in two ways:

  • Utilize a YAML file which defines the schema for your synthetic data.
  • Pass descriptions in via command line flags (not recommended for long or complex schemas, as this is not easily maintainable).

A good example of the YAML usage is included in the Quickstart.

Let’s take a look at how to use the command line flags:

$ datafuzz generate --non-yaml -h
    usage: datafuzz [-h] [-f FIELDS] [-v VALUES] [-o OUTPUT] [-n NUM_ROWS]
                    [--start_time START_TIME] [--end_time END_TIME]
                    [--increments {hours,seconds,days,random}]
                    {generate}

    Generate dataset: to use

    positional arguments:
      {generate}

    optional arguments:
      -h, --help            show this help message and exit
      -f FIELDS, --fields FIELDS
                            semicolon-delimited string of field names
      -v VALUES, --values VALUES
                            semicolon-delimited string of values.This can be a mix
                            of faker types and ranges
      -o OUTPUT, --output OUTPUT
                            what output to use
      -n NUM_ROWS, --num_rows NUM_ROWS
                            number of rows to generate
      --start_time START_TIME
                            start time of timeseries in isoformat:YYYY-MM-
                            DDThh:mm:ss
      --end_time END_TIME   end time of timeseries in isoformat: YYYY-MM-
                            DDThh:mm:ss
      --increments {hours,seconds,days,random}
                            how to increment entries

To indicate that we aren’t using YAML, we pass the --non-yaml flag, which gives us access to the CLI parsers. The generate command offers a long list of options. Let’s try a few:

$ datafuzz generate -f 'name;age;city' -v 'faker.name;range(30,40);faker.city' -n 200 -o file://friends.csv

dataset now available at friends.csv

Now let’s check the content:

$ head -n 5 friends.csv

name,city,age
Eric Walsh,West Brandy,36
Jason Willis,Port Stephen,37
Kyle Greer,North Brandon,32
Mathew Ward,North Ginabury,32

That was easy! :)
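If you assemble generate calls from a script, the semicolon-delimited -f and -v strings can be built from a single mapping so that each field stays paired with its value description. The following is a minimal sketch using only the Python standard library. build_generate_command is a hypothetical helper, not part of datafuzz, and it relies only on the flags shown in the help output above.

```python
import shlex


def build_generate_command(schema, num_rows, output):
    """Return the argv list for a `datafuzz generate` call.

    `schema` maps field names to value descriptions (faker types or ranges).
    Illustrative helper only; it is not shipped with datafuzz.
    """
    fields = ";".join(schema.keys())
    values = ";".join(schema.values())
    return [
        "datafuzz", "generate",
        "-f", fields,
        "-v", values,
        "-n", str(num_rows),
        "-o", output,
    ]


schema = {
    "name": "faker.name",
    "age": "range(30,40)",
    "city": "faker.city",
}
command = build_generate_command(schema, 200, "file://friends.csv")

# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(command))
```

The printed line reproduces the command used above. To run it from Python instead, the list could be passed to subprocess.run(command, check=True), assuming the datafuzz executable is on your PATH.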

For a review of all options you can use with the generate command, check out the Generators.

Run command

A second option is to modify data you already have or data you just generated. To do so, use the run command. Like the generate command, it has two options:

  • Utilize a YAML file which defines the different transformations to run on your data
  • Pass a type of run directly into the command line (and repeat as needed)

A good example of the YAML usage is included in the Quickstart.

Let’s take a look at run with just command line options:

    $ datafuzz run --non-yaml -h

usage: datafuzz [-h] [-i INPUT] [-o OUTPUT] [-s STRATEGIES] [--db_uri DB_URI]
                  [--query QUERY] [--table TABLE]
                  {run}

  Apply datafuzz strategies to input, return output

  positional arguments:
    {run}

  optional arguments:
    -h, --help            show this help message and exit
    -i INPUT, --input INPUT
                          input string (filename or sql)
    -o OUTPUT, --output OUTPUT
                          input string (filename or sql)
    -s STRATEGIES, --strategies STRATEGIES
                          dictionary defining the strategies to take
    --db_uri DB_URI       If using database, the db URI to connect
    --query QUERY         If using db input, query to collect data
    --table TABLE         If using db output, table to insert into

Okay, let’s give it a shot with our newly generated friends.csv file:

    $ datafuzz run -i file://friends.csv -o file://fuzzy_friends.csv -s '{"type": "fuzz", "percentage": 30}'

dataset now available at fuzzy_friends.csv
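The -s value in the command above is a JSON-style dictionary, which is easy to mistype by hand once it grows. As a sketch, and assuming the CLI accepts any text that json.dumps produces for such a dictionary (the example above is consistent with this, but it is not confirmed here), the argument can be built from a Python dict:

```python
import json
import shlex

# Strategy definition copied from the command shown above.
strategy = {"type": "fuzz", "percentage": 30}

command = [
    "datafuzz", "run",
    "-i", "file://friends.csv",
    "-o", "file://fuzzy_friends.csv",
    "-s", json.dumps(strategy),  # serialise the dict for the -s flag
]

# shlex.join wraps the JSON text in single quotes for safe shell use.
print(shlex.join(command))
```

Building the argument this way keeps the quoting concerns (JSON double quotes inside shell single quotes) out of your hands.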

And we can check our output:

$ head -n 5 fuzzy_friends.csv

    ,name,city,age
    0,Eric Walsh,b'\xef\xbb\xbf'West Brandy,36
    1,Jason Willis,Port Stephen,37
    2,Kyle Greer,North Brandon,32
    3,Mathew Ward,North Ginabury,32

And indeed, our friends now have some fuzz! For a review of all options you can use with the run command, check out the Strategies.
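To see exactly which cells were fuzzed, you can compare the two files cell by cell. The sketch below uses only the standard library and inlines the sample rows shown above in place of the real files. It assumes the run command keeps rows in their original order, which holds for this sample but is not guaranteed here for every strategy. changed_cells is a hypothetical helper name.

```python
import csv
import io

ORIGINAL_CSV = """name,city,age
Eric Walsh,West Brandy,36
Jason Willis,Port Stephen,37
Kyle Greer,North Brandon,32
Mathew Ward,North Ginabury,32
"""

# Raw string so the backslashes in the fuzzed cell stay literal text.
FUZZED_CSV = r""",name,city,age
0,Eric Walsh,b'\xef\xbb\xbf'West Brandy,36
1,Jason Willis,Port Stephen,37
2,Kyle Greer,North Brandon,32
3,Mathew Ward,North Ginabury,32
"""


def changed_cells(original_text, fuzzed_text):
    """Return (row, column, before, after) for every cell that differs."""
    original_rows = list(csv.DictReader(io.StringIO(original_text)))
    fuzzed_rows = list(csv.DictReader(io.StringIO(fuzzed_text)))
    changes = []
    for index, (before, after) in enumerate(zip(original_rows, fuzzed_rows)):
        # Only compare columns present in the original; the fuzzed file
        # carries an extra unnamed index column.
        for column, value in before.items():
            if after.get(column) != value:
                changes.append((index, column, value, after.get(column)))
    return changes


for index, column, before, after in changed_cells(ORIGINAL_CSV, FUZZED_CSV):
    print(f"row {index}, {column}: {before!r} -> {after!r}")
```

For real files, replace the inlined strings with the contents of friends.csv and fuzzy_friends.csv.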

For a more in-depth look into datafuzz, see Developer Interface.

Using datafuzz with Python or Jupyter

You don’t have to go through the CLI to use datafuzz; you can also use it from your own Python scripts, frameworks, or Jupyter notebooks. To see some notebook integration examples, check out the Jupyter notebooks included in the examples directory.

When integrating with a Python script, the parameters needed for initialization differ depending on the class you use to transform your data.

You might start with a dataset in the form of a pandas DataFrame, a numpy matrix, or even a Python list of dictionaries or lists. You could also generate a new dataset with the generator class.

Let’s generate a simple dataset using the generator:

from datafuzz.generators import DatasetGenerator

generator = DatasetGenerator({
    'output': 'pandas',
    'schema': {
        'category': list('ABCD'),
        'model': range(4,8),
        'plate': 'faker.license_plate',
        'year': range(2001, 2018),
        'color': 'faker.safe_color_name',
        'price': range(20000, 50000, 1000)
    },
    'num_rows': 1000,
})

generator.generate()

dataset = generator.to_output()

print(dataset.head())

Your output should look something like this:

  category    color  model     plate  price  year
0        A  fuchsia      4   0736 CF  20000  2003
1        B     teal      6   EXS 036  29000  2004
2        D     teal      6   1QX5388  32000  2009
3        C     navy      5  6P 15774  30000  2011
4        A    white      4   0SQ D88  31000  2013

Now we have a dataset variable that holds our generated dataframe. If you had instead imported or transformed your own data into a dataframe, you could start at this step.
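One way to picture what the list and range entries in the schema describe is that each generated row takes one value per field from that field's choices. The sketch below imitates this with the standard library only. It is a stand-in for illustration and not datafuzz's implementation: sample_rows is a hypothetical function, uniform random sampling is an assumption, and the faker-based fields are left out because they need the faker package.

```python
import random

# The list and range fields from the schema above (faker fields omitted).
schema = {
    "category": list("ABCD"),
    "model": range(4, 8),
    "year": range(2001, 2018),
    "price": range(20000, 50000, 1000),
}


def sample_rows(schema, num_rows, seed=None):
    """Draw one value per field for each row, uniformly at random."""
    rng = random.Random(seed)
    return [
        {field: rng.choice(choices) for field, choices in schema.items()}
        for _ in range(num_rows)
    ]


rows = sample_rows(schema, 5, seed=0)
for row in rows:
    print(row)
```

Every value stays inside the domain declared for its field, which is a useful property to check when you write your own schema.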

Now that we have some data to work with, let’s see which transformations are available. The available strategies are the following classes:

  • Duplicator
  • Fuzzer
  • NoiseMaker

Let’s use the NoiseMaker class to add some noise to our dataset.

from datafuzz import DataSet, NoiseMaker

dataset = DataSet(dataset,
                  output='file://my_new_file.json')

noiser = NoiseMaker(
    dataset,
    noise=['add_nulls', 'random'],
    columns=['price', 'model', 'year'],
    percentage=30,
)
noiser.run_strategy()

At this point, your DataSet object has been transformed. You can check it by looking at the first five items in the dataset:

print(dataset[:5])

Your data should now be a bit messy:

  category    color     model     plate         price         year
0        D   maroon  4.376930   179-IYJ  29000.000000  2006.000000
1        D    green  4.000000  P 336983  15468.136372  1598.702067
2        D     gray  5.270262   DIV-042  20000.000000  2002.000000
3        C     aqua  2.017815   84R 707  38000.000000          NaN
4        C  fuchsia  6.000000   8078 TU  37000.000000  1995.355014

You can continue running transformations, if you like:

from datafuzz import Duplicator

duplicator = Duplicator(
    dataset,
    percentage=20,
)
duplicator.run_strategy()

When you are done with the transformations, you can export the data according to the output you set when the dataset was first initialized. You can also set a new output string. The available outputs are:

  • pandas dataframe: ‘pandas’
  • numpy 2D array: ‘numpy’
  • datafuzz.DataSet: ‘dataset’
  • list: ‘list’
  • CSV files: ‘file://foo.csv’
  • JSON files: ‘file://foo.json’
  • SQL: ‘sql’

If you use the ‘sql’ output, you also need to set values for ‘table’ and ‘db_uri’. For an in-depth treatment of input and output options, please see I/O Options.

To get the output, call to_output:

output = dataset.to_output()

print(output)

And now to check the file:

$ head -c 200 my_new_file.json

{"0":{"category":"D","color":"maroon","model":2.5774000851,"plate":"0ME 062","price":null,"year":2008.0},"1":{"category":"C","color":"black","model":null,"plate":"UGV-266","price":39000.0,"year":2010.
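The JSON output is keyed by row index, with one object per row, so it can be inspected with the standard library alone. As a small sketch, the code below counts how many nulls each column received. It uses only the first record from the sample above (the rest of that sample is truncated), and null_counts is a hypothetical helper name.

```python
import json
from collections import Counter

# First record copied from the sample output above; a real file holds many more.
sample = (
    '{"0":{"category":"D","color":"maroon","model":2.5774000851,'
    '"plate":"0ME 062","price":null,"year":2008.0}}'
)


def null_counts(records):
    """Count null values per column in a {row_index: {column: value}} mapping."""
    counts = Counter()
    for row in records.values():
        for column, value in row.items():
            if value is None:  # JSON null is loaded as Python None
                counts[column] += 1
    return counts


records = json.loads(sample)
print(null_counts(records))
```

To run the same check on the exported file, load it with json.load on an open handle to my_new_file.json instead of parsing the inline sample.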

That covers the vast majority of the functionality contained within datafuzz. Want to see more features? Check the backlog, and feel free to follow the steps for contributing!

If you want to see more examples of functionality, check out the examples in the repository.

For a longer explanation of the generators and their functionality, see the Generators documentation. For the same treatment for strategies, see the Strategies documentation.

For a review of I/O functionality and options (for both parsers and DataSet output), see the I/O Options documentation.