Developer Interface

Here are the main interfaces of datafuzz for general use.

Dataset class

class datafuzz.DataSet(input_obj, **kwargs)[source]

DataSet objects are used as the primary datatype for passing around data in datafuzz.

If pandas is installed, it will use dataframes to load and transform data; otherwise, it will use a list. You can also specify to not use pandas by passing keyword argument pandas=False.

Supported inputs are JSON and CSV files, numpy 2D arrays, sql queries (you must pass a db_uri keyword argument and a query argument), pandas DataFrames and Python lists (of dictionaries or lists).
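For the list-based inputs, the two accepted shapes look like this (a plain-Python illustration; you would pass either object directly to datafuzz.DataSet):

```python
# Two accepted list shapes for DataSet input: a list of dictionaries
# (keys act as column names) or a list of lists (positional columns).
list_of_dicts = [
    {"name": "alice", "age": 34},
    {"name": "bob", "age": 29},
]
list_of_lists = [
    ["alice", 34],
    ["bob", 29],
]
# e.g. datafuzz.DataSet(list_of_dicts) or
#      datafuzz.DataSet(list_of_lists, pandas=False)
```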

Attributes:

DATA_TYPES (list): list of possible datatypes (pandas, numpy, list)
FILE_REGEX (str): regex to find file name
USE_PANDAS (bool): whether pandas is installed and OK to use (no pandas=False)
records (list): data records for input
input (obj): initial input for dataset (can be dataframe, list, numpy array, filename or sql)
output (str): output (if specified, can be dataframe, list, numpy array, filename or sql)
original (obj): copy of the input which won't be modified
data_type (str): dataset datatype (pandas, numpy, list)
db_uri (str): dataset database connection string (required only if using sql as input or output)
query (str): dataset database select query string (required only if using sql as input)
table (str): dataset database output table (required only if using sql as output)
append(rows)[source]

Append rows to DataSet records

Arguments:
rows (list): rows to add or concatenate
TODO:
  • is a shuffle needed?
  • should the index be maintained or reordered?
  • should new indexes be ordered or not?
column_agg(column, agg_func)[source]

Perform aggregate function on given column

Arguments:
column (int): column index
agg_func (function): aggregate function to perform on the column

Returns aggregate result

Example:

dataset.column_agg(3, min)
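For list-backed records, the behavior is roughly equivalent to the following pure-Python sketch (illustrative only, not datafuzz's actual implementation):

```python
def column_agg_sketch(records, column, agg_func):
    """Rough list-backed equivalent of DataSet.column_agg:
    collect the values at the given column index and reduce them."""
    return agg_func(row[column] for row in records)

records = [
    ["a", 10, 3.5],
    ["b", 7, 2.1],
    ["c", 12, 9.9],
]
column_agg_sketch(records, 1, min)  # → 7
```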
column_dtype(column)[source]

Return dtype of column

Arguments:
column (int): column index
Return:
data type of the column
TODO:
  • determine smart way to test more than one row for a list?
column_idx(column)[source]

Return numeric index of a column

NOTE: if column is not found, raises an AttributeError

input_filename

Return the filename if the input follows the proper file format: file://[absolute or relative filepath]

NOTE: this will raise an exception if the file is not found

output_filename

Return the filename if the output follows the proper file format: file://[absolute or relative filepath]

sample(percentage, columns=False)[source]

Get a sample from the dataset.

Arguments:
percentage (float): percentage of the dataset to sample; should be a value from 0.0-1.0
Kwargs:
columns (bool): option to sample columns from the dataset; default is False
Returns:
A sample from the dataset with matching datatype
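For a list-backed dataset, row sampling is roughly equivalent to this sketch (illustrative only, not datafuzz's code):

```python
import random

def sample_sketch(records, percentage):
    """Draw percentage (0.0-1.0) of the rows at random,
    mirroring DataSet.sample for list-backed data."""
    k = int(len(records) * percentage)
    return random.sample(records, k)

rows = [[i, i * 2] for i in range(10)]
sampled = sample_sketch(rows, 0.5)  # 5 of the 10 rows
```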
to_output()[source]

Transform DataSet records to output. This uses the helper method obj_to_output, located in output/helpers.py.

Returns output object or filepath.

validate_db()[source]

Validate that proper variables are set and a connection can be established with the database if either input or output are set to sql.

This will raise an exception if validation fails.

validate_parsed()[source]

Validate if data was properly parsed. This tests:

  • valid data types
  • records properly parsed and set to self.records
  • self.original exists

It will raise an exception if the validation fails.

Strategy classes

class datafuzz.strategy.Strategy(dataset, **kwargs)[source]

Strategy objects apply predefined noise and fuzz to datasets.

Parameters:
dataset (datafuzz.DataSet): dataset to noise / alter
Kwargs:
percentage (int): percentage to distort (0-100); defaults to 30 if none given
Attributes:
dataset (datafuzz.DataSet): dataset to noise / alter
percentage (float): percentage to distort (0-1)

NOTE: each strategy type may have additional required keyword arguments

see also: duplicator.Duplicator, noise.NoiseMaker and fuzz.Fuzzer

apply_func_to_column(function, column, dataset=None)[source]

Apply a function to a column in a given dataset.

(this should work as uniformly as possible across data types)

Arguments:
function (lambda or other func): function to apply
column (int): column index
Kwargs:
dataset (dataset.DataSet): dataset to use
defaults to self.dataset
Returns:
None

Note: This performs transformations on dataset.records in place.

get_numeric_columns(columns)[source]

Ensure columns are numeric; this will look up the indexes of string column names (i.e. pandas columns or dict keys)

Arguments:
columns (list of str or int): column list
Returns:
columns (list of int): column list (only ints)
num_rows

Return the number of rows to transform in the dataset, based on the given percentage.

NOTE: this uses rounding so only whole numbers are returned.
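The rounding note above amounts to the following sketch (illustrative, not datafuzz's code):

```python
def num_rows_sketch(total_rows, percentage):
    """Sketch of Strategy.num_rows: scale the dataset length by the
    percentage and round to a whole number of rows."""
    return round(total_rows * percentage)

num_rows_sketch(200, 0.25)  # → 50
```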

class datafuzz.duplicator.Duplicator(dataset, **kwargs)[source]

Duplicator is used to duplicate rows in a dataset

see also: strategy.Strategy

noise(sample)[source]

Adds noise to the duplicate rows

Parameters:
sample (list or obj): dataset.Dataset.sample
Returns:
sample (list or obj): distorted rows
TODO:
  • implement more noise options than just random
run_strategy()[source]

Run the duplicator strategy; if add noise is selected, add noise to the data before appending it to the dataset.
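Conceptually, the duplicator flow samples a share of rows, optionally distorts the copies, and appends them back. A simplified pure-Python sketch of that flow (the noise step and column index are illustrative assumptions):

```python
import random

def duplicate_rows(records, percentage, add_noise=False):
    """Simplified sketch of the duplicator flow: sample a share of
    rows, optionally perturb the copies, then append them."""
    k = round(len(records) * percentage)
    duplicates = [list(row) for row in random.sample(records, k)]
    if add_noise:
        for row in duplicates:
            # toy noise: jitter a numeric column (assumption: index 1)
            row[1] += random.uniform(-1, 1)
    return records + duplicates

data = [["a", 1.0], ["b", 2.0], ["c", 3.0], ["d", 4.0]]
len(duplicate_rows(data, 0.5))  # → 6
```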

class datafuzz.fuzz.Fuzzer(dataset, **kwargs)[source]

Fuzzer is used as a strategy to add “dumb” fuzzing methods (i.e. random bad values). These transformations are mainly based on column type.

see also: strategy.Strategy

fuzz_date()[source]

Return random choice from date fuzz helpers.

Possible transformations:
  • shift_time: shift the time by a random amount
  • date_to_str: transform to string
fuzz_numeric()[source]

Return a random choice from the numeric fuzz helpers.

Possible transformations:
  • nanify: insert null values (sometimes strs)
  • bigints: return big magic numbers
  • hexify: return hex value
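The numeric fuzzers follow a pick-one-transformation-at-random pattern. A hypothetical sketch of that pattern (the names nanify, bigints and hexify come from the list above; the bodies are purely illustrative):

```python
import random

def nanify(value):
    # insert a null value (sometimes as a string, per the list above)
    return random.choice([None, float("nan"), "null"])

def bigints(value):
    # return a big "magic number" boundary value
    return random.choice([2**31 - 1, 2**63 - 1, -2**63])

def hexify(value):
    # return the value as hex
    return hex(int(value))

def fuzz_numeric(value):
    """Pick one numeric fuzz helper at random, mirroring the
    random-choice behavior described above."""
    return random.choice([nanify, bigints, hexify])(value)
```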
fuzz_random()[source]

Return a random choice from the random fuzz helpers.

Possible transformations:
  • sql: returns unkind sql
  • metachars: inserts metacharacters
  • files: returns filepaths or bash
  • delimiter: inserts multiple delimiters
  • emoji: inserts one random emoji
fuzz_str()[source]

Return random choice from string fuzz helpers.

Possible transformations:
  • add_format: insert format strings
  • change_encoding: decode with possibly bad encoding
  • to_bytes: transform to bytes
  • insert_boms: insert utf-8 boms
run_strategy()[source]

Apply fuzz methods to chosen columns.

For now, this applies a mixture of random and column type based transformations.

See Fuzzer.fuzz_str, Fuzzer.fuzz_random and Fuzzer.fuzz_numeric for full list of possible transformations.

class datafuzz.noise.NoiseMaker(dataset, **kwargs)[source]

NoiseMaker applies noisy data transformations to given dataset.

see also strategy.Strategy

nullify()[source]

Set null values for sample in columns
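The nullify idea can be sketched in pure Python for list-backed records (illustrative only; the percentage parameter mirrors the Strategy percentage):

```python
import random

def nullify(records, column, percentage=0.3):
    """Sketch of nullify: set a random sample of values in one
    column to None, in place."""
    k = round(len(records) * percentage)
    for idx in random.sample(range(len(records)), k):
        records[idx][column] = None
    return records

rows = [[i, "x"] for i in range(10)]
nullify(rows, 1, 0.5)
sum(1 for r in rows if r[1] is None)  # → 5
```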

randomize()[source]

Set random values for sample in columns

NOTE: this will vary based on column type

run_strategy()[source]

Run noise strategy on sample

Performs transformations on self.dataset

set_value(value, column=None)[source]

Set value for a series of columns or one column.

Arguments:
value (obj): value to set
Kwargs:
column (str or int): name or index of column
TODO:
  • should this be available on Strategy class?
string_permutation(column=None)[source]

Permute string values for sample in columns

TODO:
  • add permutations for missing characters
  • flipped strings
  • typos
  • homonyms / autocorrect
type_transform()[source]

Transform types for sample in columns.

NOTE: if a string column is used and the values cannot be transformed into integer or float values, you may not see a useful transformation.

TODO:
  • for strings, should numeric values be inserted as strings instead?

use_range()[source]

Use values from a range to set values in columns

If limits are not passed during initialization, this method will attempt to determine good limits based on the column ranges and use those.

NOTE: range is only available for numeric columns

TODO:
  • should we calculate IQR and insert outliers?
  • if not, should add_outliers be a new option for noise?

Parser classes

class datafuzz.parsers.StrategyYAMLParser(file_path)[source]

Strategy YAML Parser is used to parse strategies and fuzz / transform data using a simple YAML definition.

see also parsers.core.BaseYAMLParser
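A minimal strategy YAML might look like the following. The top-level keys mirror the accessors documented below (input, output, strategies; plus db_uri, query and table for sql); the per-strategy keys are illustrative assumptions, so check your installed version for the exact names:

```yaml
input: file://my_data.csv
output: file://my_noisy_data.csv
strategies:
  # each entry names a strategy and its options (illustrative keys)
  - strategy: duplication
    percentage: 20
  - strategy: fuzz
    percentage: 10
```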

db_uri

Return data db_uri from parsed YAML

execute()[source]

Execute strategies from parsed YAML

input

Return data input from parsed YAML

output

Return data output from parsed YAML

query

Return data query from parsed YAML

strategies

Return strategies from parsed YAML

table

Return data table from parsed YAML

class datafuzz.parsers.StrategyCLIParser(**kwargs)[source]

Strategy CLI Parser is used to parse strategies and fuzz / transform data using a simple CLI definition.

execute()[source]

Execute fuzzing strategies from the parser

Returns:
output
init_parser()[source]

Initialize parser with required and optional arguments

Returns:
argparse.ArgumentParser
parse_args(argv=None)[source]

Parse arguments and validate them

Kwargs:
argv (sys.argv or similar list)
print_help()[source]

Print parser help

validate_arguments()[source]

Validate that all required fields are submitted

class datafuzz.parsers.SchemaYAMLParser(file_name)[source]

Schema YAML Parser is used to generate data using a simple YAML definition.

see also parsers.core.BaseYAMLParser
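An illustrative schema YAML follows. The top-level keys mirror the accessors documented below (schema, num_rows, output, timeseries), and the schema values follow the {'foo': 'faker.name', ...} pattern described under DatasetGenerator; the nesting of the timeseries keys is an assumption:

```yaml
num_rows: 100
output: file://generated.csv
schema:
  name: faker.name
  email: faker.email
timeseries:
  start_time: 2017-01-01 00:00:00
  increments: hours
```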

execute()[source]

Generate data using the parsed YAML

Returns:
output
num_rows

Return num_rows from parsed YAML

output

Return output from parsed YAML

parse_timeseries()[source]

Parse and set values related to timeseries

Raises SyntaxError if the start or end time was not properly parsed.

schema

Return schema from parsed YAML

timeseries

Return timeseries from parsed YAML

validate_yaml()[source]

Validate that all required fields are parsed from YAML

Raises SyntaxError if a required field is missing.

class datafuzz.parsers.SchemaCLIParser(**kwargs)[source]

Schema Parser for CLI input. This generates an argparse parser to parse input and can then be used to generate the dataset.

execute()[source]

Generate data from CLI-parsed arguments

Returns:
output
init_parser()[source]

Generate argparse.ArgumentParser to use for parsing arguments

parse_args(argv=None)[source]

Parse arguments and validate them

Kwargs:
argv (sys.argv or similar list)
print_help()[source]

Print parser help

validate_arguments()[source]

Validate that all required fields are submitted

Output classes

class datafuzz.output.CSVOutput(dataset, **kwargs)[source]

CSV output for writing datasets to CSV file.

see also: datafuzz.output.BaseOutput

to_csv()[source]

Write the CSVOutput to a csv file

class datafuzz.output.JSONOutput(dataset, **kwargs)[source]

JSON output for writing datasets to JSON file.

see also: datafuzz.output.BaseOutput

to_json()[source]

Write the JSONOutput to a json file

class datafuzz.output.SQLOutput(dataset, **kwargs)[source]

Database output for writing datasets to a table.

see also: datafuzz.output.BaseOutput

Extra parameters:
db_uri (str): database URI string
table (str): database table name
to_sql()[source]

Write the dataset records to a sql table

datafuzz.output.obj_to_output(obj)[source]

Transform DataSet or generator records to output

supported outputs:
dataset, pandas, numpy, list, csv and json (specify file://$NAME.csv or file://$NAME.json) and sql (specify db_uri and table)

NOTE: will raise exception if unsupported output set

Generator classes

class datafuzz.generators.DatasetGenerator(schema_parser)[source]

DatasetGenerator creates a dataset when given parsed YAML, a series of CLI arguments, or directly passed arguments.

Attributes:
parser (parsers.SchemaYAMLParser, parsers.SchemaCLIParser or dict): parsed YAML, CLI or dict with necessary keys
output (str): output description
num_rows (int): number of rows to generate
timeseries (bool): whether to generate a timeseries
records (list): generated data
fake (faker.Faker): Faker object to generate data

Parser parameters:
schema (dict): dictionary of column names and values to use (i.e. {'foo': 'faker.name', ...})
num_rows (int): number of rows to generate
start_time (datetime): datetime to start from if timeseries
increments (str): if timeseries, the type of increments to use, from a series of string choices ('days', 'hours', 'seconds', 'random')
end_time (datetime): optional end date if timeseries

see also parsers.core

generate()[source]

Generate the dataset (self.records) based on the given schema.

If a timeseries is selected, this method delegates to Generator.generate_timeseries.

generate_row()[source]

Generate a row based on the parsed schema

NOTE: this uses string matching to determine if the schema is a list, faker definition or one of the matching patterns in EVAL_REGEX and then generates data based on those predefined selections.

generate_timeseries()[source]

Generate a timeseries with a timestamp column.

This uses the parser's start date and increments to generate the timestamp for each row.

NOTE: a warning will be logged if an end time is given and the number of rows is not reached before that end time. The end time takes precedence if specified.

TODO: should num_rows take precedence over end time?

increment_time()[source]

For timeseries generation, increment the start time given the parser requirements:

  • hours, days, seconds
  • if not set, returns random mix

TODO: allow for set interval?
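The increment logic above can be sketched as follows (a hypothetical helper, assuming one unit step per row; not datafuzz's implementation):

```python
import random
from datetime import datetime, timedelta

def increment_time(current, increments=None):
    """Sketch of the timeseries increment: step by the named unit
    ('hours', 'days', 'seconds'), or by a random mix if unset."""
    unit = increments or random.choice(["hours", "days", "seconds"])
    step = {
        "hours": timedelta(hours=1),
        "days": timedelta(days=1),
        "seconds": timedelta(seconds=1),
    }[unit]
    return current + step

start = datetime(2017, 1, 1)
increment_time(start, "hours")  # → datetime(2017, 1, 1, 1, 0)
```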

output_filename

Return the filename if the output follows the proper file format: file://[absolute or relative filepath]

to_output()[source]

Return or create output based on parsed schema.

For more use cases, please reference the Usage section.