Developer Interface

Here are the main interfaces of datafuzz for general use.
Dataset class

class datafuzz.DataSet(input_obj, **kwargs)

DataSet objects are the primary datatype for passing data around in datafuzz.

If pandas is installed, DataSet uses a DataFrame to load and transform data; otherwise it uses a list. You can also opt out of pandas by passing the keyword argument pandas=False.

Supported inputs are JSON and CSV files, numpy 2D arrays, SQL queries (you must pass a db_uri keyword argument and a query argument), pandas DataFrames, and Python lists (of dictionaries or lists).
Attributes:
- DATA_TYPES (list): possible datatypes (pandas, numpy, list)
- FILE_REGEX (str): regex used to find the file name
- USE_PANDAS (bool): whether pandas is installed and allowed for use (i.e. pandas=False was not passed)
- records (list): data records
- input (obj): initial input for the dataset (can be a dataframe, list, numpy array, filename or sql)
- output (str): output (if specified, can be a dataframe, list, numpy array, filename or sql)
- original (obj): copy of the input which won't be modified
- data_type (str): dataset datatype (pandas, numpy, list)
- db_uri (str): database connection string (required only if using sql as input or output)
- query (str): database select query string (required only if using sql as input)
- table (str): database output table (required only if using sql as output)
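For illustration, these are the kinds of Python-native inputs described above; the commented DataSet call is a sketch and assumes datafuzz is installed:

```python
# Illustrative inputs a DataSet can wrap: a list of dictionaries or a list of lists
rows_as_dicts = [{'id': 1, 'score': 0.9}, {'id': 2, 'score': 0.4}]
rows_as_lists = [[1, 0.9], [2, 0.4]]

# With datafuzz installed, either could be passed directly, e.g.:
# from datafuzz import DataSet
# ds = DataSet(rows_as_dicts, pandas=False)  # force the list backend
```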
append(rows)

Append rows to the DataSet records.

Arguments:
- rows (list): rows to add or concatenate

TODO:
- is a shuffle needed?
- should the index be maintained or reordered?
- should new indexes be ordered or not?
column_agg(column, agg_func)

Perform an aggregate function on the given column.

Arguments:
- column (int): column index
- agg_func (function): aggregate function to perform on the column

Returns the aggregate result.

Example:

dataset.column_agg(3, min)
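On a list backend, the behavior of column_agg can be sketched with plain Python (a conceptual stand-in, not the datafuzz implementation):

```python
records = [[1, 9], [4, 2], [3, 5]]

def column_agg(records, column, agg_func):
    # Collect the values at the given column index and aggregate them
    return agg_func(row[column] for row in records)

column_agg(records, 1, min)  # -> 2 (the smallest value in column 1)
```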
column_dtype(column)

Return the dtype of a column.

Arguments:
- column (int): column index

Returns:
- the data type of the column

TODO:
- determine a smart way to test more than one row for a list?
column_idx(column)

Return the numeric index of a column.

NOTE: if the column is not found, raises an AttributeError.
input_filename

Return the filename if the input follows the proper file format: file://[absolute or relative filepath]

NOTE: this will raise an exception if the file is not found.
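The file:// convention can be checked with a regex along these lines (FILE_REGEX here is a hypothetical re-creation, not datafuzz's actual pattern):

```python
import re

# Hypothetical stand-in for the FILE_REGEX attribute described above
FILE_REGEX = re.compile(r'^file://(.+)$')

def parse_file_uri(uri):
    # Return the filepath portion of a file:// URI, or None if it doesn't match
    match = FILE_REGEX.match(uri)
    return match.group(1) if match else None

parse_file_uri('file://data/input.csv')  # -> 'data/input.csv'
```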
output_filename

Return the filename if the output follows the proper file format: file://[absolute or relative filepath]
sample(percentage, columns=False)

Get a sample from the dataset.

Arguments:
- percentage (float): fraction of the dataset to sample; should be a value from 0.0-1.0

Kwargs:
- columns (bool): option to sample columns from the dataset; default is False

Returns:
- a sample from the dataset with matching datatype
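On a list backend, percentage-based row sampling can be sketched as follows (a conceptual stand-in; the real method also supports column sampling and other backends):

```python
import random

def sample_rows(records, percentage, seed=None):
    # Draw round(len * percentage) rows without replacement
    rng = random.Random(seed)
    k = round(len(records) * percentage)
    return rng.sample(records, k)

sample_rows(list(range(100)), 0.25, seed=0)  # 25 rows from the original 100
```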
to_output()

Transform the DataSet records to output, using the helper obj_to_output located in output/helpers.py.

Returns the output object or filepath.
Strategy classes

class datafuzz.strategy.Strategy(dataset, **kwargs)

Strategy objects apply predefined noise and fuzz to datasets.

Parameters:
- dataset (datafuzz.DataSet): dataset to noise / alter

Kwargs:
- percentage (int): percentage to distort (0-100); defaults to 30 if none given

Attributes:
- dataset (datafuzz.DataSet): dataset to noise / alter
- percentage (float): percentage to distort (0-1)

NOTE: each strategy type may have additional required keyword arguments.

see also: duplicator.Duplicator, noise.NoiseMaker and fuzz.Fuzzer
apply_func_to_column(function, column, dataset=None)

Apply a function to a column in a given dataset. (This should work as uniformly as possible across data types.)

Arguments:
- function (lambda or other func): function to apply
- column (int): column index

Kwargs:
- dataset (dataset.DataSet): dataset to use; defaults to self.dataset

Returns:
- None

NOTE: this performs transformations on dataset.records in place.
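The in-place behavior noted above can be illustrated on a list-of-lists dataset (a conceptual sketch, not the library's multi-backend implementation):

```python
def apply_func_to_column(records, function, column):
    # Mutate each row in place, replacing the value at the column index
    for row in records:
        row[column] = function(row[column])

data = [[1, 'a'], [2, 'b']]
apply_func_to_column(data, lambda x: x * 10, 0)
# data is now [[10, 'a'], [20, 'b']]
```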
get_numeric_columns(columns)

Ensure columns are numeric; this will get the indexes of string column names (i.e. pandas columns or dict keys).

Arguments:
- columns (list of str or int): column list

Returns:
- columns (list of int): column list (only ints)
num_rows

Return the number of rows to transform in the dataset, based on the given percentage.

NOTE: this uses rounding, so only whole numbers are returned.
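The rounding behavior can be sketched with a stand-alone version of the property (hypothetical, not the library's code):

```python
def num_rows(total_rows, percentage):
    # percentage is stored as a 0-1 float; round to the nearest whole row
    return round(total_rows * percentage)

num_rows(10, 0.3)  # -> 3
num_rows(7, 0.3)   # -> 2
```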
class datafuzz.duplicator.Duplicator(dataset, **kwargs)

Duplicator is used to duplicate rows in a dataset.

see also: strategy.Strategy
class datafuzz.fuzz.Fuzzer(dataset, **kwargs)

Fuzzer is used as a strategy to add "dumb" fuzzing methods (i.e. random bad values). These transformations are based mainly on the column type.

see also: strategy.Strategy
fuzz_date()

Return a random choice from the date fuzz helpers.

Possible transformations:
- shift_time: shift the time by a random amount
- date_to_str: transform to string
fuzz_numeric()

Return a random choice from the numeric fuzz helpers.

Possible transformations:
- nanify: insert null values (sometimes strs)
- bigints: return big magic numbers
- hexify: return hex value
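Conceptually, each fuzz method picks one helper at random and returns a bad value of that kind. A minimal sketch (the helper values below are assumptions, not datafuzz's actual magic numbers):

```python
import random

def fuzz_numeric(rng=None):
    rng = rng or random.Random()
    helpers = [
        lambda: None,                      # nanify: a null-ish value
        lambda: 2 ** 63 - 1,               # bigints: a big magic number
        lambda: hex(rng.getrandbits(32)),  # hexify: a hex string
    ]
    return rng.choice(helpers)()

fuzz_numeric(random.Random(0))  # one of: None, a big int, or a hex string
```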
fuzz_random()

Return a random choice from the random fuzz helpers.

Possible transformations:
- sql: returns unkind sql
- metachars: inserts metacharacters
- files: returns filepaths or bash
- delimiter: inserts multiple delimiters
- emoji: inserts one random emoji
class datafuzz.noise.NoiseMaker(dataset, **kwargs)

NoiseMaker applies noisy data transformations to a given dataset.

see also: strategy.Strategy
randomize()

Set random values for the sample in columns.

NOTE: this will vary based on column type.
set_value(value, column=None)

Set a value for a series of columns or one column.

Arguments:
- value (obj): value to set

Kwargs:
- column (str or int): name or index of the column

TODO:
- should this be available on the Strategy class?
string_permutation(column=None)

Permute string values for the sample in columns.

TODO:
- add permutations for missing characters
- flipped strings
- typos
- homonyms / autocorrect
type_transform()

Transform types for the sample in columns.

NOTE: if a string column is used and the values cannot be transformed into integer or float values, you may not see a useful transformation.

TODO:
- for strings, should numeric values be inserted as strings instead?
use_range()

Use values from a range to set values in columns.

If limits were not passed during initialization, this method will attempt to determine good limits based on the column ranges and use those.

NOTE: range is only available for numeric columns.

TODO:
- should we calculate IQR and insert outliers?
- if not, should add_outliers be a new option for noise?
Parser classes

class datafuzz.parsers.StrategyYAMLParser(file_path)

StrategyYAMLParser is used to parse strategies and fuzz / transform data using a simple YAML definition.

see also: parsers.core.BaseYAMLParser
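A hypothetical YAML definition of the shape this parser consumes (the key names mirror the properties below, but the exact strategy schema is an assumption):

```yaml
input: file://input.csv
output: file://noisy.csv
strategies:
  - type: noise
    percentage: 30
```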
db_uri

Return the data db_uri from the parsed YAML.

input

Return the data input from the parsed YAML.

output

Return the data output from the parsed YAML.

query

Return the data query from the parsed YAML.

strategies

Return the strategies from the parsed YAML.

table

Return the data table from the parsed YAML.
class datafuzz.parsers.StrategyCLIParser(**kwargs)

StrategyCLIParser is used to parse strategies and fuzz / transform data using a simple CLI definition.
init_parser()

Initialize the parser with required and optional arguments.

Returns:
- argparse.ArgumentParser
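An init_parser-style method might build its parser like this (the argument names are illustrative assumptions, not the actual datafuzz CLI flags):

```python
import argparse

def init_parser():
    parser = argparse.ArgumentParser(description='apply fuzzing strategies to a dataset')
    parser.add_argument('--input', required=True, help='input data (e.g. file://in.csv)')
    parser.add_argument('--output', required=True, help='output target (e.g. file://out.csv)')
    parser.add_argument('--strategy', action='append', help='strategy to apply (repeatable)')
    return parser

args = init_parser().parse_args(['--input', 'file://in.csv', '--output', 'file://out.csv'])
```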
class datafuzz.parsers.SchemaYAMLParser(file_name)

SchemaYAMLParser is used to generate data using a simple YAML definition.

see also: parsers.core.BaseYAMLParser
num_rows

Return num_rows from the parsed YAML.

output

Return the output from the parsed YAML.

parse_timeseries()

Parse and set values related to the timeseries.

Raises SyntaxError if the start or end time was not properly parsed.

schema

Return the schema from the parsed YAML.

timeseries

Return the timeseries from the parsed YAML.
Output classes

class datafuzz.output.CSVOutput(dataset, **kwargs)

CSV output for writing datasets to a CSV file.

see also: datafuzz.output.BaseOutput
class datafuzz.output.JSONOutput(dataset, **kwargs)

JSON output for writing datasets to a JSON file.

see also: datafuzz.output.BaseOutput
class datafuzz.output.SQLOutput(dataset, **kwargs)

Database output for writing datasets to a table.

see also: datafuzz.output.BaseOutput

Extra parameters:
- db_uri (str): database URI string
- table (str): database table name
datafuzz.output.obj_to_output(obj)

Transform DataSet or generator records to output.

Supported outputs:
- dataset, pandas, numpy, list
- csv and json (specify file://$NAME.csv or file://$NAME.json)
- sql (specify db_uri and table)

NOTE: raises an exception if an unsupported output is set.
Generator classes

class datafuzz.generators.DatasetGenerator(schema_parser)

DatasetGenerator creates a dataset when given a parsed YAML definition, a series of CLI arguments, or passed arguments.

Attributes:
- parser (parsers.SchemaYAMLParser, parsers.SchemaCLIParser or dict): parsed YAML, CLI or dict with the necessary keys
- output (str): output description
- num_rows (int): number of rows to generate
- timeseries (bool): whether to generate a timeseries
- records (list): generated data
- fake (faker.Faker): Faker object used to generate data

Parser parameters:
- schema (dict): dictionary of column names and values to use (i.e. {'foo': 'faker.name', ...})
- num_rows (int): number of rows to generate
- start_time (datetime): datetime to start from if timeseries
- increments (str): if timeseries, the type of increments you want, based on a series of string choices ('days', 'hours', 'seconds', 'random')
- end_time (datetime): optional end date if timeseries

see also: parsers.core
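Since the parser may also be a plain dict, a hypothetical dict of the expected shape could look like this (the keys follow the parameter descriptions above; the exact value grammar for schema entries is an assumption):

```python
# Hypothetical parsed-schema dict with the keys described above
schema_parser = {
    'schema': {'name': 'faker.name', 'color': ['red', 'green', 'blue']},
    'num_rows': 100,
    'output': 'file://generated.csv',
}
```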
generate()

Generate the dataset (self.records) based on the given schema.

If a timeseries is selected, this method delegates to Generator.generate_timeseries.
generate_row()

Generate a row based on the parsed schema.

NOTE: this uses string matching to determine whether the schema is a list, a faker definition, or one of the matching patterns in EVAL_REGEX, and then generates data based on those predefined selections.
generate_timeseries()

Generate a timeseries with a timestamp column, using the parser's start date and increments.

NOTE: a warning will be logged if an end time is given and the number of rows is not reached before the end time. The end time takes precedence if specified.

TODO: should num_rows take precedence over end time?
increment_time()

For timeseries generation, increment the start time according to the parser requirements:
- hours, days, seconds
- if not set, returns a random mix

TODO: allow for a set interval?
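The increment logic can be sketched as follows (a conceptual stand-in; the bounds used for the random mix are an assumption):

```python
from datetime import datetime, timedelta
import random

def increment_time(current, increments='hours', rng=None):
    # Fixed-size steps for the named increments; otherwise a random mix
    steps = {
        'hours': timedelta(hours=1),
        'days': timedelta(days=1),
        'seconds': timedelta(seconds=1),
    }
    if increments in steps:
        return current + steps[increments]
    rng = rng or random.Random()
    return current + timedelta(seconds=rng.randint(1, 86400))

increment_time(datetime(2020, 1, 1), 'days')  # -> datetime(2020, 1, 2, 0, 0)
```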
output_filename

Return the filename if the output follows the proper file format: file://[absolute or relative filepath]

For more use cases, please reference the Usage section.