Developer Interface

Here are the main interfaces of datafuzz for general use.
Dataset class

class datafuzz.DataSet(input_obj, **kwargs)

DataSet objects are the primary datatype for passing data around in datafuzz.

If pandas is installed, DataSet uses a DataFrame to load and transform data; otherwise it uses a list. You can also opt out of pandas by passing the keyword argument pandas=False.

Supported inputs are JSON and CSV files, numpy 2D arrays, SQL queries (you must pass a db_uri keyword argument and a query argument), pandas DataFrames, and Python lists (of dictionaries or lists).
Attributes:
- DATA_TYPES (list): possible datatypes (pandas, numpy, list)
- FILE_REGEX (str): regex used to find the file name
- USE_PANDAS (bool): whether pandas is installed and allowed for use (i.e. pandas=False was not passed)
- records (list): data records
- input (obj): initial input for the dataset (can be a dataframe, list, numpy array, filename or sql)
- output (str): output (if specified, can be a dataframe, list, numpy array, filename or sql)
- original (obj): copy of the input which won't be modified
- data_type (str): dataset datatype (pandas, numpy, list)
- db_uri (str): database connection string (required only if using sql as input or output)
- query (str): database select query string (required only if using sql as input)
- table (str): database output table (required only if using sql as output)
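For illustration, these are the kinds of Python-native inputs described above; the commented DataSet call is a sketch and assumes datafuzz is installed:

```python
# Illustrative inputs a DataSet can wrap: a list of dictionaries or a list of lists
rows_as_dicts = [{'id': 1, 'score': 0.9}, {'id': 2, 'score': 0.4}]
rows_as_lists = [[1, 0.9], [2, 0.4]]

# With datafuzz installed, either could be passed directly, e.g.:
# from datafuzz import DataSet
# ds = DataSet(rows_as_dicts, pandas=False)  # force the list backend
```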
append(rows)

Append rows to the DataSet records.

Arguments:
- rows (list): rows to add or concatenate

TODO:
- is a shuffle needed?
- should the index be maintained or reordered?
- should new indexes be ordered or not?
column_agg(column, agg_func)

Perform an aggregate function on the given column.

Arguments:
- column (int): column index
- agg_func (function): aggregate function to perform on the column

Returns the aggregate result.

Example:

dataset.column_agg(3, min)
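On a list backend, the behavior of column_agg can be sketched with plain Python (a conceptual stand-in, not the datafuzz implementation):

```python
records = [[1, 9], [4, 2], [3, 5]]

def column_agg(records, column, agg_func):
    # Collect the values at the given column index and aggregate them
    return agg_func(row[column] for row in records)

column_agg(records, 1, min)  # -> 2 (the smallest value in column 1)
```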
column_dtype(column)

Return the dtype of a column.

Arguments:
- column (int): column index

Returns:
- the data type of the column

TODO:
- determine a smart way to test more than one row for a list?
column_idx(column)

Return the numeric index of a column.

NOTE: if the column is not found, raises an AttributeError.
input_filename

Return the filename if the input follows the proper file format: file://[absolute or relative filepath]

NOTE: this will raise an exception if the file is not found.
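The file:// convention can be checked with a regex along these lines (FILE_REGEX here is a hypothetical re-creation, not datafuzz's actual pattern):

```python
import re

# Hypothetical stand-in for the FILE_REGEX attribute described above
FILE_REGEX = re.compile(r'^file://(.+)$')

def parse_file_uri(uri):
    # Return the filepath portion of a file:// URI, or None if it doesn't match
    match = FILE_REGEX.match(uri)
    return match.group(1) if match else None

parse_file_uri('file://data/input.csv')  # -> 'data/input.csv'
```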
output_filename

Return the filename if the output follows the proper file format: file://[absolute or relative filepath]
sample(percentage, columns=False)

Get a sample from the dataset.

Arguments:
- percentage (float): fraction of the dataset to sample; should be a value from 0.0-1.0

Kwargs:
- columns (bool): option to sample columns from the dataset; default is False

Returns:
- a sample from the dataset with matching datatype
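On a list backend, percentage-based row sampling can be sketched as follows (a conceptual stand-in; the real method also supports column sampling and other backends):

```python
import random

def sample_rows(records, percentage, seed=None):
    # Draw round(len * percentage) rows without replacement
    rng = random.Random(seed)
    k = round(len(records) * percentage)
    return rng.sample(records, k)

sample_rows(list(range(100)), 0.25, seed=0)  # 25 rows from the original 100
```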
to_output()

Transform the DataSet records to output, using the helper obj_to_output located in output/helpers.py.

Returns the output object or filepath.
Strategy classes

class datafuzz.strategy.Strategy(dataset, **kwargs)

Strategy objects apply predefined noise and fuzz to datasets.

Parameters:
- dataset (datafuzz.DataSet): dataset to noise / alter

Kwargs:
- percentage (int): percentage to distort (0-100); defaults to 30 if none given

Attributes:
- dataset (datafuzz.DataSet): dataset to noise / alter
- percentage (float): percentage to distort (0-1)

NOTE: each strategy type may have additional required keyword arguments.

see also: duplicator.Duplicator, noise.NoiseMaker and fuzz.Fuzzer
apply_func_to_column(function, column, dataset=None)

Apply a function to a column in a given dataset. (This should work as uniformly as possible across data types.)

Arguments:
- function (lambda or other func): function to apply
- column (int): column index

Kwargs:
- dataset (dataset.DataSet): dataset to use; defaults to self.dataset

Returns:
- None

NOTE: this performs transformations on dataset.records in place.
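The in-place behavior noted above can be illustrated on a list-of-lists dataset (a conceptual sketch, not the library's multi-backend implementation):

```python
def apply_func_to_column(records, function, column):
    # Mutate each row in place, replacing the value at the column index
    for row in records:
        row[column] = function(row[column])

data = [[1, 'a'], [2, 'b']]
apply_func_to_column(data, lambda x: x * 10, 0)
# data is now [[10, 'a'], [20, 'b']]
```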
get_numeric_columns(columns)

Ensure columns are numeric; this will get the indexes of string column names (i.e. pandas columns or dict keys).

Arguments:
- columns (list of str or int): column list

Returns:
- columns (list of int): column list (only ints)
num_rows

Return the number of rows to transform in the dataset, based on the given percentage.

NOTE: this uses rounding, so only whole numbers are returned.
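The rounding behavior can be sketched with a stand-alone version of the property (hypothetical, not the library's code):

```python
def num_rows(total_rows, percentage):
    # percentage is stored as a 0-1 float; round to the nearest whole row
    return round(total_rows * percentage)

num_rows(10, 0.3)  # -> 3
num_rows(7, 0.3)   # -> 2
```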
class datafuzz.duplicator.Duplicator(dataset, **kwargs)

Duplicator is used to duplicate rows in a dataset.

see also: strategy.Strategy
class datafuzz.fuzz.Fuzzer(dataset, **kwargs)

Fuzzer is used as a strategy to add "dumb" fuzzing methods (i.e. random bad values). These transformations are based mainly on the column type.

see also: strategy.Strategy
fuzz_date()

Return a random choice from the date fuzz helpers.

Possible transformations:
- shift_time: shift the time by a random amount
- date_to_str: transform to string
fuzz_numeric()

Return a random choice from the numeric fuzz helpers.

Possible transformations:
- nanify: insert null values (sometimes strs)
- bigints: return big magic numbers
- hexify: return hex value
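Conceptually, each fuzz method picks one helper at random and returns a bad value of that kind. A minimal sketch (the helper values below are assumptions, not datafuzz's actual magic numbers):

```python
import random

def fuzz_numeric(rng=None):
    rng = rng or random.Random()
    helpers = [
        lambda: None,                      # nanify: a null-ish value
        lambda: 2 ** 63 - 1,               # bigints: a big magic number
        lambda: hex(rng.getrandbits(32)),  # hexify: a hex string
    ]
    return rng.choice(helpers)()

fuzz_numeric(random.Random(0))  # one of: None, a big int, or a hex string
```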
fuzz_random()

Return a random choice from the random fuzz helpers.

Possible transformations:
- sql: returns unkind sql
- metachars: inserts metacharacters
- files: returns filepaths or bash
- delimiter: inserts multiple delimiters
- emoji: inserts one random emoji
class datafuzz.noise.NoiseMaker(dataset, **kwargs)

NoiseMaker applies noisy data transformations to a given dataset.

see also: strategy.Strategy
randomize()

Set random values for the sample in columns.

NOTE: this will vary based on column type.
set_value(value, column=None)

Set a value for a series of columns or one column.

Arguments:
- value (obj): value to set

Kwargs:
- column (str or int): name or index of the column

TODO:
- should this be available on the Strategy class?
string_permutation(column=None)

Permute string values for the sample in columns.

TODO:
- add permutations for missing characters
- flipped strings
- typos
- homonyms / autocorrect
type_transform()

Transform types for the sample in columns.

NOTE: if a string column is used and the values cannot be transformed into integer or float values, you may not see a useful transformation.

TODO:
- for strings, should numeric values be inserted as strings instead?
use_range()

Use values from a range to set values in columns.

If limits were not passed during initialization, this method will attempt to determine good limits based on the column ranges and use those.

NOTE: range is only available for numeric columns.

TODO:
- should we calculate IQR and insert outliers?
- if not, should add_outliers be a new option for noise?
Parser classes

class datafuzz.parsers.StrategyYAMLParser(file_path)

StrategyYAMLParser is used to parse strategies and fuzz / transform data using a simple YAML definition.

see also: parsers.core.BaseYAMLParser
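A hypothetical YAML definition of the shape this parser consumes (the key names mirror the properties below, but the exact strategy schema is an assumption):

```yaml
input: file://input.csv
output: file://noisy.csv
strategies:
  - type: noise
    percentage: 30
```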
db_uri

Return the data db_uri from the parsed YAML.

input

Return the data input from the parsed YAML.

output

Return the data output from the parsed YAML.

query

Return the data query from the parsed YAML.

strategies

Return the strategies from the parsed YAML.

table

Return the data table from the parsed YAML.
class datafuzz.parsers.StrategyCLIParser(**kwargs)

StrategyCLIParser is used to parse strategies and fuzz / transform data using a simple CLI definition.
init_parser()

Initialize the parser with required and optional arguments.

Returns:
- argparse.ArgumentParser
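An init_parser-style method might build its parser like this (the argument names are illustrative assumptions, not the actual datafuzz CLI flags):

```python
import argparse

def init_parser():
    parser = argparse.ArgumentParser(description='apply fuzzing strategies to a dataset')
    parser.add_argument('--input', required=True, help='input data (e.g. file://in.csv)')
    parser.add_argument('--output', required=True, help='output target (e.g. file://out.csv)')
    parser.add_argument('--strategy', action='append', help='strategy to apply (repeatable)')
    return parser

args = init_parser().parse_args(['--input', 'file://in.csv', '--output', 'file://out.csv'])
```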
class datafuzz.parsers.SchemaYAMLParser(file_name)

SchemaYAMLParser is used to generate data using a simple YAML definition.

see also: parsers.core.BaseYAMLParser
num_rows

Return num_rows from the parsed YAML.

output

Return the output from the parsed YAML.

parse_timeseries()

Parse and set values related to the timeseries.

Raises SyntaxError if the start or end time was not properly parsed.

schema

Return the schema from the parsed YAML.

timeseries

Return the timeseries from the parsed YAML.
Output classes

class datafuzz.output.CSVOutput(dataset, **kwargs)

CSV output for writing datasets to a CSV file.

see also: datafuzz.output.BaseOutput
class datafuzz.output.JSONOutput(dataset, **kwargs)

JSON output for writing datasets to a JSON file.

see also: datafuzz.output.BaseOutput
class datafuzz.output.SQLOutput(dataset, **kwargs)

Database output for writing datasets to a table.

see also: datafuzz.output.BaseOutput

Extra parameters:
- db_uri (str): database URI string
- table (str): database table name
datafuzz.output.obj_to_output(obj)

Transform DataSet or generator records to output.

Supported outputs:
- dataset, pandas, numpy, list
- csv and json (specify file://$NAME.csv or file://$NAME.json)
- sql (specify db_uri and table)

NOTE: raises an exception if an unsupported output is set.
Generator classes

class datafuzz.generators.DatasetGenerator(schema_parser)

DatasetGenerator creates a dataset when given a parsed YAML definition, a series of CLI arguments, or passed arguments.

Attributes:
- parser (parsers.SchemaYAMLParser, parsers.SchemaCLIParser or dict): parsed YAML, CLI or dict with the necessary keys
- output (str): output description
- num_rows (int): number of rows to generate
- timeseries (bool): whether to generate a timeseries
- records (list): generated data
- fake (faker.Faker): Faker object used to generate data

Parser parameters:
- schema (dict): dictionary of column names and values to use (i.e. {'foo': 'faker.name', ...})
- num_rows (int): number of rows to generate
- start_time (datetime): datetime to start from if timeseries
- increments (str): if timeseries, the type of increments you want, based on a series of string choices ('days', 'hours', 'seconds', 'random')
- end_time (datetime): optional end date if timeseries

see also: parsers.core
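Since the parser may also be a plain dict, a hypothetical dict of the expected shape could look like this (the keys follow the parameter descriptions above; the exact value grammar for schema entries is an assumption):

```python
# Hypothetical parsed-schema dict with the keys described above
schema_parser = {
    'schema': {'name': 'faker.name', 'color': ['red', 'green', 'blue']},
    'num_rows': 100,
    'output': 'file://generated.csv',
}
```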
generate()

Generate the dataset (self.records) based on the given schema.

If a timeseries is selected, this method delegates to Generator.generate_timeseries.
generate_row()

Generate a row based on the parsed schema.

NOTE: this uses string matching to determine whether the schema is a list, a faker definition, or one of the matching patterns in EVAL_REGEX, and then generates data based on those predefined selections.
generate_timeseries()

Generate a timeseries with a timestamp column, using the parser's start date and increments.

NOTE: a warning will be logged if an end time is given and the number of rows is not reached before the end time. The end time takes precedence if specified.

TODO: should num_rows take precedence over end time?
increment_time()

For timeseries generation, increment the start time according to the parser requirements:
- hours, days, seconds
- if not set, returns a random mix

TODO: allow for a set interval?
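The increment logic can be sketched as follows (a conceptual stand-in; the bounds used for the random mix are an assumption):

```python
from datetime import datetime, timedelta
import random

def increment_time(current, increments='hours', rng=None):
    # Fixed-size steps for the named increments; otherwise a random mix
    steps = {
        'hours': timedelta(hours=1),
        'days': timedelta(days=1),
        'seconds': timedelta(seconds=1),
    }
    if increments in steps:
        return current + steps[increments]
    rng = rng or random.Random()
    return current + timedelta(seconds=rng.randint(1, 86400))

increment_time(datetime(2020, 1, 1), 'days')  # -> datetime(2020, 1, 2, 0, 0)
```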
output_filename

Return the filename if the output follows the proper file format: file://[absolute or relative filepath]

For more use cases, please reference the Usage section.