Usage
To use datafuzz in a project, you have multiple options: the command line interface (CLI), a regular Python script, or a Jupyter notebook.
CLI
The easiest and fastest way to get started with datafuzz is via the command line interface, or CLI. There are a few ways to do so. First, determine whether you need to generate data or simply modify data you already have.
Generate command
If you need to generate data, use the CLI generate command. It can be used in two ways:
- Utilize a YAML file which defines the schema for your synthetic data.
- Pass descriptions in via command line flags (not recommended for long or complex schemas, as this is not easily maintainable).
A good example of the YAML usage is included in the Quickstart.
Let’s take a look at how to use the command line flags:
$ datafuzz generate --non-yaml -h
usage: datafuzz [-h] [-f FIELDS] [-v VALUES] [-o OUTPUT] [-n NUM_ROWS]
[--start_time START_TIME] [--end_time END_TIME]
[--increments {hours,seconds,days,random}]
{generate}
Generate dataset: to use
positional arguments:
{generate}
optional arguments:
-h, --help show this help message and exit
-f FIELDS, --fields FIELDS
semicolon-delimited string of field names
-v VALUES, --values VALUES
semicolon-delimited string of values.This can be a mix
of faker types and ranges
-o OUTPUT, --output OUTPUT
what output to use
-n NUM_ROWS, --num_rows NUM_ROWS
number of rows to generate
--start_time START_TIME
start time of timeseries in isoformat:YYYY-MM-
DDThh:mm:ss
--end_time END_TIME end time of timeseries in isoformat: YYYY-MM-
DDThh:mm:ss
--increments {hours,seconds,days,random}
how to increment entries
To specify that we aren’t using YAML, we pass the --non-yaml flag, which gives us access to the CLI parsers. For generation, there is a long list of possible options. Let’s try a few:
$ datafuzz generate -f 'name;age;city' -v 'faker.name;range(30,40);faker.city' -n 200 -o file://friends.csv
dataset now available at friends.csv
Now let’s check the content:
$ head -n 5 friends.csv
name,city,age
Eric Walsh,West Brandy,36
Jason Willis,Port Stephen,37
Kyle Greer,North Brandon,32
Mathew Ward,North Ginabury,32
That was easy! :)
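To make the pairing between -f and -v concrete, here is a minimal sketch in plain Python of how the two semicolon-delimited strings line up position by position. It only illustrates the flag format and is not datafuzz's actual parser; the helper name pair_fields_and_values is hypothetical.

```python
def pair_fields_and_values(fields: str, values: str) -> dict:
    """Pair each field name with the value description at the same position.

    Hypothetical helper for illustration; datafuzz's own parsing may differ.
    """
    names = fields.split(';')
    descriptions = values.split(';')
    if len(names) != len(descriptions):
        raise ValueError('expected one value description per field name')
    return dict(zip(names, descriptions))


# Same strings as in the generate command above.
schema = pair_fields_and_values('name;age;city',
                                'faker.name;range(30,40);faker.city')
for name, description in schema.items():
    print(f'{name}: {description}')
```

Checking that both strings have the same number of entries before running the command is a cheap way to catch a missing semicolon.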
For a review of all options you can use with the generate command, check out the Generators.
Run command
A second option is to modify data you already have or data you just generated. To do so, use the run command. Similar to the generate command, it can be used in two ways:
- Utilize a YAML file which defines the different transformations to run on your data
- Pass a type of run directly into the command line (and repeat as needed)
A good example of the YAML usage is included in the Quickstart.
Let’s take a look at run with just command line options:
$ datafuzz run --non-yaml -h
usage: datafuzz [-h] [-i INPUT] [-o OUTPUT] [-s STRATEGIES] [--db_uri DB_URI]
[--query QUERY] [--table TABLE]
{run}
Apply datafuzz strategies to input, return output
positional arguments:
{run}
optional arguments:
-h, --help show this help message and exit
-i INPUT, --input INPUT
input string (filename or sql)
-o OUTPUT, --output OUTPUT
input string (filename or sql)
-s STRATEGIES, --strategies STRATEGIES
dictionary defining the strategies to take
--db_uri DB_URI If using database, the db URI to connect
--query QUERY If using db input, query to collect data
--table TABLE If using db output, table to insert into
Okay, let’s give it a shot with our newly generated friends.csv file:
$ datafuzz run -i file://friends.csv -o file://fuzzy_friends.csv -s '{"type": "fuzz", "percentage": 30}'
dataset now available at fuzzy_friends.csv
And we can check our output:
$ head -n 5 fuzzy_friends.csv
,name,city,age
0,Eric Walsh,b'\xef\xbb\xbf'West Brandy,36
1,Jason Willis,Port Stephen,37
2,Kyle Greer,North Brandon,32
3,Mathew Ward,North Ginabury,32
And indeed, our friends now have some fuzz! For a review of all options you can use with the run command, check out the Strategies.
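Because -s expects a dictionary written as a JSON string, quoting it by hand can get error-prone as strategies grow. One option is to build the command in Python. The sketch below uses only the standard library and the flags shown in the help output above; the "type" and "percentage" keys are copied from the example command, and nothing here calls datafuzz itself.

```python
import json
import shlex

# Strategy dictionary copied from the example command above.
strategy = {'type': 'fuzz', 'percentage': 30}

# Serialize the strategy to JSON and assemble the argument list.
args = [
    'datafuzz', 'run',
    '-i', 'file://friends.csv',
    '-o', 'file://fuzzy_friends.csv',
    '-s', json.dumps(strategy),
]

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
command = shlex.join(args)
print(command)
```

If you would rather launch the command from Python than paste it into a shell, the same args list can be handed to subprocess.run, which avoids shell quoting altogether.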
For a more in-depth look into datafuzz, see Developer Interface.
Using datafuzz with Python or Jupyter
You don’t need to use datafuzz with the CLI; you can also use it with your native Python scripts, frameworks or Jupyter notebooks. To see some Jupyter notebook integration examples, check out the Jupyter Notebooks included in the examples directory.
For integration with your Python script, the necessary parameters for initialization may differ depending on the class you are using to transform your data.
To get started, you might begin with a dataset in the shape of a Pandas DataFrame, a numpy matrix, or even a Python list of dictionaries or lists. You could also generate a new dataset by using the generator class.
Let’s generate a simple dataset using the generator:
from datafuzz.generators import DatasetGenerator

generator = DatasetGenerator({
    'output': 'pandas',
    'schema': {
        'category': list('ABCD'),
        'model': range(4, 8),
        'plate': 'faker.license_plate',
        'year': range(2001, 2018),
        'color': 'faker.safe_color_name',
        'price': range(20000, 50000, 1000)
    },
    'num_rows': 1000,
})
generator.generate()
dataset = generator.to_output()
print(dataset.head())
Your output should look something like this:
category color model plate price year
0 A fuchsia 4 0736 CF 20000 2003
1 B teal 6 EXS 036 29000 2004
2 D teal 6 1QX5388 32000 2009
3 C navy 5 6P 15774 30000 2011
4 A white 4 0SQ D88 31000 2013
Now we have a dataset that holds our generated dataframe. If you had instead imported or transformed existing data into a dataframe, you could start at this step.
Now that we have some data to work with, let’s see which transformations are available. The strategies available are the following classes:
- Duplicator
- Fuzzer
- NoiseMaker
Let’s use the NoiseMaker class to add some noise to our dataset.
from datafuzz import DataSet, NoiseMaker

dataset = DataSet(dataset,
                  output='file://my_new_file.json')
noiser = NoiseMaker(
    dataset,
    noise=['add_nulls', 'random'],
    columns=['price', 'model', 'year'],
    percentage=30,
)
noiser.run_strategy()
At this point, your DataSet object is transformed. You can check it by looking at the first five items in the dataset:
print(dataset[:5])
Your data should now be a bit messy:
category color model plate price year
0 D maroon 4.376930 179-IYJ 29000.000000 2006.000000
1 D green 4.000000 P 336983 15468.136372 1598.702067
2 D gray 5.270262 DIV-042 20000.000000 2002.000000
3 C aqua 2.017815 84R 707 38000.000000 NaN
4 C fuchsia 6.000000 8078 TU 37000.000000 1995.355014
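To build intuition for what a null-adding noise step with percentage=30 might do, the following sketch nulls values in a list of dictionaries using only the standard library. It is not datafuzz's implementation: it assumes that percentage means the share of rows affected per column, the add_nulls function is a hypothetical stand-in that merely borrows the name of the noise type, and the rows are toy data.

```python
import random


def add_nulls(rows, columns, percentage, seed=None):
    """Set `percentage` percent of the rows to None in each given column.

    Hypothetical stand-in for illustration; assumes percentage is the share
    of rows affected per column, which datafuzz may define differently.
    """
    rng = random.Random(seed)
    count = round(len(rows) * percentage / 100)
    noisy = [dict(row) for row in rows]  # copy so the input stays untouched
    for column in columns:
        # Pick `count` distinct row positions for this column.
        for index in rng.sample(range(len(noisy)), count):
            noisy[index][column] = None
    return noisy


# Toy rows, loosely shaped like the generated dataset above.
rows = [{'model': 4 + i % 4, 'price': 20000 + 1000 * i} for i in range(10)]
noisy = add_nulls(rows, columns=['price', 'model'], percentage=30, seed=1)
for row in noisy:
    print(row)
```

Passing a seed makes the sketch reproducible, which is handy when you want to compare a clean and a noisy copy side by side.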
You can continue running transformations, if you like:
from datafuzz import Duplicator

duplicator = Duplicator(
    dataset,
    percentage=20,
)
duplicator.run_strategy()
When you are done with the transformations, you can export the data according to the output you set when the dataset was first initialized. You can also set a new output string. Available outputs are as follows:
- pandas dataframe: ‘pandas’
- numpy 2D array: ‘numpy’
- datafuzz.DataSet: ‘dataset’
- list: ‘list’
- CSV files: ‘file://foo.csv’
- JSON files: ‘file://foo.json’
- SQL: ‘sql’
If you use the ‘sql’ output, you need to also set a value for 'table' and for 'db_uri'. For an in-depth treatment of input and output options, please see I/O Options.
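The rules in the list above can be summarized in a small checker. This is a sketch for illustration only: check_output_options is a hypothetical helper that is not part of datafuzz, and the table name and database URI in the usage lines are made up.

```python
def check_output_options(output, table=None, db_uri=None):
    """Return a short description of an output string, or raise ValueError.

    Hypothetical helper mirroring the list above; not part of datafuzz.
    """
    simple = {
        'pandas': 'pandas dataframe',
        'numpy': 'numpy 2D array',
        'dataset': 'datafuzz.DataSet',
        'list': 'list',
    }
    if output in simple:
        return simple[output]
    if output == 'sql':
        # The 'sql' output also needs a table and a database URI.
        if not table or not db_uri:
            raise ValueError("'sql' output requires both table and db_uri")
        return 'SQL'
    if output.startswith('file://'):
        if output.endswith('.csv'):
            return 'CSV file'
        if output.endswith('.json'):
            return 'JSON file'
    raise ValueError(f'unrecognized output: {output!r}')


print(check_output_options('file://my_new_file.json'))
# Made-up table name and URI, purely for illustration.
print(check_output_options('sql', table='cars', db_uri='sqlite:///cars.db'))
```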
To then get the output, call to_output:
output = dataset.to_output()
print(output)
And now to check the file:
$ head -c 200 my_new_file.json
{"0":{"category":"D","color":"maroon","model":2.5774000851,"plate":"0ME 062","price":null,"year":2008.0},"1":{"category":"C","color":"black","model":null,"plate":"UGV-266","price":39000.0,"year":2010.
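The JSON output is keyed by row index, with one object per row. A quick way to confirm that nulls were injected is to load it with the standard library and count them per column. The sketch below works on an inline sample rather than the real file: the first record is copied from the output above, and because the second record is cut off in that output, its incomplete year value is simply left out.

```python
import json

# Abbreviated sample shaped like the file above. The second record is
# truncated in the original output, so its 'year' key is omitted here.
sample = '''{
  "0": {"category": "D", "color": "maroon", "model": 2.5774000851,
        "plate": "0ME 062", "price": null, "year": 2008.0},
  "1": {"category": "C", "color": "black", "model": null,
        "plate": "UGV-266", "price": 39000.0}
}'''

# For the real file, read its contents from disk instead of using `sample`.
records = json.loads(sample)

# Count how many rows hold a null in each column.
null_counts = {}
for record in records.values():
    for column, value in record.items():
        if value is None:
            null_counts[column] = null_counts.get(column, 0) + 1

print(null_counts)
```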
That covers the vast majority of the functionality contained within datafuzz. Want to see more features? Check the backlog and feel free to follow steps for contributing!
If you want to see more examples of functionality, check out the examples in the repository.
For a longer explanation of the generators and their functionality, see the Generators documentation. For the same treatment for strategies, see the Strategies documentation.
For a review of I/O functionality and options (for both parsers and DataSet output), see the I/O Options documentation.