pytsv package

Submodules

pytsv.configs module

Configuration classes for this project.

class pytsv.configs.ConfigAggregateColumns[source]

Bases: Config

Parameters to select which columns to aggregate

aggregate_columns = []
class pytsv.configs.ConfigBucketNumber[source]

Bases: Config

Parameters to configure the bucket number for a histogram

bucket_number = 10
class pytsv.configs.ConfigCheckUnique[source]

Bases: Config

Configure whether or not to check a column for uniqueness

check_unique = True
class pytsv.configs.ConfigColumn[source]

Bases: Config

Parameters to select which column to work on

column = <pytconf.param.Unique object>
class pytsv.configs.ConfigColumns[source]

Bases: Config

Parameters to select which columns to use

columns = []
class pytsv.configs.ConfigCsvToTsv[source]

Bases: Config

Parameters to control the CSV to TSV conversion process

check_num_fields = True
replace_tabs_with_spaces = True
set_max = True
class pytsv.configs.ConfigFixTypes[source]

Bases: Config

Parameters to control which fixes to apply to a TSV file.

clean_edges = True
lower_case = True
remove_non_ascii = True
sub_trailing = True
class pytsv.configs.ConfigFloatingPoint[source]

Bases: Config

Parameters to select whether to work with floating point or not

floating_point = True
class pytsv.configs.ConfigInputFile[source]

Bases: Config

Parameters to specify input file

input_file = <pytconf.param.Unique object>
class pytsv.configs.ConfigInputFiles[source]

Bases: Config

Parameters to specify input files

input_files = []
class pytsv.configs.ConfigJoin[source]

Bases: Config

Parameters to configure a TSV join operation

hash_file = <pytconf.param.Unique object>
hash_key_column = <pytconf.param.Unique object>
hash_value_column = <pytconf.param.Unique object>
input_key_column = <pytconf.param.Unique object>
output_add_unknown = False
output_insert_column = <pytconf.param.Unique object>
class pytsv.configs.ConfigMajority[source]

Bases: Config

Parameters to configure the majority algorithm

input_first_column = <pytconf.param.Unique object>
input_multiplication_column = <pytconf.param.Unique object>
input_second_column = <pytconf.param.Unique object>
class pytsv.configs.ConfigMatchColumns[source]

Bases: Config

Parameters to select which columns to match by

match_columns = []
class pytsv.configs.ConfigNumFields[source]

Bases: Config

Parameter to configure the number of fields in a TSV file

num_fields = None
class pytsv.configs.ConfigOutputFile[source]

Bases: Config

Parameters to configure the output file

output_file = <pytconf.param.Unique object>
class pytsv.configs.ConfigParallel[source]

Bases: Config

Parameters to configure how things should run in parallel

jobs = 8
parallel = False
class pytsv.configs.ConfigPattern[source]

Bases: Config

Parameters to configure patterns of files generated

final_pattern = '{key}.tsv.gz'
pattern = '{key}_{i:04d}.tsv.gz'
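
The two defaults above are ordinary Python format strings. The following minimal sketch shows how such templates expand with `str.format`; the helper names `intermediate_name` and `final_name` are hypothetical and are not part of pytsv, and how pytsv itself chooses `key` and `i` is not documented here.

```python
# Default templates copied from ConfigPattern above.
pattern = "{key}_{i:04d}.tsv.gz"
final_pattern = "{key}.tsv.gz"


def intermediate_name(key: str, i: int) -> str:
    """Hypothetical helper: expand the per-part template.

    `{i:04d}` zero-pads the integer to four digits.
    """
    return pattern.format(key=key, i=i)


def final_name(key: str) -> str:
    """Hypothetical helper: expand the final template for a key."""
    return final_pattern.format(key=key)


print(intermediate_name("us", 3))  # us_0003.tsv.gz
print(final_name("us"))            # us.tsv.gz
```
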
class pytsv.configs.ConfigProgress[source]

Bases: Config

Parameters to control progress reporting

progress = True
class pytsv.configs.ConfigReplace[source]

Bases: Config

Configure whether you want replacements or not

replace = False
class pytsv.configs.ConfigSampleByColumnOld[source]

Bases: Config

Parameters to configure the old sample by column algorithm

hits_mode = False
class pytsv.configs.ConfigSampleByTwoColumns[source]

Bases: Config

Parameters for the sample by column command

group_column = <pytconf.param.Unique object>
class pytsv.configs.ConfigSampleColumn[source]

Bases: Config

Configuration options for sampling

sample_column = <pytconf.param.Unique object>
class pytsv.configs.ConfigSampleSize[source]

Bases: Config

Configure sample size

size = <pytconf.param.Unique object>
class pytsv.configs.ConfigTree[source]

Bases: Config

Parameters to configure the tree to show

child_column = <pytconf.param.Unique object>
parent_column = <pytconf.param.Unique object>
roots = []
class pytsv.configs.ConfigTsvReader[source]

Bases: Config

Parameters to configure a TSV reader object

check_non_ascii = False
validate_all_lines_same_number_of_fields = True
class pytsv.configs.ConfigWeightValue[source]

Bases: Config

Parameters to configure the weight and value columns

value_column = <pytconf.param.Unique object>
weight_column = <pytconf.param.Unique object>

pytsv.core module

class pytsv.core.TsvReader(filename: str, mode: str = 'rt', use_any_format: bool = True, validate_all_lines_same_number_of_fields: bool = True, num_fields: int | None = None, skip_comments: bool = False, check_non_ascii: bool = False, newline: str | None = '\n')[source]

Bases: object

close() → None[source]
class pytsv.core.TsvWriter(filename: str, mode: str = 'wt', throw_exceptions: bool = False, sanitize: bool = True, fields_to_clean: List[int] | None = None, clean_edges: bool = True, sub_trailing: bool = True, remove_non_ascii: bool = True, lower_case: bool = True, check_num_fields: bool = True, num_fields: int | None = None, convert_to_string: bool = True, do_gzip: bool = False, filename_detect: bool = True)[source]

Bases: object

close() → None[source]
write(input_list: Sequence[str]) → None[source]
pytsv.core.clean(text: str, clean_edges: bool = True, sub_trailing: bool = True, remove_non_ascii: bool = True, lower_case: bool = True) → str[source]
pytsv.core.do_aggregate(input_file_names: Iterable[str], match_columns: List[int], aggregate_columns: List[int], output_file_name: str, floating_point: bool) → None[source]

This function aggregates a bunch of input files by integers.

Parameters: input_file_names, match_columns, aggregate_columns, output_file_name, floating_point
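
The docstring does not spell out the aggregation semantics. As a rough illustration only, the sketch below assumes that rows are grouped by the values in `match_columns` and that the values in `aggregate_columns` are summed, with `floating_point` choosing between `float` and `int` arithmetic. The function `aggregate_rows` is hypothetical, works on in-memory rows rather than files, and is not the pytsv implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def aggregate_rows(
    rows: Iterable[Sequence[str]],
    match_columns: List[int],
    aggregate_columns: List[int],
    floating_point: bool,
) -> Dict[Tuple[str, ...], list]:
    """Hypothetical sketch: sum aggregate_columns per match_columns key."""
    # Assumption: floating_point selects float sums, otherwise int sums.
    convert = float if floating_point else int
    totals = defaultdict(lambda: [convert(0)] * len(aggregate_columns))
    for row in rows:
        # The grouping key is the tuple of values in the match columns.
        key = tuple(row[i] for i in match_columns)
        acc = totals[key]
        for pos, col in enumerate(aggregate_columns):
            acc[pos] += convert(row[col])
    return dict(totals)


rows = [["a", "1"], ["b", "2"], ["a", "3"]]
result = aggregate_rows(rows, [0], [1], floating_point=False)
print(result)  # {('a',): [4], ('b',): [2]}
```
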

pytsv.core.group_by(input_file_names: Iterable[str], group_by_columns: List[int], collect_columns: List[int], output_file_template: str) → List[str][source]
pytsv.core.is_ascii(s: str) → bool[source]
pytsv.core.write_data(data: List[List[str]], output_file_name: str) → None[source]
pytsv.core.write_dict(filename: str, d: Dict[str, str]) → None[source]
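
The reference above lists `is_ascii` and the `remove_non_ascii` option of `clean` without describing their behavior. The sketch below shows one plausible reading using only the standard library; `is_ascii_sketch` and `strip_non_ascii` are hypothetical names, and the actual pytsv functions may behave differently.

```python
def is_ascii_sketch(s: str) -> bool:
    """Hypothetical stand-in: True if every character is 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in s)


def strip_non_ascii(s: str) -> str:
    """Hypothetical stand-in: drop every character outside 7-bit ASCII."""
    return "".join(ch for ch in s if ord(ch) < 128)


print(is_ascii_sketch("abc"))    # True
print(is_ascii_sketch("café"))   # False
print(strip_non_ascii("café"))   # caf
```
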

pytsv.main module

class pytsv.main.JobInfo(check_not_ascii: bool, input_file: str, serial: int, progress: bool, pattern: str, columns: List[int])[source]

Bases: object

check_not_ascii: bool
columns: List[int]
input_file: str
pattern: str
progress: bool
serial: int
class pytsv.main.JobReturnValue(serial: int, files: Dict[str, str])[source]

Bases: object

files: Dict[str, str]
serial: int
class pytsv.main.MyEventTypes(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

Bases: Enum

key_found = 1
key_not_found = 0
unknown_added = 2
class pytsv.main.ParamsForJob[source]

Bases: object

pytsv.main.aggregate() → None[source]
pytsv.main.check() → None[source]

TODO: add the ability to report how many lines are bad and print their content

pytsv.main.check_columns_unique() → None[source]
pytsv.main.check_file(params_for_job: ParamsForJob) → bool[source]
pytsv.main.clean_by_field_num() → None[source]
pytsv.main.csv_to_tsv() → None[source]
pytsv.main.cut() → None[source]
pytsv.main.drop_duplicates_by_columns() → None[source]
pytsv.main.fix_columns() → None[source]
pytsv.main.histogram_by_column() → None[source]
pytsv.main.join() → None[source]
pytsv.main.lc() → None[source]
pytsv.main.main()[source]
pytsv.main.majority() → None[source]

This means that if x1 appears with y2 more often than with any other value in column Y, then (x1, y2) will be in the output and no other entry with x1 will appear.
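
The rule described above can be sketched in a few lines. This is an illustration under assumptions, not the pytsv implementation: `majority_pairs` is a hypothetical helper that takes in-memory (x, y) pairs, counts each pair once per occurrence, and ignores the `input_multiplication_column` setting of ConfigMajority, whose role is not documented here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def majority_pairs(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Hypothetical sketch: for each x keep the y it co-occurs with most."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    # most_common(1) yields the single most frequent y for this x.
    # Assumption: ties are broken by whichever y Counter returns first.
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}


pairs = [("x1", "y1"), ("x1", "y2"), ("x1", "y2"), ("x2", "y1")]
print(majority_pairs(pairs))  # {'x1': 'y2', 'x2': 'y1'}
```
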

pytsv.main.multiply() → None[source]
pytsv.main.process_single_file(job_info: JobInfo) → JobReturnValue[source]
pytsv.main.read() → None[source]
pytsv.main.remove_quotes() → None[source]
pytsv.main.sample_by_column() → None[source]

To run this you must supply a ‘value_column’ (the column which will be sampled) and a ‘weight_column’ which must be convertible to a floating point number.
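
A minimal sketch of weighted sampling along these lines, using the standard library's `random.choices`. It assumes sampling with replacement and in-memory rows; `weighted_sample` and its `seed` parameter are hypothetical, and the sampling strategy pytsv actually uses is not documented here.

```python
import random
from typing import List, Optional, Sequence


def weighted_sample(
    rows: Sequence[Sequence[str]],
    value_column: int,
    weight_column: int,
    size: int,
    seed: Optional[int] = None,
) -> List[str]:
    """Hypothetical sketch: draw `size` values, weighted by weight_column."""
    rng = random.Random(seed)
    values = [row[value_column] for row in rows]
    # The weight column must be convertible to float, as stated above.
    weights = [float(row[weight_column]) for row in rows]
    # Assumption: sampling is done with replacement.
    return rng.choices(values, weights=weights, k=size)


rows = [["a", "0"], ["b", "1.5"]]
# "a" has weight 0, so it can never be drawn.
print(weighted_sample(rows, 0, 1, size=5, seed=1))
```
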

pytsv.main.sample_by_column_old() → None[source]
pytsv.main.sample_by_two_columns() → None[source]
pytsv.main.split_by_columns() → None[source]
pytsv.main.split_by_columns_parallel() → None[source]
pytsv.main.sum_columns() → None[source]
pytsv.main.tree() → None[source]

You can also see only parts of the tree
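
As an illustration of how a tree might be built from a parent column and a child column, with an optional list of roots to show only part of it, consider the sketch below. `render_tree` is hypothetical, assumes the data is acyclic, and uses plain indentation; the output format of the real `tree` command is not documented here.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Sequence


def render_tree(
    rows: Iterable[Sequence[str]],
    parent_column: int,
    child_column: int,
    roots: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical sketch: return indented lines, one per tree node."""
    children = defaultdict(list)
    all_children = set()
    for row in rows:
        children[row[parent_column]].append(row[child_column])
        all_children.add(row[child_column])
    if not roots:
        # Assumption: with no roots given, start from every node
        # that never appears as a child.
        roots = [p for p in children if p not in all_children]
    lines: List[str] = []

    def walk(node: str, depth: int) -> None:
        lines.append("  " * depth + node)
        for child in children.get(node, []):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


rows = [["a", "b"], ["a", "c"], ["b", "d"]]
print("\n".join(render_tree(rows, 0, 1)))                # whole tree
print("\n".join(render_tree(rows, 0, 1, roots=["b"])))   # subtree under b
```
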

pytsv.main.tsv_to_csv() → None[source]

pytsv.static module

Version information which can be consumed from within the module.

Module contents