Seq2Pat Public API
- class sequential.seq2pat.Attribute(values: List[list])[source]
Bases:
object
- gap()[source]
The Gap Constraint
Restricts the difference between every two consecutive event values in a pattern (a usage sketch follows this class).
- span()[source]
The Span Constraint
Restricts the difference between the maximum and the minimum value in a pattern.
- property values
Values
The values of the attribute
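Taken together, a minimal usage sketch (the data and bounds below are illustrative; the Seq2Pat constructor, add_constraint, and get_patterns calls are documented later in this section):

```python
from sequential.seq2pat import Seq2Pat, Attribute

# One list of events per sequence (illustrative data)
sequences = [["A", "A", "B", "A", "D"],
             ["C", "B", "A"],
             ["C", "A", "C", "D"]]

# One attribute value per event, matching the shape of sequences
price = Attribute(values=[[5, 5, 3, 8, 2],
                          [1, 3, 3],
                          [4, 5, 2, 1]])

seq2pat = Seq2Pat(sequences=sequences)

# Gap: consecutive events in a pattern differ in price by at most 5
seq2pat.add_constraint(price.gap() <= 5)

# Span: max price minus min price within a pattern is at most 6
seq2pat.add_constraint(price.span() <= 6)

# Patterns that occur in at least two sequences
patterns = seq2pat.get_patterns(min_frequency=2)
```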
- class sequential.seq2pat.Seq2Pat(sequences: List[list], max_span: Optional[int] = 10, batch_size=None, discount_factor=0.2, n_jobs=2, seed=123456)[source]
Bases:
object
Seq2Pat: Sequence-to-Pattern Generation Library
- sequences
A list of sequences, each a list of events. The event values can be all strings or all integers.
- Type:
List[list]
- max_span
The value of a built-in maximum span constraint on the number of items in a mined pattern, max_span=10 by default (10 items). This prevents regular users from running into scaling issues when the data contains long sequences but no constraints are set to keep the mining efficient and practical. Power users can drop this constraint by setting it to None, or increase the maximum span as system resources allow. A sketch of these settings follows this entry.
- Type:
Optional[int]
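For example, a sketch of the three settings (the sequences are an illustrative placeholder):

```python
from sequential.seq2pat import Seq2Pat

sequences = [["A", "B", "C", "D"], ["B", "C", "D"]]  # placeholder data

# Default: mined patterns are capped at 10 items
seq2pat_default = Seq2Pat(sequences=sequences)

# Power users: raise the cap when resources allow...
seq2pat_wide = Seq2Pat(sequences=sequences, max_span=20)

# ...or drop the built-in constraint entirely
seq2pat_uncapped = Seq2Pat(sequences=sequences, max_span=None)
```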
- batch_size
The batch_size parameter is None by default, in which case the mining task runs on the entire dataset in a single thread. When batch_size is set, Seq2Pat instead runs on batches of sequences to improve scalability. Each batch contains batch_size sequences drawn as a random sample of the entire set; this is achieved by shuffling the entire set uniformly before splitting the sequences sequentially into batches. A mining task runs on each batch with a reduced minimum row count (min_frequency) threshold; see the description of the discount_factor parameter for how min_frequency is reduced. The resulting patterns are aggregated across batches by summing their occurrences, and finally the original minimum row count threshold is applied to the aggregated patterns. When batch_size is None but the dataset has more than _Constants.dynamic_batch_threshold sequences, batch_size is dynamically set to a built-in default to ease the mining task on the large dataset. Power users can set batch_size, discount_factor, and n_jobs explicitly for further runtime benefit (a batching sketch follows the Note below).
- Type:
Optional[int]
- discount_factor
A discount factor is used to reduce the minimum row count (min_frequency) threshold when Seq2Pat is applied to a batch. The new threshold for a batch is max(min_frequency * discount_factor, 1.0/batch_size), where an integer min_frequency is first converted to a ratio via min_frequency/number_total_sequences. Final results are based on aggregating the patterns from each batch and summing their occurrences. In theory, batched results can differ from non-batched results, but a small discount_factor makes this chance minimal, and in practice we observe the same results as running on the entire set. A small discount_factor is thus recommended; discount_factor=0.2 by default. A worked example of the reduced threshold follows this entry.
- Type:
float
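As a worked example of the reduced per-batch threshold (all numbers are illustrative):

```python
# 100k sequences mined with min_frequency=1000, batch_size=10000,
# discount_factor=0.2
n_sequences = 100_000
min_frequency = 1000
batch_size = 10_000
discount_factor = 0.2

# An integer min_frequency is first converted to a ratio
ratio = min_frequency / n_sequences                               # 0.01

# Reduced threshold applied to each batch, per the formula above
batch_threshold = max(ratio * discount_factor, 1.0 / batch_size)  # 0.002

# Equivalently, a pattern must occur in 0.002 * 10000 = 20 rows of
# a batch to survive into the aggregation step
rows_per_batch = batch_threshold * batch_size                     # 20.0
```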
- n_jobs
n_jobs defines the number of processes used when mining tasks run on batches in parallel (n_jobs=2 by default). If -1, all CPUs are used; if -2, all CPUs but one are used.
- Type:
int
Note
For power users interested in the designed batch-processing behavior, the following experimental results summary may be useful.
We analyzed batch_size vs. discount_factor vs. runtime on a dataset with 100k sequences.
The results show that as batch_size increases, e.g., from 10000 to 100000, runtime increases, while the mined patterns remain identical to mining the entire set in a single thread.
batch_size=10000 yields the largest runtime benefit compared to running on the entire set.
On the same 100k sequences, we fixed batch_size=10000 and varied discount_factor from 0.1 to 1.0. Runtime decreases as discount_factor increases, and only at discount_factor=1.0 does batch mode miss some patterns compared to running on the entire set. We recommend the default discount_factor=0.2 for robust results, at some expense in runtime.
In an even larger test on ~1M sequences with batch_size=10000, discount_factor=0.8, and n_jobs=8, batch mode saves 60% of the runtime compared to running on the entire set, while the resulting patterns from the two runs are identical.
When the data size is small, e.g., a few thousand sequences, batch mode offers no benefit. We therefore recommend batch mode only when the data contains at least hundreds of thousands of sequences. A configuration sketch follows this note.
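Putting these recommendations together, a batching setup on a large dataset might look as follows (a sketch; the placeholder data and parameter choices are illustrative):

```python
from sequential.seq2pat import Seq2Pat

# Placeholder for a dataset with hundreds of thousands of sequences
large_sequences = [["A", "B", "C"], ["B", "C", "D"]] * 100_000

seq2pat = Seq2Pat(sequences=large_sequences,
                  batch_size=10_000,    # mine in batches of 10k sequences
                  discount_factor=0.2,  # robust default, per the note above
                  n_jobs=-1)            # use all CPUs

# Patterns that occur in at least 1% of all sequences
patterns = seq2pat.get_patterns(min_frequency=0.01)
```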
- add_constraint(constraint: _BaseConstraint) → _BaseConstraint [source]
Adds the given constraint to the constraint store.
- Parameters:
constraint (_BaseConstraint) – A constraint on an attribute object
- Returns:
The constraint handle.
- Raises:
TypeError – If the constraint is already defined on this attribute.
ValueError – If there is a mismatch in the lengths of sequences and their attributes.
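A sketch of the return value and both failure modes (illustrative data; the failing calls are left commented out):

```python
from sequential.seq2pat import Seq2Pat, Attribute

seq2pat = Seq2Pat(sequences=[["A", "B", "A"],
                             ["B", "A"]])

price = Attribute(values=[[5, 3, 8],
                          [2, 4]])

# Returns the constraint handle
gap_handle = seq2pat.add_constraint(price.gap() <= 5)

# TypeError: a gap constraint is already defined on this attribute
# seq2pat.add_constraint(price.gap() <= 2)

# ValueError: two sequences, but three attribute value lists
bad = Attribute(values=[[1], [2], [3]])
# seq2pat.add_constraint(bad.span() <= 4)
```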
- get_patterns(min_frequency: Union[int, float]) → List[list] [source]
Performs the mining operation enforcing the constraints and returns the most frequent patterns.
- Parameters:
min_frequency (Num) – If int, represents the minimum number of sequences (rows) in which a pattern should occur. If float, must be in (0.0, 1.0] and represents the minimum fraction of sequences (rows) in which a pattern should occur.
- Returns:
Each inner list represents a frequent pattern in the form [event_1, event_2, event_3, … event_n, frequency]. The last element is the frequency of the pattern. Patterns are sorted by decreasing frequency, i.e., the most frequent pattern comes first.
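For example (a sketch; the data is illustrative and the expected output is indicative, since ties in frequency may be ordered differently):

```python
from sequential.seq2pat import Seq2Pat

seq2pat = Seq2Pat(sequences=[["A", "A", "B", "A", "D"],
                             ["C", "B", "A"],
                             ["C", "A", "C", "D"]])

# int form: a pattern must occur in at least 2 sequences (rows)
patterns = seq2pat.get_patterns(min_frequency=2)

# float form: a pattern must occur in at least 60% of the sequences,
# i.e., 2 of the 3 sequences here
patterns_ratio = seq2pat.get_patterns(min_frequency=0.6)

# Each inner list is [event_1, ..., event_n, frequency], most frequent
# first, e.g. [['A', 'D', 2], ['B', 'A', 2], ['C', 'A', 2], ...]
```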