discretizer – Discretization algorithms
Pebl currently includes only one discretization implementation, though more may be added. Discretization and other data pre-processing steps can have a large impact on the final results.
pebl.discretizer.maximum_entropy_discretize(indata, includevars=None, excludevars=[], numbins=3)
Performs a maximum-entropy discretization of data in-place.
Requirements for this implementation:
- Try to make all bins equal-sized (this maximizes the entropy).
- If datum x==y in the original dataset, then disc(x)==disc(y). For example, all datapoints with value 3.245 discretize to 1, even if that violates the first requirement.
- The number of bins reflects only the non-missing data.
Example:
input: [3,7,4,4,4,5]
output: [0,1,0,0,0,1]
Note that all 4s discretize to 0, which makes bin sizes unequal.
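To make the entropy trade-off concrete, the short sketch below compares the Shannon entropy of the bin labels from the first example with that of a perfectly balanced two-bin split. The helper name `bin_entropy` is hypothetical and is not part of Pebl; it only illustrates why unequal bins mean lower entropy.

```python
from collections import Counter
from math import log2

def bin_entropy(labels):
    """Shannon entropy (in bits) of a sequence of bin labels.

    Hypothetical helper for illustration only; not part of Pebl.
    """
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A perfectly balanced split of six points into two bins.
balanced = bin_entropy([0, 0, 0, 1, 1, 1])

# The labels from the example above: bins of size 4 and 2.
actual = bin_entropy([0, 1, 0, 0, 0, 1])

print(balanced, actual)
```

The balanced split reaches the two-bin maximum of 1 bit, while the example's labels score lower (about 0.92 bits). That gap is the price of keeping all equal values in the same bin.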
Example:
input: [1,2,3,4,2,1,2,3,1,x,x,x]
output: [0,1,2,2,1,0,1,2,0,0,0,0]
Note that the missing data (‘x’) is placed in the same bin as the value 0.0, which is bin 0 in this example.
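The requirements and both examples above can be reproduced with a small pure-Python sketch. This is not Pebl's implementation, only one way to satisfy the stated rules. The function name `max_entropy_bins` and its `missing` and `missing_fill` parameters are hypothetical. It assumes that missing entries are marked with `None` and are binned as if they held the value 0.0, following the note above.

```python
from bisect import bisect_left

def max_entropy_bins(values, numbins=3, missing=None, missing_fill=0.0):
    """Discretize `values` into at most `numbins` roughly equal-sized bins.

    Hypothetical helper, not part of Pebl. Entries that are `missing` are
    ignored when choosing cutpoints and are binned as `missing_fill`
    (an assumption based on the note about missing data).
    """
    # Cutpoints are chosen from the non-missing data only.
    present = sorted(v for v in values if v is not missing)
    if not present:
        return [0] * len(values)

    binsize = max(1, len(present) // numbins)
    last = len(present) - 1
    # Each cutpoint is the largest value in one of the first numbins-1
    # equal-sized slices of the sorted data.
    edges = [present[min(binsize * b - 1, last)] for b in range(1, numbins)]

    # bisect_left sends every value equal to a cutpoint to the lower bin,
    # so equal inputs always receive the same label.
    return [
        bisect_left(edges, missing_fill if v is missing else v)
        for v in values
    ]

x = None  # stand-in for a missing observation
print(max_entropy_bins([3, 7, 4, 4, 4, 5], numbins=2))
# [0, 1, 0, 0, 0, 1]
print(max_entropy_bins([1, 2, 3, 4, 2, 1, 2, 3, 1, x, x, x], numbins=3))
# [0, 1, 2, 2, 1, 0, 1, 2, 0, 0, 0, 0]
```

Unlike `maximum_entropy_discretize`, which works in-place on a Pebl dataset, this sketch returns a new list of labels for a single variable.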