PageSplitter

class PageSplitter.PageSplitter(boundaryRegex='\\s', inputCol=None, maximumPageLength=5000, minimumPageLength=4500, outputCol=None)
Bases: mmlspark.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer
Parameters:
- boundaryRegex (str) – regular expression that marks the boundaries used to split text into words (default: \s)
- inputCol (str) – The name of the input column
- maximumPageLength (int) – the maximum number of characters in a page (default: 5000)
- minimumPageLength (int) – the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
- outputCol (str) – The name of the output column (default: [self.uid]_output)
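
The interaction between `boundaryRegex`, `minimumPageLength`, and `maximumPageLength` can be sketched in plain Python. This is one plausible reading of the parameter descriptions above, not the mmlspark implementation: the function name `split_pages`, its snake_case arguments, and the exact cut rule (cut after the last boundary found between the minimum and maximum length, otherwise cut at the maximum) are all assumptions made for illustration.

```python
import re


def split_pages(text, boundary_regex=r"\s",
                maximum_page_length=5000, minimum_page_length=4500):
    """Hypothetical sketch of page splitting; NOT the mmlspark implementation.

    Assumed rule: a page is cut after the last boundary match that starts
    between the minimum and maximum length. If no boundary is found there,
    the page is cut at exactly the maximum length.
    """
    if not 0 < minimum_page_length <= maximum_page_length:
        raise ValueError("expected 0 < minimum_page_length <= maximum_page_length")

    boundary = re.compile(boundary_regex)
    pages = []
    start = 0
    while len(text) - start > maximum_page_length:
        window_start = start + minimum_page_length
        window_end = start + maximum_page_length
        cut = window_end  # fallback: hard split at the maximum length
        # Keep the end of the last boundary match inside the window.
        for match in boundary.finditer(text, window_start, window_end):
            cut = match.end()
        pages.append(text[start:cut])
        start = cut
    if start < len(text):
        pages.append(text[start:])  # remainder fits in a single page
    return pages


pages = split_pages("aaaa bbbb cccc dddd",
                    maximum_page_length=10, minimum_page_length=5)
# → ['aaaa bbbb ', 'cccc dddd']
```

Under this reading, a larger gap between the two lengths gives the splitter more room to find a word boundary, while a gap of zero forces fixed-size pages.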

getMaximumPageLength()
Returns: the maximum number of characters in a page (default: 5000)
Return type: int

getMinimumPageLength()
Returns: the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
Return type: int

getOutputCol()
Returns: The name of the output column (default: [self.uid]_output)
Return type: str

setBoundaryRegex(value)
Parameters: value (str) – the new boundaryRegex: regular expression that marks the boundaries used to split text into words (default: \s)
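
To make the default pattern concrete, the snippet below uses Python's `re` module to show which positions `\s` treats as boundaries, and why the default appears as `'\\s'` in the signatures. The sample string and variable names are illustrative only. The transformer itself runs on the JVM, so the pattern is presumably interpreted by Java's regex engine, whose treatment of character classes can differ from Python's for some Unicode input.

```python
import re

# In a normal Python string literal the backslash must be escaped, so
# '\\s' and the raw string r'\s' are the same two-character pattern.
default_pattern = "\\s"
assert default_pattern == r"\s"

sample = "one two\tthree\nfour"  # illustrative input

# Offsets of every whitespace character: candidate word boundaries.
boundary_offsets = [m.start() for m in re.finditer(default_pattern, sample)]
# → [3, 7, 13]

# A different pattern changes where splits may occur, e.g. sentence enders.
sentence_offsets = [m.start() for m in re.finditer(r"[.!?]", "Hi. Go! Ok?")]
# → [2, 6, 10]
```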

setMaximumPageLength(value)
Parameters: value (int) – the new maximumPageLength: the maximum number of characters in a page (default: 5000)

setMinimumPageLength(value)
Parameters: value (int) – the new minimumPageLength: the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)

setOutputCol(value)
Parameters: value (str) – the new outputCol: The name of the output column (default: [self.uid]_output)

setParams(boundaryRegex='\\s', inputCol=None, maximumPageLength=5000, minimumPageLength=4500, outputCol=None)
Sets the parameters (keyword-only).
Parameters:
- boundaryRegex (str) – regular expression that marks the boundaries used to split text into words (default: \s)
- inputCol (str) – The name of the input column
- maximumPageLength (int) – the maximum number of characters in a page (default: 5000)
- minimumPageLength (int) – the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
- outputCol (str) – The name of the output column (default: [self.uid]_output)
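
A usage sketch follows. It assumes that the class is importable as `mmlspark.PageSplitter.PageSplitter`, that `setParams` returns the transformer itself (as PySpark wrappers conventionally do), and that the input DataFrame has a string column named `text`. The helper `make_page_splitter`, the dictionary `PAGE_SPLITTER_PARAMS`, and the column names are hypothetical. The import is deferred so the snippet can be loaded without Spark installed.

```python
# Keyword arguments mirroring the documented defaults; column names are
# hypothetical and should match your own DataFrame.
PAGE_SPLITTER_PARAMS = {
    "boundaryRegex": "\\s",
    "inputCol": "text",
    "maximumPageLength": 5000,
    "minimumPageLength": 4500,
    "outputCol": "pages",
}


def make_page_splitter(params=None):
    """Build a configured PageSplitter (requires mmlspark and a Spark session)."""
    # Assumed import path, inferred from the class name shown in this page.
    from mmlspark.PageSplitter import PageSplitter

    splitter = PageSplitter()
    # setParams is keyword-only, so the dictionary is unpacked with **.
    splitter.setParams(**(params or PAGE_SPLITTER_PARAMS))
    return splitter


# With a running Spark session and a DataFrame `df` holding a `text` column:
#   paged_df = make_page_splitter().transform(df)
```

Because `PageSplitter` derives from `pyspark.ml.wrapper.JavaTransformer`, it is applied with `transform` and can also be placed in a `pyspark.ml.Pipeline` as a stage.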