PageSplitter
-
class PageSplitter.PageSplitter(boundaryRegex='\s', inputCol=None, maximumPageLength=5000, minimumPageLength=4500, outputCol=None)
Bases: mmlspark.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer
Parameters:
- boundaryRegex (str) – regular expression describing how to split into words (default: \s)
- inputCol (str) – The name of the input column
- maximumPageLength (int) – the maximum number of characters to be in a page (default: 5000)
- minimumPageLength (int) – the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
- outputCol (str) – The name of the output column (default: [self.uid]_output)
-
getMaximumPageLength()
Returns: the maximum number of characters to be in a page (default: 5000)
Return type: int
-
getMinimumPageLength()
Returns: the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
Return type: int
-
getOutputCol()
Returns: The name of the output column (default: [self.uid]_output)
Return type: str
-
setBoundaryRegex(value)
Parameters: boundaryRegex (str) – regular expression describing how to split into words (default: \s)
-
setMaximumPageLength(value)
Parameters: maximumPageLength (int) – the maximum number of characters to be in a page (default: 5000)
-
setMinimumPageLength(value)
Parameters: minimumPageLength (int) – the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
-
setOutputCol(value)
Parameters: outputCol (str) – The name of the output column (default: [self.uid]_output)
-
setParams(boundaryRegex='\\s', inputCol=None, maximumPageLength=5000, minimumPageLength=4500, outputCol=None)
Set the (keyword-only) parameters.
Parameters:
- boundaryRegex (str) – regular expression describing how to split into words (default: \s)
- inputCol (str) – The name of the input column
- maximumPageLength (int) – the maximum number of characters to be in a page (default: 5000)
- minimumPageLength (int) – the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
- outputCol (str) – The name of the output column (default: [self.uid]_output)
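
The parameter descriptions above do not spell out how boundaryRegex, minimumPageLength, and maximumPageLength interact. The following is a minimal pure-Python sketch of one plausible reading: each page is at most maximumPageLength characters, and a page is cut at the last boundary match that still leaves it at least minimumPageLength characters long, falling back to a hard cut when no such boundary exists. The function name `split_into_pages` and its snake_case arguments are hypothetical, and this is not the library's implementation; the actual transformer may treat edge cases differently.

```python
import re


def split_into_pages(text, boundary_regex=r"\s",
                     maximum_page_length=5000, minimum_page_length=4500):
    """Illustrative sketch only (assumed semantics, not the MMLSpark code)."""
    if text is None:
        return None

    pattern = re.compile(boundary_regex)
    pages = []
    start = 0

    # Keep cutting pages while the remaining text is longer than one page.
    while len(text) - start > maximum_page_length:
        window_end = start + maximum_page_length
        cut = window_end  # hard cut if no usable boundary is found

        # Look for boundary matches that keep the page length within
        # [minimum_page_length, maximum_page_length); keep the last one.
        for match in pattern.finditer(text, start + minimum_page_length, window_end):
            if match.start() > start:  # guard against an empty page
                cut = match.start()

        pages.append(text[start:cut])
        start = cut

    # Whatever is left fits on a single page.
    pages.append(text[start:])
    return pages


if __name__ == "__main__":
    sample = "aaaa bbbb cccc dddd"
    for page in split_into_pages(sample, maximum_page_length=10, minimum_page_length=6):
        print(repr(page))
```

In this sketch the boundary character itself starts the next page, so concatenating the pages reproduces the input. A narrow gap between the minimum and maximum (such as the defaults 4500 and 5000) keeps pages close to uniform in size while still leaving room to find a word boundary.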