PageSplitter

class PageSplitter.PageSplitter(boundaryRegex='\\s', inputCol=None, maximumPageLength=5000, minimumPageLength=4500, outputCol=None)

Bases: mmlspark.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaTransformer

Parameters:
  • boundaryRegex (str) – regular expression used to split the text into words (default: \s)
  • inputCol (str) – The name of the input column
  • maximumPageLength (int) – the maximum number of characters in a page (default: 5000)
  • minimumPageLength (int) – the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
  • outputCol (str) – The name of the output column (default: [self.uid]_output)
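
The interaction between boundaryRegex, minimumPageLength and maximumPageLength can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper name split_into_pages and its greedy strategy are assumptions made for illustration only. The sketch cuts a page at the last boundary match found once the page holds at least the minimum number of characters, and falls back to a hard cut at the maximum length when no boundary is available.

```python
import re


def split_into_pages(text, boundary_regex=r"\s",
                     maximum_page_length=5000, minimum_page_length=4500):
    """Hypothetical helper: greedy paging that prefers to break at a boundary.

    Illustrative only; the real PageSplitter may choose different cut points.
    """
    pattern = re.compile(boundary_regex)
    pages = []
    start = 0
    while len(text) - start > maximum_page_length:
        window_end = start + maximum_page_length
        cut = None
        # Search for boundaries only after the page has reached the minimum length.
        for match in pattern.finditer(text, start + minimum_page_length, window_end):
            if match.end() > start:  # ignore empty matches that make no progress
                cut = match.end()    # keep the last usable boundary in the window
        if cut is None:
            cut = window_end         # no boundary found: hard cut, possibly mid-word
        pages.append(text[start:cut])
        start = cut
    pages.append(text[start:])       # remaining text fits in a single page
    return pages


pages = split_into_pages("aaaa bbbb cccc dddd",
                         maximum_page_length=10, minimum_page_length=5)
print(pages)  # ['aaaa bbbb ', 'cccc dddd']
```

In this sketch the boundary character stays at the end of the page it closes, so joining the pages reproduces the input text. A wider gap between the minimum and maximum lengths gives the splitter more room to find a word boundary before it has to cut mid-word.
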
getBoundaryRegex()
Returns: regular expression used to split the text into words (default: \s)
Return type: str
getInputCol()
Returns: The name of the input column
Return type: str
static getJavaPackage()

Returns the package name as a string.

getMaximumPageLength()
Returns: the maximum number of characters in a page (default: 5000)
Return type: int
getMinimumPageLength()
Returns: the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
Return type: int
getOutputCol()
Returns: The name of the output column (default: [self.uid]_output)
Return type: str
classmethod read()

Returns an MLReader instance for this class.

setBoundaryRegex(value)
Parameters: boundaryRegex (str) – regular expression used to split the text into words (default: \s)
setInputCol(value)
Parameters: inputCol (str) – The name of the input column
setMaximumPageLength(value)
Parameters: maximumPageLength (int) – the maximum number of characters in a page (default: 5000)
setMinimumPageLength(value)
Parameters: minimumPageLength (int) – the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
setOutputCol(value)
Parameters: outputCol (str) – The name of the output column (default: [self.uid]_output)
setParams(boundaryRegex='\\s', inputCol=None, maximumPageLength=5000, minimumPageLength=4500, outputCol=None)

Set the (keyword-only) parameters.

Parameters:
  • boundaryRegex (str) – regular expression used to split the text into words (default: \s)
  • inputCol (str) – The name of the input column
  • maximumPageLength (int) – the maximum number of characters in a page (default: 5000)
  • minimumPageLength (int) – the minimum number of characters to have on a page in order to preserve word boundaries (default: 4500)
  • outputCol (str) – The name of the output column (default: [self.uid]_output)
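
The boundaryRegex default is written '\\s' in the signature because the backslash is escaped inside an ordinary Python string literal; the value itself is the two-character pattern \s. The check below uses only the standard re module and a made-up sample string. It shows Python's regex behaviour, which may differ in details from the engine the transformer uses internally.

```python
import re

escaped = '\\s'   # as printed in the setParams signature
raw = r'\s'       # the same pattern written as a raw string

# Both literals denote the same two-character string: a backslash followed by "s".
same_pattern = (escaped == raw) and len(escaped) == 2

# Splitting on a single whitespace character separates the words.
words = re.split(escaped, "one two\tthree")
print(same_pattern, words)  # True ['one', 'two', 'three']
```
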