TextFeaturizer

class TextFeaturizer.TextFeaturizer(binary=False, caseSensitiveStopWords=False, defaultStopWordLanguage='english', inputCol=None, minDocFreq=1, minTokenLength=0, nGramLength=2, numFeatures=262144, outputCol=None, stopWords=None, toLowercase=True, tokenizerGaps=True, tokenizerPattern='\\s+', useIDF=True, useNGram=False, useStopWordsRemover=False, useTokenizer=True)[source]

Bases: mmlspark.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Featurize text: tokenize the input, optionally remove stop words and enumerate n-grams, hash the resulting terms into a fixed-size term-frequency vector, and optionally rescale it by inverse document frequency (IDF). Each stage is toggled by the corresponding use* parameter below; a usage sketch follows the parameter list.

Parameters:
  • binary (bool) – If true, all nonnegative word counts are set to 1 (default: false)
  • caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)
  • defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to "custom" to use the stopWords input (default: english)
  • inputCol (str) – The name of the input column
  • minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)
  • minTokenLength (int) – Minimum token length, >= 0 (default: 0)
  • nGramLength (int) – The size of the n-grams (default: 2)
  • numFeatures (int) – The number of features to hash each document to (default: 262144)
  • outputCol (str) – The name of the output column (default: [self.uid]_output)
  • stopWords (str) – The words to be filtered out
  • toLowercase (bool) – Indicates whether to convert all characters to lowercase before tokenizing (default: true)
  • tokenizerGaps (bool) – Indicates whether the regex splits on gaps (true) or matches tokens (false) (default: true)
  • tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
  • useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)
  • useNGram (bool) – Whether to enumerate n-grams (default: false)
  • useStopWordsRemover (bool) – Whether to remove stop words from the tokenized data (default: false)
  • useTokenizer (bool) – Whether to tokenize the input (default: true)
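
A minimal usage sketch, assuming a running SparkSession, mmlspark on the classpath, and the module layout shown above (mmlspark.TextFeaturizer); the column names and data are illustrative:

    from pyspark.sql import SparkSession
    from mmlspark.TextFeaturizer import TextFeaturizer

    spark = SparkSession.builder.getOrCreate()

    # Toy corpus; the column name "text" is arbitrary.
    df = spark.createDataFrame([("Hello world",), ("Hello Spark world",)], ["text"])

    # With the defaults: tokenize, hash to term frequencies, rescale by IDF.
    featurizer = TextFeaturizer(inputCol="text", outputCol="features")
    model = featurizer.fit(df)
    model.transform(df).select("features").show(truncate=False)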
getBinary()[source]
Returns: If true, all nonnegative word counts are set to 1 (default: false)
Return type: bool
getCaseSensitiveStopWords()[source]
Returns: Whether to do a case-sensitive comparison over the stop words (default: false)
Return type: bool
getDefaultStopWordLanguage()[source]
Returns: Which language to use for the stop-word remover; set this to "custom" to use the stopWords input (default: english)
Return type: str
getInputCol()[source]
Returns: The name of the input column
Return type: str
static getJavaPackage()[source]

Returns the package name string.

getMinDocFreq()[source]
Returns: The minimum number of documents in which a term should appear (default: 1)
Return type: int
getMinTokenLength()[source]
Returns: Minimum token length, >= 0 (default: 0)
Return type: int
getNGramLength()[source]
Returns: The size of the n-grams (default: 2)
Return type: int
getNumFeatures()[source]
Returns: The number of features to hash each document to (default: 262144)
Return type: int
getOutputCol()[source]
Returns: The name of the output column (default: [self.uid]_output)
Return type: str
getStopWords()[source]
Returns: The words to be filtered out
Return type: str
getToLowercase()[source]
Returns: Indicates whether to convert all characters to lowercase before tokenizing (default: true)
Return type: bool
getTokenizerGaps()[source]
Returns: Indicates whether the regex splits on gaps (true) or matches tokens (false) (default: true)
Return type: bool
getTokenizerPattern()[source]
Returns: Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
Return type: str
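
Conceptually, tokenizerGaps chooses between splitting on the pattern and matching the pattern. A plain-Python sketch of the two modes (illustrative only, not the Spark implementation):

    import re

    text = "one  two three"
    # gaps=True: the pattern describes delimiters, so split on it.
    re.split(r"\s+", text)    # ['one', 'two', 'three']
    # gaps=False: the pattern describes tokens, so collect the matches.
    re.findall(r"\w+", text)  # ['one', 'two', 'three']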
getUseIDF()[source]
Returns: Whether to scale the term frequencies by IDF (default: true)
Return type: bool
getUseNGram()[source]
Returns: Whether to enumerate n-grams (default: false)
Return type: bool
getUseStopWordsRemover()[source]
Returns: Whether to remove stop words from the tokenized data (default: false)
Return type: bool
getUseTokenizer()[source]
Returns: Whether to tokenize the input (default: true)
Return type: bool
classmethod read()[source]

Returns an MLReader instance for this class.
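
Because the class mixes in JavaMLReadable and JavaMLWritable, an estimator can be round-tripped to disk. Continuing the earlier sketch (the path is illustrative):

    # Persist the (unfitted) estimator, then load it back.
    featurizer.save("/tmp/textFeaturizer")
    restored = TextFeaturizer.load("/tmp/textFeaturizer")  # load() uses read() internally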

setBinary(value)[source]
Parameters: binary (bool) – If true, all nonnegative word counts are set to 1 (default: false)
setCaseSensitiveStopWords(value)[source]
Parameters: caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)
setDefaultStopWordLanguage(value)[source]
Parameters: defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to "custom" to use the stopWords input (default: english)
setInputCol(value)[source]
Parameters: inputCol (str) – The name of the input column
setMinDocFreq(value)[source]
Parameters: minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)
setMinTokenLength(value)[source]
Parameters: minTokenLength (int) – Minimum token length, >= 0 (default: 0)
setNGramLength(value)[source]
Parameters: nGramLength (int) – The size of the n-grams (default: 2)
setNumFeatures(value)[source]
Parameters: numFeatures (int) – The number of features to hash each document to (default: 262144)
setOutputCol(value)[source]
Parameters: outputCol (str) – The name of the output column (default: [self.uid]_output)
setParams(binary=False, caseSensitiveStopWords=False, defaultStopWordLanguage='english', inputCol=None, minDocFreq=1, minTokenLength=0, nGramLength=2, numFeatures=262144, outputCol=None, stopWords=None, toLowercase=True, tokenizerGaps=True, tokenizerPattern='\\s+', useIDF=True, useNGram=False, useStopWordsRemover=False, useTokenizer=True)[source]

Set the (keyword-only) parameters.

Parameters:
  • binary (bool) – If true, all nonnegative word counts are set to 1 (default: false)
  • caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)
  • defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to "custom" to use the stopWords input (default: english)
  • inputCol (str) – The name of the input column
  • minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)
  • minTokenLength (int) – Minimum token length, >= 0 (default: 0)
  • nGramLength (int) – The size of the n-grams (default: 2)
  • numFeatures (int) – The number of features to hash each document to (default: 262144)
  • outputCol (str) – The name of the output column (default: [self.uid]_output)
  • stopWords (str) – The words to be filtered out
  • toLowercase (bool) – Indicates whether to convert all characters to lowercase before tokenizing (default: true)
  • tokenizerGaps (bool) – Indicates whether the regex splits on gaps (true) or matches tokens (false) (default: true)
  • tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
  • useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)
  • useNGram (bool) – Whether to enumerate n-grams (default: false)
  • useStopWordsRemover (bool) – Whether to remove stop words from the tokenized data (default: false)
  • useTokenizer (bool) – Whether to tokenize the input (default: true)
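
setParams mirrors the constructor keywords; a sketch of reconfiguring an existing instance (column names illustrative):

    featurizer = TextFeaturizer()
    featurizer.setParams(inputCol="text", outputCol="features",
                         useNGram=True, nGramLength=3, useIDF=False)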
setStopWords(value)[source]
Parameters: stopWords (str) – The words to be filtered out
setToLowercase(value)[source]
Parameters: toLowercase (bool) – Indicates whether to convert all characters to lowercase before tokenizing (default: true)
setTokenizerGaps(value)[source]
Parameters: tokenizerGaps (bool) – Indicates whether the regex splits on gaps (true) or matches tokens (false) (default: true)
setTokenizerPattern(value)[source]
Parameters: tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
setUseIDF(value)[source]
Parameters: useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)
setUseNGram(value)[source]
Parameters: useNGram (bool) – Whether to enumerate n-grams (default: false)
setUseStopWordsRemover(value)[source]
Parameters: useStopWordsRemover (bool) – Whether to remove stop words from the tokenized data (default: false)
setUseTokenizer(value)[source]
Parameters: useTokenizer (bool) – Whether to tokenize the input (default: true)
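
The individual setters can typically be chained, assuming each returns the estimator itself (the usual pattern for generated pyspark wrappers):

    featurizer = (TextFeaturizer()
                  .setInputCol("text")
                  .setOutputCol("features")
                  .setUseStopWordsRemover(True)
                  .setMinTokenLength(2))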
class TextFeaturizer.TextFeaturizerModel(java_model=None)[source]

Bases: mmlspark.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by TextFeaturizer.

This class is left empty on purpose. All necessary methods are exposed through inheritance.
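
In practice a TextFeaturizerModel comes from TextFeaturizer.fit and is used through the inherited transform and save/load methods. Continuing the earlier sketch (paths illustrative):

    model = featurizer.fit(df)              # returns a TextFeaturizerModel
    featurized = model.transform(df)        # appends the output vector column
    model.save("/tmp/textFeaturizerModel")  # via JavaMLWritable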

static getJavaPackage()[source]

Returns the package name string.

classmethod read()[source]

Returns an MLReader instance for this class.