TextFeaturizer

class TextFeaturizer.TextFeaturizer(binary=False, caseSensitiveStopWords=False, defaultStopWordLanguage='english', inputCol=None, minDocFreq=1, minTokenLength=0, nGramLength=2, numFeatures=262144, outputCol=None, stopWords=None, toLowercase=True, tokenizerGaps=True, tokenizerPattern='\\s+', useIDF=True, useNGram=False, useStopWordsRemover=False, useTokenizer=True)

Bases: mmlspark.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Featurize text. A usage sketch follows the parameter list below.
Parameters: - binary (bool) – If true, all nonnegative word counts are set to 1 (default: false)
- caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)
- defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to custom to use the stopWords input (default: english)
- inputCol (str) – The name of the input column
- minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)
- minTokenLength (int) – Minimum token length, >= 0 (default: 0)
- nGramLength (int) – The size of the n-grams (default: 2)
- numFeatures (int) – The number of features to hash each document to (default: 262144)
- outputCol (str) – The name of the output column (default: [self.uid]_output)
- stopWords (str) – The words to be filtered out
- toLowercase (bool) – Whether to convert all characters to lowercase before tokenizing (default: true)
- tokenizerGaps (bool) – Whether the regex splits on gaps (true) or matches tokens (false) (default: true)
- tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
- useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)
- useNGram (bool) – Whether to enumerate n-grams (default: false)
- useStopWordsRemover (bool) – Whether to remove stop words from the tokenized data (default: false)
- useTokenizer (bool) – Whether to tokenize the input (default: true)
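
A minimal end-to-end sketch, assuming a running SparkSession and an mmlspark installation; the DataFrame contents and column names are hypothetical, and the import path may vary between mmlspark versions:

    from pyspark.sql import SparkSession
    from mmlspark import TextFeaturizer  # import path assumed

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical two-row corpus with a single text column.
    df = spark.createDataFrame(
        [("Hello world",), ("Featurizing text with mmlspark",)],
        ["text"],
    )

    # Tokenize, hash to 262144 features, and rescale by IDF (all defaults).
    featurizer = TextFeaturizer(inputCol="text", outputCol="features")
    model = featurizer.fit(df)  # yields a TextFeaturizerModel
    model.transform(df).show(truncate=False)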
- getBinary() Returns: If true, all nonnegative word counts are set to 1 (default: false) Return type: bool
- getCaseSensitiveStopWords() Returns: Whether to do a case-sensitive comparison over the stop words (default: false) Return type: bool
- getDefaultStopWordLanguage() Returns: Which language to use for the stop-word remover; set this to custom to use the stopWords input (default: english) Return type: str
- getMinDocFreq() Returns: The minimum number of documents in which a term should appear (default: 1) Return type: int
- getNumFeatures() Returns: The number of features to hash each document to (default: 262144) Return type: int
- getOutputCol() Returns: The name of the output column (default: [self.uid]_output) Return type: str
- getToLowercase() Returns: Whether to convert all characters to lowercase before tokenizing (default: true) Return type: bool
- getTokenizerGaps() Returns: Whether the regex splits on gaps (true) or matches tokens (false) (default: true) Return type: bool
- getTokenizerPattern() Returns: Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+) Return type: str
- getUseIDF() Returns: Whether to scale the term frequencies by IDF (default: true) Return type: bool
- getUseStopWordsRemover() Returns: Whether to remove stop words from the tokenized data (default: false) Return type: bool
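
Each getter reads back the current value of its parameter. A quick sketch, reusing the featurizer constructed in the example above:

    featurizer.getNumFeatures()       # 262144 unless overridden
    featurizer.getUseIDF()            # True by default
    featurizer.getTokenizerPattern()  # the regex '\\s+' by default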
- setBinary(value) Parameters: binary (bool) – If true, all nonnegative word counts are set to 1 (default: false)
- setCaseSensitiveStopWords(value) Parameters: caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)
- setDefaultStopWordLanguage(value) Parameters: defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to custom to use the stopWords input (default: english)
- setMinDocFreq(value) Parameters: minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)
- setMinTokenLength(value) Parameters: minTokenLength (int) – Minimum token length, >= 0 (default: 0)
- setNumFeatures(value) Parameters: numFeatures (int) – The number of features to hash each document to (default: 262144)
- setOutputCol(value) Parameters: outputCol (str) – The name of the output column (default: [self.uid]_output)
- setParams(binary=False, caseSensitiveStopWords=False, defaultStopWordLanguage='english', inputCol=None, minDocFreq=1, minTokenLength=0, nGramLength=2, numFeatures=262144, outputCol=None, stopWords=None, toLowercase=True, tokenizerGaps=True, tokenizerPattern='\\s+', useIDF=True, useNGram=False, useStopWordsRemover=False, useTokenizer=True) Set the (keyword-only) parameters; a usage sketch follows this parameter list.
Parameters: - binary (bool) – If true, all nonnegative word counts are set to 1 (default: false)
- caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)
- defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to custom to use the stopWords input (default: english)
- inputCol (str) – The name of the input column
- minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)
- minTokenLength (int) – Minimum token length, >= 0 (default: 0)
- nGramLength (int) – The size of the n-grams (default: 2)
- numFeatures (int) – The number of features to hash each document to (default: 262144)
- outputCol (str) – The name of the output column (default: [self.uid]_output)
- stopWords (str) – The words to be filtered out
- toLowercase (bool) – Whether to convert all characters to lowercase before tokenizing (default: true)
- tokenizerGaps (bool) – Whether the regex splits on gaps (true) or matches tokens (false) (default: true)
- tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
- useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)
- useNGram (bool) – Whether to enumerate n-grams (default: false)
- useStopWordsRemover (bool) – Whether to remove stop words from the tokenized data (default: false)
- useTokenizer (bool) – Whether to tokenize the input (default: true)
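
A short sketch of resetting several keyword-only parameters at once on the featurizer from the earlier example; the chosen values are illustrative, not recommendations:

    featurizer.setParams(
        useNGram=True,             # enumerate n-grams instead of single tokens
        nGramLength=3,             # trigrams
        useStopWordsRemover=True,  # drop stop words before hashing
    )

The individual setters below cover the same parameters one at a time.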
- setToLowercase(value) Parameters: toLowercase (bool) – Whether to convert all characters to lowercase before tokenizing (default: true)
- setTokenizerGaps(value) Parameters: tokenizerGaps (bool) – Whether the regex splits on gaps (true) or matches tokens (false) (default: true)
- setTokenizerPattern(value) Parameters: tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
- setUseIDF(value) Parameters: useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)
- setUseNGram(value) Parameters: useNGram (bool) – Whether to enumerate n-grams (default: false)
class TextFeaturizer.TextFeaturizerModel(java_model=None)

Bases: mmlspark.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by TextFeaturizer. This class is left empty on purpose; all necessary methods are exposed through inheritance.
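
Because TextFeaturizerModel inherits transform from JavaModel and persistence from JavaMLWritable/JavaMLReadable, a fitted model can be applied and round-tripped to disk. A sketch continuing the earlier example; the save path is hypothetical and the import path assumed:

    from mmlspark import TextFeaturizerModel  # import path assumed

    transformed = model.transform(df)  # adds the 'features' column

    model.write().overwrite().save("/tmp/textFeaturizerModel")  # hypothetical path
    loaded = TextFeaturizerModel.load("/tmp/textFeaturizerModel")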