TextFeaturizer
class TextFeaturizer.TextFeaturizer(binary=False, caseSensitiveStopWords=False, defaultStopWordLanguage='english', inputCol=None, minDocFreq=1, minTokenLength=0, nGramLength=2, numFeatures=262144, outputCol=None, stopWords=None, toLowercase=True, tokenizerGaps=True, tokenizerPattern='\\s+', useIDF=True, useNGram=False, useStopWordsRemover=False, useTokenizer=True)

Bases: mmlspark.Utils.ComplexParamsMixin, pyspark.ml.util.JavaMLReadable, pyspark.ml.util.JavaMLWritable, pyspark.ml.wrapper.JavaEstimator

Featurize text.
Parameters:
- binary (bool) – If true, all nonzero word counts are set to 1 (default: false)
- caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)
- defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to "custom" to use the stopWords input (default: english)
- inputCol (str) – The name of the input column
- minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)
- minTokenLength (int) – Minimum token length, >= 0 (default: 0)
- nGramLength (int) – The size of the n-grams (default: 2)
- numFeatures (int) – The number of features to hash each document to (default: 262144)
- outputCol (str) – The name of the output column (default: [self.uid]_output)
- stopWords (str) – The words to be filtered out
- toLowercase (bool) – Whether to convert all characters to lowercase before tokenizing (default: true)
- tokenizerGaps (bool) – Whether the regex splits on gaps (true) or matches tokens (false) (default: true)
- tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
- useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)
- useNGram (bool) – Whether to enumerate n-grams (default: false)
- useStopWordsRemover (bool) – Whether to remove stop words from tokenized data (default: false)
- useTokenizer (bool) – Whether to tokenize the input (default: true)
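Taken together, the parameters describe a staged pipeline: tokenize, optionally remove stop words, optionally form n-grams, then hash each term into a fixed-size count vector (the term-frequency half of TF-IDF). The sketch below is a plain-Python analogy of those stages, not the actual Spark implementation; the `featurize` helper and its behavior on hash collisions are illustrative assumptions.

```python
import re

def featurize(text, tokenizer_pattern=r"\s+", tokenizer_gaps=True,
              min_token_length=0, to_lowercase=True, stop_words=None,
              n_gram_length=2, use_ngram=False,
              num_features=262144, binary=False):
    """Illustrative sketch: tokenize -> filter -> (n-grams) -> hashing TF."""
    if to_lowercase:
        text = text.lower()
    # tokenizerGaps=True: the pattern matches delimiters between tokens;
    # tokenizerGaps=False: the pattern matches the tokens themselves.
    if tokenizer_gaps:
        tokens = re.split(tokenizer_pattern, text)
    else:
        tokens = re.findall(tokenizer_pattern, text)
    tokens = [t for t in tokens if t and len(t) >= min_token_length]
    if stop_words:
        tokens = [t for t in tokens if t not in stop_words]
    if use_ngram:
        tokens = [" ".join(tokens[i:i + n_gram_length])
                  for i in range(len(tokens) - n_gram_length + 1)]
    # Hashing trick: each term is bucketed into one of num_features slots,
    # so the vector size is fixed regardless of vocabulary size.
    counts = {}
    for t in tokens:
        idx = hash(t) % num_features
        counts[idx] = 1 if binary else counts.get(idx, 0) + 1
    return counts
```

With `binary=True`, repeated occurrences of a term contribute only a presence flag; with `binary=False`, the bucket holds the raw count, which `useIDF` would then rescale in the real pipeline.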
getBinary()
Returns: If true, all nonzero word counts are set to 1 (default: false)
Return type: bool

getCaseSensitiveStopWords()
Returns: Whether to do a case-sensitive comparison over the stop words (default: false)
Return type: bool

getDefaultStopWordLanguage()
Returns: Which language to use for the stop-word remover; set this to "custom" to use the stopWords input (default: english)
Return type: str

getMinDocFreq()
Returns: The minimum number of documents in which a term should appear (default: 1)
Return type: int

getNumFeatures()
Returns: The number of features to hash each document to (default: 262144)
Return type: int

getOutputCol()
Returns: The name of the output column (default: [self.uid]_output)
Return type: str

getToLowercase()
Returns: Whether to convert all characters to lowercase before tokenizing (default: true)
Return type: bool

getTokenizerGaps()
Returns: Whether the regex splits on gaps (true) or matches tokens (false) (default: true)
Return type: bool

getTokenizerPattern()
Returns: Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
Return type: str

getUseIDF()
Returns: Whether to scale the term frequencies by IDF (default: true)
Return type: bool

getUseStopWordsRemover()
Returns: Whether to remove stop words from tokenized data (default: false)
Return type: bool
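The tokenizerGaps flag flips how tokenizerPattern is interpreted. Python's re module behaves analogously (the JVM implementation uses java.util.regex, so this is only an analogy; the sample text and patterns are illustrative):

```python
import re

text = "foo, bar  baz"

# gaps=True: the pattern describes the delimiters between tokens
gaps_tokens = re.split(r"[\s,]+", text)

# gaps=False: the pattern describes the tokens themselves
match_tokens = re.findall(r"\w+", text)

# Both strategies recover the same three tokens here,
# but they diverge when delimiters and tokens overlap.
```

Note that with gaps=True, a pattern like the default \s+ leaves punctuation attached to tokens ("foo," above would survive splitting on whitespace alone), which is why a delimiter class covering punctuation, or a token-matching pattern with gaps=False, is often preferable.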
setBinary(value)
Parameters: binary (bool) – If true, all nonzero word counts are set to 1 (default: false)

setCaseSensitiveStopWords(value)
Parameters: caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)

setDefaultStopWordLanguage(value)
Parameters: defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to "custom" to use the stopWords input (default: english)

setMinDocFreq(value)
Parameters: minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)

setMinTokenLength(value)
Parameters: minTokenLength (int) – Minimum token length, >= 0 (default: 0)

setNumFeatures(value)
Parameters: numFeatures (int) – The number of features to hash each document to (default: 262144)

setOutputCol(value)
Parameters: outputCol (str) – The name of the output column (default: [self.uid]_output)

setParams(binary=False, caseSensitiveStopWords=False, defaultStopWordLanguage='english', inputCol=None, minDocFreq=1, minTokenLength=0, nGramLength=2, numFeatures=262144, outputCol=None, stopWords=None, toLowercase=True, tokenizerGaps=True, tokenizerPattern='\\s+', useIDF=True, useNGram=False, useStopWordsRemover=False, useTokenizer=True)
Set the (keyword-only) parameters.
Parameters:
- binary (bool) – If true, all nonzero word counts are set to 1 (default: false)
- caseSensitiveStopWords (bool) – Whether to do a case-sensitive comparison over the stop words (default: false)
- defaultStopWordLanguage (str) – Which language to use for the stop-word remover; set this to "custom" to use the stopWords input (default: english)
- inputCol (str) – The name of the input column
- minDocFreq (int) – The minimum number of documents in which a term should appear (default: 1)
- minTokenLength (int) – Minimum token length, >= 0 (default: 0)
- nGramLength (int) – The size of the n-grams (default: 2)
- numFeatures (int) – The number of features to hash each document to (default: 262144)
- outputCol (str) – The name of the output column (default: [self.uid]_output)
- stopWords (str) – The words to be filtered out
- toLowercase (bool) – Whether to convert all characters to lowercase before tokenizing (default: true)
- tokenizerGaps (bool) – Whether the regex splits on gaps (true) or matches tokens (false) (default: true)
- tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)
- useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)
- useNGram (bool) – Whether to enumerate n-grams (default: false)
- useStopWordsRemover (bool) – Whether to remove stop words from tokenized data (default: false)
- useTokenizer (bool) – Whether to tokenize the input (default: true)

setToLowercase(value)
Parameters: toLowercase (bool) – Whether to convert all characters to lowercase before tokenizing (default: true)

setTokenizerGaps(value)
Parameters: tokenizerGaps (bool) – Whether the regex splits on gaps (true) or matches tokens (false) (default: true)

setTokenizerPattern(value)
Parameters: tokenizerPattern (str) – Regex pattern used to match delimiters if gaps is true, or tokens if gaps is false (default: \s+)

setUseIDF(value)
Parameters: useIDF (bool) – Whether to scale the term frequencies by IDF (default: true)

setUseNGram(value)
Parameters: useNGram (bool) – Whether to enumerate n-grams (default: false)
class TextFeaturizer.TextFeaturizerModel(java_model=None)

Bases: mmlspark.Utils.ComplexParamsMixin, pyspark.ml.wrapper.JavaModel, pyspark.ml.util.JavaMLWritable, pyspark.ml.util.JavaMLReadable

Model fitted by TextFeaturizer. This class is intentionally left empty; all necessary methods are exposed through inheritance.
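As a JavaEstimator, TextFeaturizer follows the standard Spark ML fit/transform pattern: fit() on a DataFrame returns a TextFeaturizerModel whose transform() appends the feature column. The sketch below assumes a running SparkSession with the MMLSpark package on the classpath; the import path, DataFrame contents, and column names are illustrative, not prescribed by this reference.

```python
from pyspark.sql import SparkSession
from mmlspark import TextFeaturizer  # import path may vary by MMLSpark version

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Hello world",), ("Hello Spark",)], ["text"])

# Configure the estimator; unset parameters keep their documented defaults.
featurizer = (TextFeaturizer()
              .setInputCol("text")
              .setOutputCol("features")
              .setNumFeatures(262144)
              .setUseIDF(True))

model = featurizer.fit(df)              # returns a TextFeaturizerModel
model.transform(df).show(truncate=False)  # adds the "features" vector column
```

Because the model inherits JavaMLWritable/JavaMLReadable, it can be persisted with model.write().save(path) and reloaded later, like any other Spark ML model.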