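Natural Language Processing with Groovy, OpenNLP, CoreNLP, Nlp4j, Datumbox, Smile, Spark NLP, DJL and TensorFlow

Author: Paul King
Published: 2022-08-07 07:34AM

Natural Language Processing is certainly a large and sometimes complex topic with many aspects. Some of those aspects deserve entire blogs in their own right. For this blog, we will briefly look at a few simple use cases illustrating where you might be able to use NLP technology in your own project.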

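Language Detection

Knowing what language some text represents can be a critical first step to subsequent processing. Let's look at how to predict the language using a pre-built model and Apache OpenNLP. Here, ResourceHelper is a utility class used to download and cache the model. The first run may take a little while as it downloads the model. Subsequent runs should be fast. Here we are using a well-known model referenced in the OpenNLP documentation.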

def helper = new ResourceHelper('https://dlcdn.apache.org/opennlp/models/langdetect/1.8.3/')
def model = new LanguageDetectorModel(helper.load('langdetect-183'))
def detector = new LanguageDetectorME(model)

[ spa: 'Bienvenido a Madrid', fra: 'Bienvenue à Paris',
  dan: 'Velkommen til København', bul: 'Добре дошли в София'
].each { k, v ->
    assert detector.predictLanguage(v).lang == k
}

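The LanguageDetectorME class lets us predict the language. In general, the predictor may not be accurate on small samples of text, but it was good enough for our example. We've used the language code as the key in our map and we check it against the predicted language.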

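A more complex scenario is training your own model. Let's look at how to do that with Datumbox. Datumbox has a pre-trained model zoo, but its language detection model didn't seem to work well for the small snippets in the next example, so we'll train our own model. First, we'll define our datasets: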

def datasets = [
    English: getClass().classLoader.getResource("training.language.en.txt").toURI(),
    French: getClass().classLoader.getResource("training.language.fr.txt").toURI(),
    German: getClass().classLoader.getResource("training.language.de.txt").toURI(),
    Spanish: getClass().classLoader.getResource("training.language.es.txt").toURI(),
    Indonesian: getClass().classLoader.getResource("training.language.id.txt").toURI()
]

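The de training dataset comes from the Datumbox examples. The training datasets for the other languages are from Kaggle.

We set up the training parameters needed by our algorithm: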

def trainingParams = new TextClassifier.TrainingParameters(
    numericalScalerTrainingParameters: null,
    featureSelectorTrainingParametersList: [new ChisquareSelect.TrainingParameters()],
    textExtractorParameters: new NgramsExtractor.Parameters(),
    modelerTrainingParameters: new MultinomialNaiveBayes.TrainingParameters()
)

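We'll use a Naïve Bayes model with Chi-square feature selection.

Next, we create our algorithm, train it with our training dataset, and then validate it against the training dataset. We'd normally split the data into training and test sets to get a more reliable estimate of the model's accuracy. But for simplicity, while still illustrating the API, we'll train and validate with our entire dataset: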

def config = Configuration.configuration
def classifier = MLBuilder.create(trainingParams, config)
classifier.fit(datasets)
def metrics = classifier.validate(datasets)
println "Classifier Accuracy (using training data): $metrics.accuracy"

When run, we see the following output:

Classifier Accuracy (using training data): 0.9975609756097561
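
As noted above, a held-out test set gives a more honest accuracy figure than validating on the training data. A minimal sketch of such a split in plain Groovy, independent of Datumbox, might look like the following. The splitData helper, the 80/20 ratio and the fixed seed are illustrative assumptions, not part of the original example.

```groovy
// Hypothetical helper: shuffle the samples reproducibly, then cut into two lists
def splitData(List items, double trainFraction = 0.8d, long seed = 42L) {
    def shuffled = new ArrayList(items)
    Collections.shuffle(shuffled, new Random(seed))
    int cut = (int) Math.round(shuffled.size() * trainFraction)
    [train: shuffled.take(cut), test: shuffled.drop(cut)]
}

// Stand-in data; in practice these would be the lines of a training file
def lines = (1..10).collect { "sample sentence $it".toString() }
def parts = splitData(lines)
println "train: ${parts.train.size()}, test: ${parts.test.size()}"
```

The model would then be fitted on the train portion only and validated on the test portion, so the reported accuracy reflects text the model has not seen.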

Our test dataset will consist of some hard-coded illustrative phrases. Let's use our model to predict the language for each phrase:

[   'Bienvenido a Madrid', 'Bienvenue à Paris', 'Welcome to London',
    'Willkommen in Berlin', 'Selamat Datang di Jakarta'
].each { txt ->
    def r = classifier.predict(txt)
    def predicted = r.YPredicted
    def probability = sprintf '%4.2f', r.YPredictedProbabilities.get(predicted)
    println "Classifying: '$txt',  Predicted: $predicted,  Probability: $probability"
}

When run, it has this output:

Classifying: 'Bienvenido a Madrid',  Predicted: Spanish,  Probability: 0.83
Classifying: 'Bienvenue à Paris',  Predicted: French,  Probability: 0.71
Classifying: 'Welcome to London',  Predicted: English,  Probability: 1.00
Classifying: 'Willkommen in Berlin',  Predicted: German,  Probability: 0.84
Classifying: 'Selamat Datang di Jakarta',  Predicted: Indonesian,  Probability: 1.00

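Given that these phrases are very short, it is nice that we get them all correct, and the probabilities all seem reasonable for this scenario.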

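Parts of Speech

Parts of speech (POS) analysers examine each part of a sentence (the words and potentially punctuation) in terms of the role it plays in the sentence. A typical analyser will assign or annotate words with their role, identifying nouns, verbs, adjectives and so forth. This can be a key early step for tools like the voice assistants from Amazon, Apple and Google.

We'll start by looking at a perhaps lesser-known library, Nlp4j, before looking at some others. In fact, there are multiple Nlp4j libraries. We'll use the one from nlp4j.org, which seems to be the most active and recently updated.

This library uses the Stanford CoreNLP library under the covers for its English POS functionality. The library has the concept of documents, and annotators that work on documents. Once annotated, we can print out all of the discovered words and their annotations: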

var doc = new DefaultDocument()
doc.putAttribute('text', 'I eat sushi with chopsticks.')
var ann = new StanfordPosAnnotator()
ann.setProperty('target', 'text')
ann.annotate(doc)
println doc.keywords.collect{  k -> "${k.facet - 'word.'}(${k.str})" }.join(' ')

When run, we see the following output:

PRP(I) VBP(eat) NN(sushi) IN(with) NNS(chopsticks) .(.)

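The annotations, also known as tags or facets, for this example are as follows:

PRP   personal pronoun
VBP   present tense verb
NN    noun, singular
IN    preposition
NNS   noun, plural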

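The documentation for the libraries we are using gives a more complete list of such annotations.

A nice aspect of this library is its support for other languages, in particular Japanese. The code is very similar but uses a different annotator: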

doc = new DefaultDocument()
doc.putAttribute('text', '私は学校に行きました。')
ann = new KuromojiAnnotator()
ann.setProperty('target', 'text')
ann.annotate(doc)
println doc.keywords.collect{ k -> "${k.facet}(${k.str})" }.join(' ')

When run, we see the following output:

名詞(私) 助詞(は) 名詞(学校) 助詞(に) 動詞(行き) 助動詞(まし) 助動詞(た) 記号(。)

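Before progressing, we'll highlight the result visualization capabilities of the GroovyConsole. This feature lets us write a small Groovy script which converts results to any Swing component. In our case, we'll convert lists of annotated strings to a JLabel component containing HTML, including colored annotation boxes. The details aren't included here but can be found in the repo. We need to copy that file into our ~/.groovy folder and then enable script visualization as shown here:

[Image: How to enable visualization in the groovyconsole]

Then we should see the following when running the script:

[Image: natural language processing in the groovyconsole with visualization]

The visualization is purely optional but adds a nice touch. If you use Groovy in notebook environments like Jupyter/BeakerX, there might be visualization tooling in those environments too.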

Let's look at a larger example using the Smile library.

First, the sentences that we'll examine:

def sentences = [
    'Paul has two sisters, Maree and Christine.',
    'No wise fish would go anywhere without a porpoise',
    'His bark was much worse than his bite',
    'Turn on the lights to the main bedroom',
    "Light 'em all up",
    'Make it dark downstairs'
]

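A couple of those sentences might seem a little strange, but they were selected to show off quite a few of the different POS tags.

Smile has a tokenizer class which splits a sentence into words. It handles numerous cases like contractions and abbreviations ("e.g.", "'tis", "won't"). Smile also has a POS class based on the hidden Markov model, and that class uses a built-in model. Here is our code using those classes: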

def tokenizer = new SimpleTokenizer(true)
sentences.each {
    def tokens = Arrays.stream(tokenizer.split(it)).toArray(String[]::new)
    def tags = HMMPOSTagger.default.tag(tokens)*.toString()
    println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}

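We run the tokenizer for each sentence. Each token is then displayed directly or, if it has a tag, together with its tag.

Running the script gives this visualization: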

Paul/NNP has/VBZ two/CD sisters/NNS Maree/NNP and/CC Christine/NNP
No/DT wise/JJ fish/NN would/MD go/VB anywhere/RB without/IN a/DT porpoise/NN
His/PRP$ bark/NN was/VBD much/RB worse/JJR than/IN his/PRP$ bite/NN
Turn/VB on/IN the/DT lights/NNS to/TO the/DT main/JJ bedroom/NN
Light/NNP 'em/PRP all/RB up/RB
Make/VB it/PRP dark/JJ downstairs/NN

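[Note: the scripts in the repo just print to stdout, which is perfect when using the command line or an IDE. The visualization in the GroovyConsole kicks in only for the actual result. So, if you are following along at home and want to use the GroovyConsole, change the each to collect and remove the println, and you should be good for visualization.]

The OpenNLP code is very similar: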

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each {
    String[] tokens = tokenizer.tokenize(it)
    def posTagger = new POSTaggerME('en')
    String[] tags = posTagger.tag(tokens)
    println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}

OpenNLP allows you to supply your own POS model but downloads a default one if none is specified.

When the script is run, it has this visualization:

Paul/PROPN has/VERB two/NUM sisters/NOUN ,/PUNCT Maree/PROPN and/CCONJ Christine/PROPN ./PUNCT
No/DET wise/ADJ fish/NOUN would/AUX go/VERB anywhere/ADV without/ADP a/DET porpoise/NOUN
His/PRON bark/NOUN was/AUX much/ADV worse/ADJ than/ADP his/PRON bite/NOUN
Turn/VERB on/ADP the/DET lights/NOUN to/ADP the/DET main/ADJ bedroom/NOUN
Light/NOUN '/PUNCT em/NOUN all/ADV up/ADP
Make/VERB it/PRON dark/ADJ downstairs/NOUN

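The observant reader may have noticed some slight differences in the tags used by this library. They are essentially the same but use slightly different names. This is something to be aware of when swapping between POS libraries or models. Make sure you look up the documentation for the library/model you are using to understand the available tag types.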
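
When comparing results across libraries, it can help to normalise tags to one scheme. The sketch below maps some of the fine-grained tags from the Smile output onto the coarser tags from the OpenNLP output. The table is an illustrative assumption covering only tags seen above, not an official conversion, and a lookup alone can't reproduce every difference: "was" is VBD in one output but AUX in the other.

```groovy
// Illustrative, incomplete mapping from fine-grained to coarse tags
def fineToCoarse = [
    NNP: 'PROPN', NN: 'NOUN', NNS: 'NOUN',
    VB: 'VERB', VBZ: 'VERB', MD: 'AUX',
    JJ: 'ADJ', JJR: 'ADJ', RB: 'ADV',
    DT: 'DET', IN: 'ADP', CD: 'NUM', CC: 'CCONJ',
    PRP: 'PRON', 'PRP$': 'PRON'
]
// Unknown tags fall back to 'X' rather than failing
def coarse = { String tag -> fineToCoarse.getOrDefault(tag, 'X') }

// Fine-grained tags for "No wise fish would go anywhere without a porpoise"
def smileTags = ['DT', 'JJ', 'NN', 'MD', 'VB', 'RB', 'IN', 'DT', 'NN']
println smileTags.collect(coarse).join(' ')
```

For this sentence the mapped tags line up with the coarse tags shown for the same sentence in the second visualization.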

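Entity Detection

Named entity recognition (NER) seeks to identify and classify named entities in text. Categories of interest might be persons, organizations, locations, dates and so on. It is another technology used in many fields of NLP.

We'll start with our sentences to analyse: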

String[] sentences = [
    "A commit by Daniel Sun on December 6, 2020 improved Groovy 4's language integrated query.",
    "A commit by Daniel on Sun., December 6, 2020 improved Groovy 4's language integrated query.",
    'The Groovy in Action book by Dierk Koenig et. al. is a bargain at $50, or indeed any price.',
    'The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.',
    'I saw Ms. May Smith waving to June Jones.',
    'The parcel was passed from May to June.',
    'The Mona Lisa by Leonardo da Vinci has been on display in the Louvre, Paris since 1797.'
]

Next, we'll use some well-known models. We'll focus on the person, money, date, time and location models:

def base = 'http://opennlp.sourceforge.net/models-1.5'
def modelNames = ['person', 'money', 'date', 'time', 'location']
def finders = modelNames.collect { model ->
    new NameFinderME(DownloadUtil.downloadModel(new URL("$base/en-ner-${model}.bin"), TokenNameFinderModel))
}

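We'll now tokenize each sentence, run every finder over the tokens, and splice the detected entities back into the sentence text: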

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each { sentence ->
    String[] tokens = tokenizer.tokenize(sentence)
    Span[] tokenSpans = tokenizer.tokenizePos(sentence)
    def entityText = [:]
    def entityPos = [:]
    finders.indices.each {fi ->
        // could be made smarter by looking at probabilities and overlapping spans
        Span[] spans = finders[fi].find(tokens)
        spans.each{span ->
            def se = span.start..<span.end
            def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end)
            entityPos[span.start] = pos
            entityText[span.start] = "$span.type(${sentence[pos]})"
        }
    }
    entityPos.keySet().sort().reverseEach {
        def pos = entityPos[it]
        def (from, to) = [pos.from, pos.to + 1]
        sentence = sentence[0..<from] + entityText[it] + sentence[to..-1]
    }
    println sentence
}

When visualized, this shows the following:

A commit by person(Daniel Sun) on date(December 6, 2020) improved Groovy 4's language integrated query.
A commit by person(Daniel) on Sun., date(December 6, 2020) improved Groovy 4's language integrated query.
The Groovy in Action book by person(Dierk Koenig) et. al. is a bargain at money($50), or indeed any price.
The conference wrapped up date(yesterday) at time(5:30 p.m.) in location(Copenhagen), location(Denmark).
I saw Ms. person(May Smith) waving to person(June Jones).
The parcel was passed from date(May to June).
The Mona Lisa by person(Leonardo da Vinci) has been on display in the Louvre, location(Paris) date(since 1797).

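We can see here that most examples have been categorized as we might expect. We'd have to improve our model for it to do a better job on the "May to June" example.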
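
One detail of the code above is worth isolating: entities are spliced into the sentence from the last one back to the first, so that replacing one entity never shifts the character offsets of the entities still to be processed. A minimal, library-free sketch of that idea follows. The annotate helper and the map-based entity format are hypothetical, chosen only for illustration.

```groovy
// Hypothetical helper: entities are maps with a type plus begin (inclusive)
// and end (exclusive) character offsets into the text
String annotate(String text, List<Map> entities) {
    def result = text
    // work from the last entity back to the first so earlier offsets stay valid
    entities.sort(false) { -it.begin }.each { e ->
        def label = "${e.type}(${text.substring(e.begin, e.end)})"
        result = result.substring(0, e.begin) + label + result.substring(e.end)
    }
    result
}

def annotated = annotate('Paul visited Paris.', [
    [type: 'person', begin: 0, end: 4],
    [type: 'location', begin: 13, end: 18]
])
println annotated
```

Processing the entities front to back would instead require adjusting every later offset by the length of each inserted label.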

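Scaling Entity Detection

We can also run our named entity detection algorithms on platforms like Spark NLP, which adds NLP functionality to Apache Spark. We'll use glove_100d embeddings and the onto_100 NER model.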

var assembler = new DocumentAssembler(inputCol: 'text', outputCol: 'document', cleanupMode: 'disabled')

var tokenizer = new Tokenizer(inputCols: ['document'] as String[], outputCol: 'token')

var embeddings = WordEmbeddingsModel.pretrained('glove_100d').tap {
    inputCols = ['document', 'token'] as String[]
    outputCol = 'embeddings'
}

var model = NerDLModel.pretrained('onto_100', 'en').tap {
    inputCols = ['document', 'token', 'embeddings'] as String[]
    outputCol ='ner'
}

var converter = new NerConverter(inputCols: ['document', 'token', 'ner'] as String[], outputCol: 'ner_chunk')

var pipeline = new Pipeline(stages: [assembler, tokenizer, embeddings, model, converter] as PipelineStage[])

var spark = SparkNLP.start(false, false, '16G', '', '', '')

var text = [
    "The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."
]
var data = spark.createDataset(text, Encoders.STRING()).toDF('text')

var pipelineModel = pipeline.fit(data)

var transformed = pipelineModel.transform(data)
transformed.show()

use(SparkCategory) {
    transformed.collectAsList().each { row ->
        def res =  row.text
        def chunks = row.ner_chunk.reverseIterator()
        while (chunks.hasNext()) {
            def chunk = chunks.next()
            int begin = chunk.begin
            int end = chunk.end
            def entity = chunk.metadata.get('entity').get()
            res = res[0..<begin] + "$entity($chunk.result)" + res[end<..-1]
        }
        println res
    }
}

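We won't go into all of the details here. In summary, the code sets up a pipeline that transforms our input sentences, via a series of steps, into chunks, where each chunk corresponds to a detected entity. Each chunk has a start and end position, and an associated tag type.

This may not seem very different from our earlier examples, but if we had large volumes of data and were running in a large cluster, the work could be spread across the worker nodes in the cluster.

Here we have used a utility SparkCategory class which makes accessing the information in Spark Row instances a little nicer in terms of Groovy shorthand syntax. We can use row.text instead of row.get(row.fieldIndex('text')). Here is the code for this utility class: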

class SparkCategory {
    static get(Row r, String field) { r.get(r.fieldIndex(field)) }
}

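If doing more than this simple example, the use of SparkCategory could be made implicit through various standard Groovy techniques.

When we run our script, we see the following output: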

22/08/07 12:31:39 INFO SparkContext: Running Spark version 3.3.0
...
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
...
onto_100 download started this may take some time.
Approximate size to download 13.5 MB
...
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The Mona Lisa is ...|[{document, 0, 98...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 12, T...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
PERSON(The Mona Lisa) is a DATE(16th century) oil painting created by PERSON(Leonardo). It's held at the FAC(Louvre) in GPE(Paris).

The result has the following visualization:

[Visualization: the sentence above with The Mona Lisa tagged PERSON, 16th century tagged DATE, Leonardo tagged PERSON, Louvre tagged FAC and Paris tagged GPE]

Here FAC stands for facility (buildings, airports, highways, bridges, etc.) and GPE stands for geo-political entity (countries, cities, states, etc.).

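Sentence Detection

Detecting sentences in text might seem a simple concept at first, but there are numerous special cases.

Consider the following text: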

def text = '''
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951. It describes a method for
measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf. Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings. He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.
'''

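There are full stops at the end of each sentence (though in general, it could also be other punctuation like exclamation marks and question marks). There are also full stops and decimal points in abbreviations, URLs, decimal numbers and so forth. Sentence detection algorithms might have some special hard-coded cases, like "Dr.", "Ms.", or an emoticon, and may also use some heuristics. In general, they might also be trained with examples like the one above.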
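
To see why a naive approach struggles, the sketch below splits on whitespace that follows a full stop and then applies one simple heuristic: re-joining pieces that end in a known abbreviation. The abbreviation list is a hypothetical stand-in for what a real detector hard-codes or learns. It is not how OpenNLP works internally.

```groovy
def sample = 'Dr. Lowry moved to St. Louis. He stayed for 29 years.'

// Naive rule: every full stop followed by whitespace ends a sentence
def naive = sample.split(/(?<=\.)\s+/) as List
println "Naive split found ${naive.size()} pieces"

// Heuristic: a piece ending in a known abbreviation is not a sentence end
def abbreviations = ['Dr.', 'St.', 'Ms.'] as Set
def merged = []
def pending = ''
naive.each { piece ->
    pending = pending ? "$pending $piece" : piece
    def lastWord = pending.tokenize(' ').last()
    if (!(lastWord in abbreviations)) {
        merged << pending.toString()
        pending = ''
    }
}
if (pending) merged << pending.toString()
merged.each { println it }
```

Even this small example needs a curated list, and the list would wrongly merge a sentence that genuinely ends in "St.", which is part of why trained models are attractive.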

Here is some OpenNLP code for detecting the sentences in the text above:

def helper = new ResourceHelper('http://opennlp.sourceforge.net/models-1.5')
def model = new SentenceModel(helper.load('en-sent'))
def detector = new SentenceDetectorME(model)
def sentences = detector.sentDetect(text)
assert text.count('.') == 28
assert sentences.size() == 4
println "Found ${sentences.size()} sentences:\n" + sentences.join('\n\n')

It has the following output:

Downloading en-sent
Found 4 sentences:
The most referenced scientific paper of all time is "Protein measurement with the
Folin phenol reagent" by Lowry, O. H., Rosebrough, N. J., Farr, A. L. & Randall,
R. J. and was published in the J. BioChem. in 1951.

It describes a method for
measuring the amount of protein (even as small as 0.2 γ, were γ is the specific
weight) in solutions and has been cited over 300,000 times and can be found here:
https://www.jbc.org/content/193/1/265.full.pdf.

Dr. Lowry completed
two doctoral degrees under an M.D.-Ph.D. program from the University of Chicago
before moving to Harvard under A. Baird Hastings.

He was also the H.O.D of
Pharmacology at Washington University in St. Louis for 29 years.

We can see here that it handled all of the tricky cases in the example.

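Relationship Extraction with Triples

The next step after detecting named entities and the various parts of speech of certain words is to explore the relationships between them. This is often done in the form of subject-predicate-object triples. In our earlier NER example, for the sentence "The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.", we found various date, time and location named entities.

We can extract triples using the MinIE library (which in turn uses the Stanford CoreNLP library) with the following code: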

def parser = CoreNLPUtils.StanfordDepNNParser()
sentences.each { sentence ->
    def minie = new MinIE(sentence, parser, MinIE.Mode.SAFE)

    println "\nInput sentence: $sentence"
    println '============================='
    println 'Extractions:'
    for (ap in minie.propositions) {
        println "\tTriple: $ap.tripleAsString"
        def attr = ap.attribution.attributionPhrase ? ap.attribution.toStringCompact() : 'NONE'
        println "\tFactuality: $ap.factualityAsString\tAttribution: $attr"
        println '\t----------'
    }
}

The output for the previously mentioned sentence is shown below:

Input sentence: The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.
=============================
Extractions:
        Triple: "conference"    "wrapped up yesterday at"       "5:30 p.m."
        Factuality: (+,CT)      Attribution: NONE
        ----------
        Triple: "conference"    "wrapped up yesterday in"       "Copenhagen"
        Factuality: (+,CT)      Attribution: NONE
        ----------
        Triple: "conference"    "wrapped up"    "yesterday"
        Factuality: (+,CT)      Attribution: NONE

We can now piece together the relationships between the entities we detected earlier.
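
As a sketch of what piecing together might look like in code, the triples from the output above can be held in a small value class and then grouped or filtered. The Triple class below is a hypothetical holder written for this illustration; it is not a MinIE or CoreNLP type, and the values are copied by hand from the output rather than read from the library.

```groovy
import groovy.transform.Canonical

// Hypothetical holder for one subject-relation-object extraction
@Canonical
class Triple {
    String subject
    String relation
    String object
}

// Values copied from the extraction output shown above
def triples = [
    new Triple('conference', 'wrapped up yesterday at', '5:30 p.m.'),
    new Triple('conference', 'wrapped up yesterday in', 'Copenhagen'),
    new Triple('conference', 'wrapped up', 'yesterday')
]

// Group by subject, then list everything we know about each subject
def bySubject = triples.groupBy { it.subject }
bySubject.each { subject, facts ->
    println "$subject: " + facts.collect { "$it.relation $it.object" }.join('; ')
}
```

The objects of these triples ("5:30 p.m.", "Copenhagen", "yesterday") correspond to the time, location and date entities found earlier, which is what links the two analyses.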

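There was also a problematic case among the earlier NER examples: "The parcel was passed from May to June.". Using the previous model, "May to June" was detected as a date. Let's explore that using CoreNLP's triple extraction directly. We won't show the source code here, but CoreNLP supports both simple and more powerful approaches to solving this problem. The output for the sentence in question using the more powerful technique is: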

Sentence #7: The parcel was passed from May to June.
root(ROOT-0, passed-4)
det(parcel-2, The-1)
nsubj:pass(passed-4, parcel-2)
aux:pass(passed-4, was-3)
case(May-6, from-5)
obl:from(passed-4, May-6)
case(June-8, to-7)
obl:to(passed-4, June-8)
punct(passed-4, .-9)

Triples:
1.0 parcel was passed
1.0 parcel was passed to June
1.0 parcel was passed from May to June
1.0 parcel was passed from May

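We can see that this has done a better job of piecing together what entities we have and their relationships.

Sentiment Analysis

Sentiment analysis is an NLP technique used to determine whether data is positive, negative, or neutral. Stanford CoreNLP has default models it uses for this purpose: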

def doc = new Document('''
StanfordNLP is fantastic!
Groovy is great fun!
Math can be hard!
''')
for (sent in doc.sentences()) {
    println "${sent.toString().padRight(40)} ${sent.sentiment()}"
}

It has the following output:

[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.6 sec].
[main] INFO edu.stanford.nlp.sentiment.SentimentModel - Loading sentiment model edu/stanford/nlp/models/sentiment/sentiment.ser.gz ... done [0.1 sec].
StanfordNLP is fantastic!                POSITIVE
Groovy is great fun!                     VERY_POSITIVE
Math can be hard!                        NEUTRAL

We can also train our own. Let's start with two datasets:

def datasets = [
    positive: getClass().classLoader.getResource("rt-polarity.pos").toURI(),
    negative: getClass().classLoader.getResource("rt-polarity.neg").toURI()
]

We'll first use Datumbox which, as we saw earlier, requires training parameters for our algorithm:

def trainingParams = new TextClassifier.TrainingParameters(
    numericalScalerTrainingParameters: null,
    featureSelectorTrainingParametersList: [new ChisquareSelect.TrainingParameters()],
    textExtractorParameters: new NgramsExtractor.Parameters(),
    modelerTrainingParameters: new MultinomialNaiveBayes.TrainingParameters()
)

We now create our algorithm, train it with our training dataset, and, for illustrative purposes, validate it against the training dataset:

def config = Configuration.configuration
TextClassifier classifier = MLBuilder.create(trainingParams, config)
classifier.fit(datasets)
def metrics = classifier.validate(datasets)
println "Classifier Accuracy (using training data): $metrics.accuracy"

The output is shown here:

[main] INFO com.datumbox.framework.core.common.dataobjects.Dataframe$Builder - Dataset Parsing positive class
[main] INFO com.datumbox.framework.core.common.dataobjects.Dataframe$Builder - Dataset Parsing negative class
...
Classifier Accuracy (using training data): 0.8275959103273615

We can now test our model against several sentences:

['Datumbox is divine!', 'Groovy is great fun!', 'Math can be hard!'].each {
    def r = classifier.predict(it)
    def predicted = r.YPredicted
    def probability = sprintf '%4.2f', r.YPredictedProbabilities.get(predicted)
    println "Classifying: '$it',  Predicted: $predicted,  Probability: $probability"
}

It has the following output:

...
[main] INFO com.datumbox.framework.applications.nlp.TextClassifier - predict()
...
Classifying: 'Datumbox is divine!',  Predicted: positive,  Probability: 0.83
Classifying: 'Groovy is great fun!',  Predicted: positive,  Probability: 0.80
Classifying: 'Math can be hard!',  Predicted: negative,  Probability: 0.95

We can do the same thing but with OpenNLP. First, we collect our input data. OpenNLP expects it in a single dataset with tagged examples:

def trainingCollection = datasets.collect { k, v ->
    new File(v).readLines().collect{"$k $it".toString() }
}.sum()
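
To make the resulting shape concrete: each entry of the combined dataset starts with the category, followed by a space and the example text. The sketch below builds the same shape from a small in-memory map instead of files. The sample phrases are made up for illustration.

```groovy
// Made-up samples standing in for the contents of the two data files
def samples = [
    positive: ['a joy to watch', 'great fun'],
    negative: ['a dull mess']
]

// Prefix every line with its category and flatten into a single list
def labelled = samples.collectMany { label, lines ->
    lines.collect { "$label $it".toString() }
}
labelled.each { println it }
```

Using collectMany here is just an alternative to the collect followed by sum() in the code above; both flatten the per-category lists into one list.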

Now, we'll train two models. One uses Naïve Bayes, the other Maximum Entropy:

def variants = [
        Maxent    : new TrainingParameters(),
        NaiveBayes: new TrainingParameters((CUTOFF_PARAM): '0', (ALGORITHM_PARAM): NAIVE_BAYES_VALUE)
]
def models = [:]
variants.each{ key, trainingParams ->
    def trainingStream = new CollectionObjectStream(trainingCollection)
    def sampleStream = new DocumentSampleStream(trainingStream)
    println "\nTraining using $key"
    models[key] = DocumentCategorizerME.train('en', sampleStream, trainingParams, new DoccatFactory())
}

Now we run sentiment predictions on our sample sentences using both variants:

def w = sentences*.size().max()

variants.each { key, params ->
    def categorizer = new DocumentCategorizerME(models[key])
    println "\nAnalyzing using $key"
    sentences.each {
        def result = categorizer.categorize(it.split('[ !]'))
        def category = categorizer.getBestCategory(result)
        def prob = sprintf '%4.2f', result[categorizer.getIndex(category)]
        println "${it.padRight(w)} $category ($prob)"
    }
}

When we run this, we get:

Training using Maxent …done.
…

Training using NaiveBayes …done.
…

Analyzing using Maxent
OpenNLP is fantastic! positive (0.64)
Groovy is great fun! positive (0.74)
Math can be hard! negative (0.61)

Analyzing using NaiveBayes
OpenNLP is fantastic! positive (0.72)
Groovy is great fun! positive (0.81)
Math can be hard! negative (0.72)

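The models here appear to have lower probability levels compared to the model we trained for Datumbox. We could try tweaking the training parameters further if this was a problem. We'd probably also need a bigger test set to convince ourselves of the relative merits of each model. Some models can be over-trained on small datasets and perform very well with data similar to their training datasets but much worse on other data.

The next example is inspired by the UniversalSentenceEncoder example in the DJL examples module. It looks at using the universal sentence encoder model from TensorFlow Hub via the DeepJavaLibrary (DJL) API.

First, we define a translator. The Translator interface allows us to specify pre- and post-processing functionality.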

class MyTranslator implements NoBatchifyTranslator<String[], double[][]> {
    @Override
    NDList processInput(TranslatorContext ctx, String[] raw) {
        var factory = ctx.NDManager
        var inputs = new NDList(raw.collect(factory::create))
        new NDList(NDArrays.stack(inputs))
    }

    @Override
    double[][] processOutput(TranslatorContext ctx, NDList list) {
        long numOutputs = list.singletonOrThrow().shape.get(0)
        NDList result = []
        for (i in 0..<numOutputs) {
            result << list.singletonOrThrow().get(i)
        }
        result*.toFloatArray() as double[][]
    }
}

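Here, we manually pack our input sentences into the required n-dimensional data types, and extract our output calculations into a 2D double array.

Next, we create our predict method by first defining the criteria for our prediction algorithm. We are going to use our translator, use the TensorFlow engine, use a predefined sentence encoder model from TensorFlow Hub, and indicate that we are creating a text embedding application: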

def predict(String[] inputs) {
    String modelUrl = "https://storage.googleapis.com/tfhub-modules/google/universal-sentence-encoder/4.tar.gz"

    Criteria<String[], double[][]> criteria =
        Criteria.builder()
            .optApplication(Application.NLP.TEXT_EMBEDDING)
            .setTypes(String[], double[][])
            .optModelUrls(modelUrl)
            .optTranslator(new MyTranslator())
            .optEngine("TensorFlow")
            .optProgress(new ProgressBar())
            .build()
    try (var model = criteria.loadModel()
         var predictor = model.newPredictor()) {
        predictor.predict(inputs)
    }
}

Next, let's define our input strings:

String[] inputs = [
    "Cycling is low impact and great for cardio",
    "Swimming is low impact and good for fitness",
    "Palates is good for fitness and flexibility",
    "Weights are good for strength and fitness",
    "Orchids can be tricky to grow",
    "Sunflowers are fun to grow",
    "Radishes are easy to grow",
    "The taste of radishes grows on you after a while",
]
var k = inputs.size()

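Now, we'll use our predict method to calculate the embeddings for each sentence. We'll print out the embeddings and also calculate the dot product of each pair of embeddings. The dot product (the same as the inner product for this case) reveals how related the sentences are.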

var embeddings = predict(inputs)

var z = new double[k][k]
for (i in 0..<k) {
    println "Embedding for: ${inputs[i]}\n${embeddings[i]}"
    for (j in 0..<k) {
        z[i][j] = dot(embeddings[i], embeddings[j])
    }
}
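
The dot function isn't defined in the snippet above; it is presumably imported from a library. As an assumption for illustration, here is a plain Groovy version, together with cosine similarity, which divides out the vector lengths. For unit-length vectors the two measures coincide.

```groovy
// Plain Groovy stand-in for the dot function used above
double dot(double[] a, double[] b) {
    assert a.length == b.length
    double sum = 0
    for (i in 0..<a.length) sum += a[i] * b[i]
    sum
}

// Cosine similarity: the dot product after normalising both vectors
double cosine(double[] a, double[] b) {
    dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b))
}

// Tiny made-up vectors; real sentence embeddings have many more dimensions
def u = [1.0d, 0.0d, 1.0d] as double[]
def v = [1.0d, 0.0d, 1.0d] as double[]
def w = [0.0d, 1.0d, 0.0d] as double[]
println "same direction: dot=${dot(u, v)}, cosine=${cosine(u, v)}"
println "orthogonal:     dot=${dot(u, w)}, cosine=${cosine(u, w)}"
```

Vectors pointing the same way score high and orthogonal vectors score zero, which is the property the similarity matrix z relies on.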

Finally, we'll use the Heatmap class from Smile to present a nice display highlighting what the data reveals:

new Heatmap(inputs, inputs, z, Palette.heat(20).reverse()).canvas().with {
    title = 'Semantic textual similarity'
    setAxisLabels('', '')
    window()
}

The output shows us the embeddings:

Loading:     100% |========================================|
2022-08-07 17:10:43.212697: ... This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
...
2022-08-07 17:10:52.589396: ... SavedModel load for tags { serve }; Status: success: OK...
...
Embedding for: Cycling is low impact and great for cardio
[-0.02865048497915268, 0.02069241739809513, 0.010843578726053238, -0.04450441896915436, ...]
...
Embedding for: The taste of radishes grows on you after a while
[0.015841705724596977, -0.03129228577017784, 0.01183396577835083, 0.022753292694687843, ...]

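The embeddings are an indication of similarity. Two sentences with similar meaning typically have similar embeddings.

The displayed graphic is shown below:

[Image: Heatmap plot of sentence encodings]

This graphic shows that our first four sentences are somewhat related, as are the last four sentences, but that there is minimal relationship between those two groups.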

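Conclusion

We have looked at a range of NLP examples using various NLP libraries. Hopefully you can see some cases where you could use additional NLP technologies in some of your own applications.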
