How to Perform Text Summarization using Transformers in Python?

Learn how to use the HuggingFace Transformers and PyTorch libraries to summarize long text, using the pipeline API and the T5 transformer model in Python, with some worked examples. Text summarization is the task of shortening long text into a concise summary that preserves key information and overall meaning. Two different approaches are widely used for text summarization:
  • Extractive summarization: the model identifies the important sentences and phrases in the original text and outputs only those.
  • Abstractive summarization: the model produces a completely different text, shorter than the original, generating new sentences in a new form, just like a human would. This is the approach we will implement in this tutorial, using transformers.
In this tutorial, we will use the HuggingFace transformers library in Python to perform abstractive text summarization on any text we want. We chose HuggingFace's Transformers because it provides thousands of pre-trained models, not only for text summarization but for a wide variety of NLP tasks such as text classification, question answering, machine translation, text generation, chatbots, and more. Related: How to Fine Tune BERT for Text Classification using Transformers in Python. First, let's install the required libraries:
pip3 install transformers torch sentencepiece

Using the Pipeline API

The most straightforward way to use models in transformers is with the pipeline API:
from transformers import pipeline

# using pipeline API for summarization task
summarization = pipeline("summarization")
original_text = """
Paul Walker is hardly the first actor to die during a production. 
But Walker's death in November 2013 at the age of 40 after a car crash was especially eerie given his rise to fame in the "Fast and Furious" film franchise. 
The release of "Furious 7" on Friday offers the opportunity for fans to remember -- and possibly grieve again -- the man that so many have praised as one of the nicest guys in Hollywood. 
"He was a person of humility, integrity, and compassion," military veteran Kyle Upham said in an email to CNN. 
Walker secretly paid for the engagement ring Upham shopped for with his bride. 
"We didn't know him personally but this was apparent in the short time we spent with him. 
I know that we will never forget him and he will always be someone very special to us," said Upham. 
The actor was on break from filming "Furious 7" at the time of the fiery accident, which also claimed the life of the car's driver, Roger Rodas. 
Producers said early on that they would not kill off Walker's character, Brian O'Connor, a former cop turned road racer. Instead, the script was rewritten and special effects were used to finish scenes, with Walker's brothers, Cody and Caleb, serving as body doubles. 
There are scenes that will resonate with the audience -- including the ending, in which the filmmakers figured out a touching way to pay tribute to Walker while "retiring" his character. At the premiere Wednesday night in Hollywood, Walker's co-star and close friend Vin Diesel gave a tearful speech before the screening, saying "This movie is more than a movie." "You'll feel it when you see it," Diesel said. "There's something emotional that happens to you, where you walk out of this movie and you appreciate everyone you love because you just never know when the last day is you're gonna see them." There have been multiple tributes to Walker leading up to the release. Diesel revealed in an interview with the "Today" show that he had named his newborn daughter after Walker. 
Social media has also been paying homage to the late actor. A week after Walker's death, about 5,000 people attended an outdoor memorial to him in Los Angeles. Most had never met him. Marcus Coleman told CNN he spent almost $1,000 to truck in a banner from Bakersfield for people to sign at the memorial. "It's like losing a friend or a really close family member ... even though he is an actor and we never really met face to face," Coleman said. "Sitting there, bringing his movies into your house or watching on TV, it's like getting to know somebody. It really, really hurts." Walker's younger brother Cody told People magazine that he was initially nervous about how "Furious 7" would turn out, but he is happy with the film. "It's bittersweet, but I think Paul would be proud," he said. CNN's Paul Vercammen contributed to this report.
"""
summary_text = summarization(original_text)[0]['summary_text']
print("Summary:", summary_text)
Note that the first time you execute this, the model architecture and weights are downloaded, along with the tokenizer configuration. We specified the "summarization" task for the pipeline and then simply passed our long text to it. Here is the output:
Summary:  Paul Walker died in November 2013 after a car crash in Los Angeles . 
The late actor was one of the nicest guys in Hollywood . 
The release of "Furious 7" on Friday offers a chance to grieve again . 
There have been multiple tributes to Walker leading up to the film's release .
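The pipeline also accepts generation parameters, so you can control the summary length directly from the call. Here is a minimal sketch (the length bounds below are arbitrary example values, not taken from the original run):

# pass generation parameters directly to the pipeline call
summary_text = summarization(original_text, max_length=130, min_length=30, do_sample=False)[0]['summary_text']
print("Summary:", summary_text)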
Here is another example:
print("="*50)
# another example
original_text = """
For the first time in eight years, a TV legend returned to doing what he does best. 
Contestants told to "come on down!" on the April 1 edition of "The Price Is Right" encountered not host Drew Carey but another familiar face in charge of the proceedings. 
Instead, there was Bob Barker, who hosted the TV game show for 35 years before stepping down in 2007. 
Looking spry at 91, Barker handled the first price-guessing game of the show, the classic "Lucky Seven," before turning hosting duties over to Carey, who finished up. 
Despite being away from the show for most of the past eight years, Barker didn't seem to miss a beat.
"""
summary_text = summarization(original_text)[0]['summary_text']
print("Summary:", summary_text)
Output:
==================================================
Summary:  Bob Barker returns to "The Price Is Right" for the first time in eight years .
The 91-year-old hosted the show for 35 years before stepping down in 2007 . 
Drew Carey finished up hosting duties on the April 1 edition of the game show . 
Barker handled the first price-guessing game of the show .
Note: the examples above are from the CNN/DailyMail dataset. As you can see, the model generated an entirely new summary text that does not appear in the original. This is the quickest way to use transformers.
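You can also pin the pipeline to a specific checkpoint rather than relying on the task's default model. A minimal sketch (t5-base is just one of many summarization-capable checkpoints on the Hugging Face Hub):

# use an explicit checkpoint instead of the default summarization model
summarization = pipeline("summarization", model="t5-base", tokenizer="t5-base")

In the next section, we will learn another way to perform text summarization and customize how the output is generated.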

Using the T5 Model

The following code cell initializes the T5 transformer model along with its tokenizer:
from transformers import T5ForConditionalGeneration, T5Tokenizer

# initialize the model architecture and weights
model = T5ForConditionalGeneration.from_pretrained("t5-base")
# initialize the model tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-base")
Executing the above code for the first time downloads the t5-base model architecture and weights, the tokenizer vocabulary, and its configuration. We use the from_pretrained() method to load it as a pre-trained model. T5 comes in three sizes in this library: t5-small, which is a smaller version of t5-base, and t5-large, which is larger and more accurate than the others. If you want to summarize in a language other than English and none of the available models covers it, consider pre-training a model from scratch on your own dataset; this tutorial will help you do that.
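If you are constrained on memory or want faster inference, the smaller checkpoint loads the same way; only the name changes (this sketch just swaps the checkpoint string):

# t5-small: faster and lighter, at some cost in summary quality
small_model = T5ForConditionalGeneration.from_pretrained("t5-small")
small_tokenizer = T5Tokenizer.from_pretrained("t5-small")

Let's set the text we want to summarize: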
article = """
Justin Timberlake and Jessica Biel, welcome to parenthood. 
The celebrity couple announced the arrival of their son, Silas Randall Timberlake, in statements to People. 
"Silas was the middle name of Timberlake's maternal grandfather Bill Bomar, who died in 2012, while Randall is the musician's own middle name, as well as his father's first," People reports. 
The couple announced the pregnancy in January, with an Instagram post. It is the first baby for both.
"""
Now let's encode this text so it is suitable as input to the model:
# encode the text into tensor of integers using the appropriate tokenizer
inputs = tokenizer.encode("summarize: " + article, return_tensors="pt", max_length=512, truncation=True)
We used the tokenizer.encode() method to convert the string text into a list of integers, where each integer is a unique token ID. We set max_length to 512, indicating that we don't want the original text to exceed 512 tokens, and we set return_tensors to "pt" to get a PyTorch tensor as output. Notice that we prepended the text with "summarize: "; that's because T5 isn't only for text summarization. You can use it for essentially any text-to-text transformation, such as machine translation or question answering. For example, to use the T5 transformer for machine translation, you can set the prefix "translate English to German: " instead of "summarize: " and you will get German output (more precisely, you will get a summarized German translation, and you will see why in model.generate()).
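As a quick illustration of that text-to-text flexibility, here is a minimal sketch of the same model translating instead of summarizing (the input sentence is made up for the example, and the exact German output depends on the model version):

# the same model handles translation when given a different task prefix
translation_inputs = tokenizer.encode("translate English to German: I love programming.", return_tensors="pt")
translation_outputs = model.generate(translation_inputs, max_length=40)
print(tokenizer.decode(translation_outputs[0], skip_special_tokens=True))

Finally, let's generate the summary text and print it: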
# generate the summarization output
outputs = model.generate(
    inputs, 
    max_length=150, 
    min_length=40, 
    length_penalty=2.0, 
    num_beams=4, 
    early_stopping=True)
# just for debugging
print(outputs)
print(tokenizer.decode(outputs[0]))
Output:
tensor([[    0,     8,  1158,  2162,     8,  8999,    16,  1762,     3,     5,
            34,    19,     8,   166,  1871,    21,   321,    13,   135,     3,
             5,     8,  1871,    19,     8,  2214,   564,    13, 25045, 16948,
            31,     7, 28574, 18573,     6,   113,  3977,    16,  1673,     3,
             5]])
the couple announced the pregnancy in January. it is the first baby for both of them. 
the baby is the middle name of Timberlake's maternal grandfather, who died in 2012.
Awesome! The output looks concise and is newly generated with a novel summarization style. Moving on to the most interesting part, the parameters passed to the model.generate() method are:
  • max_length: the maximum number of tokens to generate; we specified 150 in total, and you can change it as you wish.
  • min_length: the minimum number of tokens to generate. If you look closely at the tensor output above, you can count a total of 41 tokens, so it satisfies the 40 we specified. Note that this also applies if you set the model to another task, such as English-to-German translation.
  • length_penalty: an exponential penalty on the length; 1.0 means no penalty. Increasing this parameter increases the length of the output text.
  • num_beams: specifying this parameter makes the model use beam search instead of greedy search. Setting num_beams to 4 lets the model look ahead at 4 possible continuations (instead of 1 in greedy search), keeping the 4 most likely hypotheses at each time step and choosing the one with the highest overall probability.
  • early_stopping: we set it to True so that generation finishes when all beam hypotheses reach the end-of-string token (EOS).
We then used the tokenizer's decode() method to convert the output tensor back to human-readable text. Also learn: Conversational AI Chatbot with Transformers in Python.
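If you prefer more varied phrasing over deterministic beam search, you can also switch model.generate() to sampling. A minimal sketch (the top_k and top_p values below are illustrative, not tuned):

# sampling instead of beam search; the output will differ from run to run
sampled_outputs = model.generate(
    inputs,
    max_length=150,
    min_length=40,
    do_sample=True,  # enable sampling instead of greedy/beam search
    top_k=50,        # consider only the 50 most likely next tokens
    top_p=0.95,      # ...further restricted to the smallest set with 95% cumulative probability
)
print(tokenizer.decode(sampled_outputs[0], skip_special_tokens=True))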

Conclusion

There are many other parameters to tweak in the model.generate() method; I highly suggest checking out this tutorial from the HuggingFace blog. That's it for this tutorial! You have learned two ways to perform text summarization using HuggingFace's transformers library; check out the documentation here.
木子山
