Building a Conversational AI Chatbot in Python with Huggingface Transformers

November 11, 2021, 02:33:39

Learn how to use the Huggingface Transformers library to generate conversational responses with the pre-trained DialoGPT model in Python.

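Chatbots have become very popular in recent years, and as interest in using chatbots for business grows, researchers have also made good progress in advancing conversational AI chatbots.

In this tutorial, we'll use the Huggingface Transformers library to generate conversational responses with the pre-trained DialoGPT model.

DialoGPT is a large-scale tunable neural conversational response generation model that was trained on 147 million conversations extracted from Reddit. The good thing is that you can fine-tune it on your own dataset to get better performance than training from scratch.

First, let's install transformers: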

$ pip3 install transformers

Open a new Python file or notebook and do the following:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# model_name = "microsoft/DialoGPT-large"
model_name = "microsoft/DialoGPT-medium"
# model_name = "microsoft/DialoGPT-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

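DialoGPT comes in three versions: small, medium, and large. The larger models generally produce better responses, but if you run the model on your own machine, the small or medium version should fit in memory without problems. You can also use Google Colab to try the large one.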

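Generating Responses with Greedy Search

In this section, we'll use the greedy search algorithm to generate responses. That is, at each time step we pick the token with the highest probability as the next token of the chatbot response.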
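To make the selection rule concrete, here is a minimal sketch of a single greedy step on a toy distribution. The tokens and probabilities are made up for illustration and do not come from DialoGPT.

```python
# Toy next-token distribution; tokens and probabilities are invented.
next_token_probs = {"rich": 0.55, "happy": 0.25, "poor": 0.15, "tired": 0.05}

def greedy_pick(probs):
    """Return the token with the highest probability."""
    return max(probs, key=probs.get)

print(greedy_pick(next_token_probs))  # rich
```

Because the rule is deterministic, the same context always yields the same token, which is one reason greedy decoding tends to repeat itself.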

Let's write the code to chat with our AI using greedy search:

# chatting 5 times with greedy search
for step in range(5):
    # take user input
    text = input(">> You:")
    # encode the input and add end of string token
    input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
    # concatenate new user input with chat history (if there is)
    bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
    # generate a bot response
    chat_history_ids = model.generate(
        bot_input_ids,
        max_length=1000,
        pad_token_id=tokenizer.eos_token_id,
    )
    #print the output
    output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"DialoGPT: {output}")

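Let's explain the core of this code:

We first take input from the user for chatting.
We encode the text into input_ids using the DialoGPT tokenizer; we also append the end-of-string token and return the result as a PyTorch tensor.
If this is the first turn of the chat, we feed input_ids directly to the model for generation. Otherwise, we append the new input to the chat history using torch.cat().
After that, we use the model.generate() method to generate the chatbot response.
Finally, since the returned output is also a tokenized sequence, we decode it with tokenizer.decode(), setting skip_special_tokens to True so that we don't see any annoying special tokens such as <|endoftext|>. Also, since the model returns the whole sequence, we skip the previous chat history and print only the newly generated chatbot answer.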
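The history handling in the steps above can be sketched with plain Python lists standing in for the tensors. All token ids below are hypothetical placeholders, and list concatenation and slicing are only analogies for torch.cat() and the tensor slice used in the loop.

```python
# Hypothetical token ids standing in for the real tensors.
EOS = 50256                              # placeholder end-of-string id
chat_history_ids = [11, 12, 13, EOS]     # previous turns
input_ids = [21, 22, EOS]                # new user message + EOS

# Analogue of torch.cat([chat_history_ids, input_ids], dim=-1)
bot_input_ids = chat_history_ids + input_ids

# Pretend the model returned its input followed by a reply and EOS.
generated = bot_input_ids + [31, 32, 33, EOS]

# Analogue of chat_history_ids[:, bot_input_ids.shape[-1]:]
new_tokens = generated[len(bot_input_ids):]
print(new_tokens)  # [31, 32, 33, 50256]
```

The slice starts at the length of the input, so only the tokens produced in this turn are decoded and printed, while the full sequence is kept as the history for the next turn.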

Here is an example discussion with the bot:

>> You:How can you be rich so quickly?
DialoGPT: I'm not rich, I'm just a rich man.
>> You:Ok, you don't make sense, how old are you ?
DialoGPT: I'm not a rich man, I'm just a rich man.
>> You:How can you be happy ?
DialoGPT: I'm not happy, I'm just a rich man.
>> You:But they say money buys happiness!
DialoGPT: Money buys happiness, but not happiness itself.
>> You:You don't make sense again!
DialoGPT: I'm not happy, I'm just a rich man.

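You can see that the model repeats many of its responses: these are the highest-probability responses, and greedy search picks them every time.

By default, model.generate() uses the greedy search algorithm when no other parameters are set. In the next sections, we'll add some arguments to this method to see whether we can improve the generation.

Learn also: how to pre-train BERT using Transformers in Python.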

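Generating Responses with Beam Search

Beam search reduces the risk of missing high-probability sequences by keeping the num_beams most likely hypotheses at each time step and then taking the sequence with the highest overall probability. The code below generates chatbot responses using beam search: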

# chatting 5 times with beam search
for step in range(5):
    # take user input
    text = input(">> You:")
    # encode the input and add end of string token
    input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
    # concatenate new user input with chat history (if there is)
    bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
    # generate a bot response
    chat_history_ids = model.generate(
        bot_input_ids,
        max_length=1000,
        num_beams=3,
        early_stopping=True,
        pad_token_id=tokenizer.eos_token_id
    )
    #print the output
    output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"DialoGPT: {output}")

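When num_beams is set to 3 in the model.generate() method, we keep the 3 most likely hypotheses at each time step and extend them to find the sequence with the highest overall probability. Setting num_beams to 1 is the same as greedy search.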
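To illustrate why keeping several hypotheses helps, here is a minimal beam search sketch over a tiny hand-written probability table. The table, the tokens, and the beam_search helper are all hypothetical; this is not how model.generate() is implemented internally, only the same idea at toy scale.

```python
import math

# Hypothetical conditional distributions: TABLE[prefix] -> {token: prob}
TABLE = {
    (): {"I": 0.6, "You": 0.4},
    ("I",): {"am": 0.55, "can": 0.45},
    ("You",): {"can't": 0.9, "are": 0.1},
}

def beam_search(table, num_beams, steps):
    """Return the best sequence found while keeping num_beams hypotheses."""
    beams = [((), 0.0)]  # (sequence, log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, prob in table[seq].items():
                candidates.append((seq + (token,), score + math.log(prob)))
        # Keep only the num_beams highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][0]

print(beam_search(TABLE, num_beams=1, steps=2))  # ('I', 'am')
print(beam_search(TABLE, num_beams=2, steps=2))  # ('You', "can't")
```

With one beam the search commits to "I" (0.6) and ends with an overall probability of 0.33, while two beams keep "You" alive long enough to find the sequence with probability 0.36.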

Here is a sample discussion with the chatbot using beam search:

>> You:How can I be rich?
DialoGPT: You can't.
>> You:Why not?
DialoGPT: Because you can't.
>> You:Then how am I supposed to live well?
DialoGPT: You can't.
>> You:So basically, you want me to be a poor guy?
DialoGPT: No, I want you to be a rich guy.
>> You:Ok how to be rich, then?
DialoGPT: I don't know.

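Generating Responses with Sampling

In the previous sections, we used beam search and greedy search to generate the highest-probability sequence. That works well for tasks such as machine translation or text summarization, where the output is predictable. However, it is not the best option for open-ended generation such as chatbots.

For better generation, we need to introduce some randomness: we sample from a wide range of candidate sequences based on their probabilities: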

# chatting 5 times with sampling
for step in range(5):
    # take user input
    text = input(">> You:")
    # encode the input and add end of string token
    input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
    # concatenate new user input with chat history (if there is)
    bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
    # generate a bot response
    chat_history_ids = model.generate(
        bot_input_ids,
        max_length=1000,
        do_sample=True,
        top_k=0,
        pad_token_id=tokenizer.eos_token_id
    )
    #print the output
    output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"DialoGPT: {output}")

This time, we set do_sample to True for sampling, and we set top_k to 0 to indicate that we are sampling from all possible tokens. We'll discuss the top_k parameter later.
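As a rough picture of what sampling from the full distribution means, the sketch below draws tokens from a toy distribution in proportion to their probabilities. The distribution and the sample_token helper are assumptions made for illustration, and the seed is fixed only so that the run is repeatable.

```python
import random

# Toy next-token distribution; tokens and probabilities are invented.
next_token_probs = {"rich": 0.55, "happy": 0.25, "poor": 0.15, "tired": 0.05}

def sample_token(probs, rng):
    """Draw one token with probability proportional to its weight."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_token(next_token_probs, rng) for _ in range(1000)]
counts = {t: samples.count(t) for t in next_token_probs}
print(counts)
```

Unlike the greedy rule, low-probability tokens such as "tired" are still picked now and then, which is where both the variety and the occasional odd reply come from.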

Here is a chat with these parameters:

>> You:Hello, how can I be rich?
DialoGPT: You can't
>> You:Really? then how others get rich?
DialoGPT: They work as money artists
>> You:How to be a money artist?
DialoGPT: I don't know, I don't have enough knowledge on artists
>> You:Alright, thank you!
DialoGPT: No problem, if you need tips pm me.
>> You:Great, you're my savior!
DialoGPT: Thank you for helping a stranger out like me!

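There are clearly some improvements. However, sampling from a wide range of low-probability sequences can lead to random generation (as you can see in the last response).

To improve it further, we can:

Lower the sampling temperature, which decreases the likelihood of picking low-probability words and increases the likelihood of picking high-probability ones.
Use Top-k sampling instead of sampling from all possible candidates, which prevents low-probability words from being picked.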

# chatting 5 times with Top K sampling & tweaking temperature
for step in range(5):
    # take user input
    text = input(">> You:")
    # encode the input and add end of string token
    input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
    # concatenate new user input with chat history (if there is)
    bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
    # generate a bot response
    chat_history_ids = model.generate(
        bot_input_ids,
        max_length=1000,
        do_sample=True,
        top_k=100,
        temperature=0.75,
        pad_token_id=tokenizer.eos_token_id
    )
    #print the output
    output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"DialoGPT: {output}")

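Now we set top_k to 100 to sample from the top 100 words sorted in descending order of probability. We also set temperature to 0.75 (the default is 1.0) to give high-probability words a better chance of being picked. As the temperature approaches 0, sampling behaves like greedy search; as it grows toward infinity, sampling becomes completely random. Here is a discussion with these parameters: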

>> You:How can I get rich?
DialoGPT: I'd also like to know this.
>> You:So you don't know it?
DialoGPT: I know it, but it doesn't matter.
>> You:Really? money matters, man!
DialoGPT: Well, money is money and I gotta get my money.
>> You:Yeah how to get them?
DialoGPT: Well, that's a question. I'd like to know more, just for curiosity's sake, to know what other people have said about how I should get rich.
>> You:Some robots like you told me you have to buy gold, silver or stocks.
DialoGPT: I think my wallet is empty.

As you can see, it is much better now. Feel free to tweak the temperature and top_k parameters and see whether you can improve it further.
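To get a feel for what these two parameters do before tweaking them, here is a small sketch that applies a temperature and a Top-k cutoff to toy logits. The logits and both helper functions are hypothetical; the library applies its own processing inside model.generate(), so treat this only as an illustration of the idea.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = {t: value / temperature for t, value in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scaled.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}

# Toy logits; the values are invented.
logits = {"rich": 2.0, "happy": 1.0, "poor": 0.5, "tired": -1.0}
base = apply_temperature(logits, 1.0)
sharp = apply_temperature(logits, 0.75)
print(base["rich"], sharp["rich"])
print(top_k_filter(sharp, 2))
```

A temperature below 1.0 moves probability mass toward the most likely token, and the Top-k cutoff removes the tail entirely before sampling.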

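Nucleus Sampling

Nucleus sampling, or Top-p sampling, chooses from the smallest possible set of words whose cumulative probability exceeds the parameter p that we set.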
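The definition can be sketched on a toy distribution as follows. The probabilities and the top_p_filter helper are made up for illustration, and the cutoff here stops as soon as the running total reaches or exceeds p, which is one reasonable reading of the rule rather than the library's exact implementation.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches or exceeds p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Toy next-token distribution; tokens and probabilities are invented.
next_token_probs = {"rich": 0.55, "happy": 0.25, "poor": 0.15, "tired": 0.05}
print(sorted(top_p_filter(next_token_probs, 0.9)))  # ['happy', 'poor', 'rich']
```

Unlike Top-k, the number of surviving tokens is not fixed: a peaked distribution keeps only a few candidates, while a flat one keeps many.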

Here is the chat loop using Top-p sampling:

# chatting 5 times with nucleus sampling & tweaking temperature
for step in range(5):
    # take user input
    text = input(">> You:")
    # encode the input and add end of string token
    input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
    # concatenate new user input with chat history (if there is)
    bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
    # generate a bot response
    chat_history_ids = model.generate(
        bot_input_ids,
        max_length=1000,
        do_sample=True,
        top_p=0.95,
        top_k=0,
        temperature=0.75,
        pad_token_id=tokenizer.eos_token_id
    )
    #print the output
    output = tokenizer.decode(chat_history_ids[:, bot_input_ids.shape[-1]:][0], skip_special_tokens=True)
    print(f"DialoGPT: {output}")

We set top_k to 0 to disable Top-k sampling, but you can use both methods together, which tends to work better. Here is a chat:

>> You:How can I be rich fast?
DialoGPT: Inflation and free money
>> You:lol, but that takes time, I need fast money.
DialoGPT: Your post makes no sense. You just need a lot of money to be rich fast
>> You:Ok how to get a lot of money?
DialoGPT: Buy gold and then trade it for money.
>> You:Hmm, I got your point, thanks.
DialoGPT: No problem. You have to have money to buy gold, but its only good for trading it.
>> You:Yeah, I understand, thank you so much.
DialoGPT: Ok, let me know if you need help

Now the chatbot clearly makes sense in many cases.

Now let's add some code to generate more than one chatbot response, and then we choose which response to include in the next input:

# chatting 5 times with nucleus & top-k sampling & tweaking temperature & multiple
# sentences
for step in range(5):
    # take user input
    text = input(">> You:")
    # encode the input and add end of string token
    input_ids = tokenizer.encode(text + tokenizer.eos_token, return_tensors="pt")
    # concatenate new user input with chat history (if there is)
    bot_input_ids = torch.cat([chat_history_ids, input_ids], dim=-1) if step > 0 else input_ids
    # generate a bot response
    chat_history_ids_list = model.generate(
        bot_input_ids,
        max_length=1000,
        do_sample=True,
        top_p=0.95,
        top_k=50,
        temperature=0.75,
        num_return_sequences=5,
        pad_token_id=tokenizer.eos_token_id
    )
    #print the outputs
    for i in range(len(chat_history_ids_list)):
        output = tokenizer.decode(chat_history_ids_list[i][bot_input_ids.shape[-1]:], skip_special_tokens=True)
        print(f"DialoGPT {i}: {output}")
    choice_index = int(input("Choose the response you want for the next input: "))
    chat_history_ids = torch.unsqueeze(chat_history_ids_list[choice_index], dim=0)

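I have set num_return_sequences to 5 to return 5 sentences at a time, and we have to choose the one that will be included in the next sequence. Here is how it went: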

>> You:Hello, friend.
DialoGPT 0: Hey there!
DialoGPT 1: I am your friend, and I am also your friend.
DialoGPT 2: Hello. :
DialoGPT 3: Hiya! How are you?
DialoGPT 4: Hello. I like your username.
Choose the response you want for the next input: 3
>> You:Fine, thanks, what about you?
DialoGPT 0: Good. Not much to do.
DialoGPT 1: Not much, you?
DialoGPT 2: Fine, thanks.
DialoGPT 3: I'm okay, thank you for asking.
DialoGPT 4: I'm fine too.
Choose the response you want for the next input: 2
>> You:Can you tell me how can I get rich quickly?
DialoGPT 0: Yes, I can!
DialoGPT 1: Yes, just look up how to get rich on Google.
DialoGPT 2: Just keep doing your thing. You'll get rich.
DialoGPT 3: No, I can't.
DialoGPT 4: no, i have no money
Choose the response you want for the next input: 1
>> You:Yeah, but I thought you have an answer...
DialoGPT 0: I'm getting married in the morning.
DialoGPT 1: How do you get rich?
DialoGPT 2: I think there are many good answers, but I could send you one, if you want.
DialoGPT 3: I will, thanks!
DialoGPT 4: I thought you have an answer.
Choose the response you want for the next input: 2
>> You:Yes, please send me.
DialoGPT 0: You can send me link.
DialoGPT 1: I am very interested
DialoGPT 2: I sent you a PM
DialoGPT 3: I'll send you a PM
DialoGPT 4: I am always interested in new ideas.
Choose the response you want for the next input: 2

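Conclusion

Hopefully this tutorial helped you understand how to generate text with DialoGPT and similar models. For more information on text generation, I highly recommend reading the How to generate text with Transformers guide.

I'll leave it to you to tweak the parameters and see whether you can make the bot perform better.

Also, you can combine this with text-to-speech and speech-to-text tutorials to build a virtual assistant like Alexa, Siri, or Cortana.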
