Iterating Over Every Line of a File in Python: How to Read a File Line by Line

September 11, 2021, 22:02:07

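Introduction

A common task in Python programming is opening a file and parsing its contents. What do you do when the file you are trying to process is very large, say several GB of data or more? The answer is to read the file one chunk at a time: read a chunk, process it, release it from memory, and move on to the next chunk until the whole file has been processed. You can choose whatever chunk size suits the data you are working with, but for many applications processing one line at a time is a good fit. In this article we walk through several code examples that demonstrate reading a file line by line in Python. If you would like to try the examples yourself, the code used in this article is available in the accompanying GitHub repository.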
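When lines are not the natural unit (binary data, for instance), the chunk-at-a-time idea described above can be sketched with fixed-size reads. This is a minimal illustration rather than code from the article: the helper name iter_chunks, the chunk size, and the throwaway demo file are all assumptions made for the example.

```python
import os
import tempfile
from functools import partial


def iter_chunks(path, chunk_size=1024 * 1024):
    """Yield successive blocks of at most chunk_size bytes from a file."""
    with open(path, 'rb') as fp:
        # iter() with a sentinel keeps calling fp.read(chunk_size)
        # until it returns b'', which signals the end of the file.
        for chunk in iter(partial(fp.read, chunk_size), b''):
            yield chunk


# Demo on a small throwaway file (a hypothetical stand-in for a large one).
with tempfile.NamedTemporaryFile('wb', suffix='.bin', delete=False) as tmp:
    tmp.write(b'x' * 2500)
    demo_path = tmp.name

chunk_sizes = [len(chunk) for chunk in iter_chunks(demo_path, chunk_size=1000)]
print(chunk_sizes)  # [1000, 1000, 500]

os.remove(demo_path)
```

Only one chunk is held in memory at a time, so the same loop works whether the file is a few kilobytes or many gigabytes.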

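Basic File IO in Python
Read a File Line by Line with readline()
Read a File Line by Line with readlines()
Read a File Line by Line with a for Loop - the Best Approach!
Applications of Reading Files Line by Line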

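Basic File IO in Python

Python is a great general-purpose programming language, and its built-in functions and standard library modules include a lot of useful file IO functionality. The built-in open() function opens a file object for reading or writing. Here is how to use it to open a file: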

fp = open('path/to/file.txt', 'r')

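As shown above, the open() function takes multiple arguments. We will focus on two of them. The first is a positional string argument giving the path to the file you want to open. The second (optional) argument is also a string, and it specifies the mode of interaction you intend to use on the file object returned by the call. The most common modes are listed in the table below; the default is 'r' for reading:

Mode	Description
`r`	Open for reading plain text
`w`	Open for writing plain text (truncates an existing file)
`a`	Open for appending plain text (creates the file if it does not exist)
`rb`	Open for reading binary data
`wb`	Open for writing binary data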
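To make the modes concrete, here is a small sketch that writes, appends to, and then reads back a scratch file. The file name modes_demo.txt and the temporary directory are assumptions made for the demonstration, not part of the original article.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes_demo.txt')

fp = open(path, 'w')    # 'w' creates the file, or truncates an existing one
fp.write('first line\n')
fp.close()

fp = open(path, 'a')    # 'a' adds to the end of the file
fp.write('second line\n')
fp.close()

fp = open(path, 'r')    # text mode returns str
text = fp.read()
fp.close()

fp = open(path, 'rb')   # binary mode returns bytes
raw = fp.read()
fp.close()

print(type(text).__name__, type(raw).__name__)  # str bytes
print(text.splitlines())  # ['first line', 'second line']

os.remove(path)
os.rmdir(tmp_dir)
```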

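Once you have written or read all of the data you need, you should close the file so that the operating system running your code can reclaim the resources.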

fp.close()

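Note: Closing a file object is always good practice, but it is an easy task to forget. You could try to remember to call close() on every file object, but there is a more elegant alternative that opens the file object and ensures the Python interpreter cleans up after use: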

with open('path/to/file.txt') as fp:
    # Do stuff with fp
    pass

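By simply using the with keyword (introduced in Python 2.5) in the code that opens the file object, Python does something similar to the following code. This ensures that the file object is closed after use, no matter what happens inside the block: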

# Open outside the try block: if open() itself fails, there is nothing to close.
fp = open('path/to/file.txt')
try:
    # Do stuff with fp
    pass
finally:
    fp.close()
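A quick way to convince yourself that with cleans up is to raise an error inside the block and then inspect the file object's closed attribute. The sketch below does that on a throwaway file; the simulated ValueError and the demo file are assumptions made for illustration.

```python
import os
import tempfile

# Create a small throwaway file to open (hypothetical demo data).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('hello\n')
    demo_path = tmp.name

try:
    with open(demo_path) as fp:
        first = fp.readline()
        # Simulate something going wrong while processing the file.
        raise ValueError('simulated failure while processing')
except ValueError:
    pass

# The with statement closed the file even though the block raised.
closed_after_with = fp.closed
print(closed_after_with)  # True

os.remove(demo_path)
```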

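Either the with statement or the explicit try/finally form is suitable, though the with statement is more Pythonic.

The file object returned by open() has three common explicit methods for reading data: read(), readline(), and readlines(). The read() method reads all of the data into a single string, which is useful for smaller files where you want to manipulate the text of the whole file at once. The readline() method reads a single line at a time, incrementally, and returns each line as a string. The last explicit method, readlines(), reads all the lines of a file and returns them as a list of strings.

Note: For the remainder of this article we will work with the text of the book "The Iliad of Homer", which can be found at Gutenberg.org as well as in the GitHub repository that holds the code for this article.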
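The difference between the three methods is easiest to see side by side. The sketch below runs each of them against a three-line throwaway file; the file contents and variable names are assumptions chosen for the demonstration.

```python
import os
import tempfile

# A tiny three-line file to read back (hypothetical demo data).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('alpha\nbeta\ngamma\n')
    demo_path = tmp.name

with open(demo_path) as fp:
    whole = fp.read()            # entire file as one string

with open(demo_path) as fp:
    first = fp.readline()        # one line per call, newline included
    second = fp.readline()

with open(demo_path) as fp:
    all_lines = fp.readlines()   # every line, as a list of strings

print(repr(whole))      # 'alpha\nbeta\ngamma\n'
print(repr(first))      # 'alpha\n'
print(repr(second))     # 'beta\n'
print(all_lines)        # ['alpha\n', 'beta\n', 'gamma\n']

os.remove(demo_path)
```

Note that readline() and readlines() keep the trailing newline on each line, which is why the examples in this article call strip() before printing.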

Read a File Line by Line with readline()

Let's start with the readline() method, which reads a single line. To number the lines, we need to keep a counter and increment it ourselves:

filepath = 'Iliad.txt'
with open(filepath) as fp:
    line = fp.readline()
    cnt = 1
    while line:
        print("Line {}: {}".format(cnt, line.strip()))
        line = fp.readline()
        cnt += 1

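This snippet opens a file object whose reference is stored in fp, then reads the file one line at a time by calling readline() on it repeatedly in a while loop, printing each line to the console. The loop ends when readline() returns an empty string, which signals the end of the file. Running this code, you should see something like the following: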

...
Line 567: exceedingly trifling. We have no remaining inscription earlier than the
Line 568: fortieth Olympiad, and the early inscriptions are rude and unskilfully
Line 569: executed; nor can we even assure ourselves whether Archilochus, Simonides
Line 570: of Amorgus, Kallinus, Tyrtaeus, Xanthus, and the other early elegiac and
Line 571: lyric poets, committed their compositions to writing, or at what time the
Line 572: practice of doing so became familiar. The first positive ground which
Line 573: authorizes us to presume the existence of a manuscript of Homer, is in the
Line 574: famous ordinance of Solon, with regard to the rhapsodies at the
Line 575: Panathenaea: but for what length of time previously manuscripts had
Line 576: existed, we are unable to say.
...

Still, this approach is crude and overly explicit, and certainly not very Pythonic. We can use the readlines() method to make the code more concise.
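As an aside, if you do want to stay with readline(), the duplicated readline() call in the loop above can be folded into the loop condition with an assignment expression. This is a sketch that assumes Python 3.8 or later; the helper name numbered_lines and the demo file are hypothetical.

```python
import os
import tempfile


def numbered_lines(path):
    """Return 'Line N: text' strings built with readline() in a while loop."""
    result = []
    with open(path) as fp:
        cnt = 1
        # readline() returns '' only at end of file; a blank line is '\n',
        # which is truthy, so blank lines do not end the loop early.
        while (line := fp.readline()):
            result.append("Line {}: {}".format(cnt, line.strip()))
            cnt += 1
    return result


# Demo file with a blank line in the middle (hypothetical data).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('one\n\nthree\n')
    demo_path = tmp.name

numbered = numbered_lines(demo_path)
print(numbered)  # ['Line 1: one', 'Line 2: ', 'Line 3: three']

os.remove(demo_path)
```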

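Read a File Line by Line with readlines()

The readlines() method reads all the lines and stores them in a list. We can then iterate over that list and use enumerate() to create an index for each line for convenience (note that enumerate() counts from 0 by default):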

file = open('Iliad.txt', 'r')
lines = file.readlines()

for index, line in enumerate(lines):
    print("Line {}: {}".format(index, line.strip()))
    
file.close()

This results in:

...
Line 160: INTRODUCTION.
Line 161:
Line 162:
Line 163: Scepticism is as much the result of knowledge, as knowledge is of
Line 164: scepticism. To be content with what we at present know, is, for the most
Line 165: part, to shut our ears against conviction; since, from the very gradual
Line 166: character of our education, we must continually forget, and emancipate
Line 167: ourselves from, knowledge previously acquired; we must set aside old
Line 168: notions and embrace fresh ones; and, as we learn, we must be daily
Line 169: unlearning something which it has cost us no small labour and anxiety to
Line 170: acquire.
...

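Now, although this is much better, we don't even need to call readlines() to achieve the same result. It is the traditional way of reading a file line by line, but there is a more modern, shorter one.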
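Two details of the readlines() example are worth a second look: enumerate() starts counting at 0, so its line numbers are one lower than those printed by the readline() version, and the file is closed by hand. The sketch below, which assumes you want 1-based numbering, passes start=1 to enumerate() and lets a with block handle the closing; the helper name number_lines and the demo file are hypothetical.

```python
import os
import tempfile


def number_lines(path):
    """Read all lines with readlines() and label them starting from 1."""
    with open(path) as fp:
        lines = fp.readlines()  # the whole file is loaded into a list here
    return ["Line {}: {}".format(index, line.strip())
            for index, line in enumerate(lines, start=1)]


# Small demo file (hypothetical data).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first\nsecond\n')
    demo_path = tmp.name

labelled = number_lines(demo_path)
print(labelled)  # ['Line 1: first', 'Line 2: second']

os.remove(demo_path)
```

Because readlines() builds the full list before the loop starts, this form is best kept for files that fit comfortably in memory.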

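Read a File Line by Line with a for Loop - the Most Pythonic Approach

The file object returned by open() is itself iterable. We don't need to extract the lines via readlines() at all; we can iterate over the returned object directly. This also makes it easy to wrap it in enumerate() so that we can write the line number in each print() statement. This is the shortest, most Pythonic way to solve the problem, and the approach most people favor: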

with open('Iliad.txt') as f:
    for index, line in enumerate(f):
        print("Line {}: {}".format(index, line.strip()))

This results in:

...
Line 277: Mentes, from Leucadia, the modern Santa Maura, who evinced a knowledge and
Line 278: intelligence rarely found in those times, persuaded Melesigenes to close
Line 279: his school, and accompany him on his travels. He promised not only to pay
Line 280: his expenses, but to furnish him with a further stipend, urging, that,
Line 281: "While he was yet young, it was fitting that he should see with his own
Line 282: eyes the countries and cities which might hereafter be the subjects of his
Line 283: discourses." Melesigenes consented, and set out with his patron,
Line 284: "examining all the curiosities of the countries they visited, and
...

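Here, we take advantage of Python's built-in ability to iterate over an iterable object with a simple for loop. If you'd like to read more about Python's built-in tools for iterating over objects, we've got you covered:

Python's Iteration Tools: count(), cycle() and chain()
Python's Iteration Tools: filter(), islice(), map() and zip()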
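Because a file object is a lazy iterator, the iteration tools mentioned above combine naturally with it. As one illustration, itertools.islice() can pull just the first few lines, or a window of lines, without reading the rest of the file. The helper names head and line_window and the demo file are assumptions made for this sketch.

```python
import os
import tempfile
from itertools import islice


def head(path, n):
    """Return the first n lines of a file without reading the rest."""
    with open(path) as fp:
        return [line.rstrip('\n') for line in islice(fp, n)]


def line_window(path, start, stop):
    """Return lines with 0-based index in [start, stop)."""
    with open(path) as fp:
        return [line.rstrip('\n') for line in islice(fp, start, stop)]


# Five-line demo file (hypothetical data).
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(''.join('line {}\n'.format(i) for i in range(5)))
    demo_path = tmp.name

first_two = head(demo_path, 2)
middle = line_window(demo_path, 2, 4)
print(first_two)  # ['line 0', 'line 1']
print(middle)     # ['line 2', 'line 3']

os.remove(demo_path)
```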

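Applications of Reading Files Line by Line

How can you put this to practical use? Most NLP applications deal with large amounts of data, and most of the time it isn't wise to read an entire corpus into memory. Although rudimentary, you can write a from-scratch solution that counts the frequency of certain words without using any external libraries. Let's write a simple script that loads a file, reads it line by line, counts how often each word occurs, and prints the 10 most frequent words along with their counts: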

import sys
import os

def main():
    # Expect the path of the file to analyze as the first argument.
    if len(sys.argv) < 2:
        sys.exit("Usage: python app.py <filepath>")
    filepath = sys.argv[1]
    if not os.path.isfile(filepath):
        print("File path {} does not exist. Exiting...".format(filepath))
        sys.exit(1)

    bag_of_words = {}
    with open(filepath) as fp:
        # Iterate lazily so only one line is held in memory at a time.
        for line in fp:
            record_word_cnt(line.strip().split(' '), bag_of_words)
    sorted_words = order_bag_of_words(bag_of_words, desc=True)
    print("Most frequent 10 words {}".format(sorted_words[:10]))

def order_bag_of_words(bag_of_words, desc=False):
    # Return (word, count) tuples sorted by count.
    words = [(word, cnt) for word, cnt in bag_of_words.items()]
    return sorted(words, key=lambda x: x[1], reverse=desc)

def record_word_cnt(words, bag_of_words):
    # Count words case-insensitively, skipping empty strings left by split(' ').
    for word in words:
        if word != '':
            if word.lower() in bag_of_words:
                bag_of_words[word.lower()] += 1
            else:
                bag_of_words[word.lower()] = 1

if __name__ == '__main__':
    main()

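The script uses the os module to make sure the file we are trying to read actually exists. If it does, the file is read line by line, and each line is passed to the record_word_cnt() function, which splits the line on spaces and adds the words to a dictionary, bag_of_words. Once all the lines have been recorded in the dictionary, we sort it with order_bag_of_words(), which returns a list of tuples in (word, word_count) format, ordered by word count. Finally, we print the ten most frequent words. Typically you would build a bag-of-words model with a library such as NLTK for this, but this implementation is sufficient here. Let's run the script and pass it our Iliad.txt: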

$ python app.py Iliad.txt

This results in:

Most frequent 10 words [('the', 15633), ('and', 6959), ('of', 5237), ('to', 4449), ('his', 3440), ('in', 3158), ('with', 2445), ('a', 2297), ('he', 1635), ('from', 1418)]
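The counting and sorting logic in the script can also be expressed with collections.Counter from the standard library. The sketch below is an alternative formulation, not the script that produced the output above; the function name top_words and the sample lines are assumptions made for illustration.

```python
from collections import Counter


def top_words(lines, n=10):
    """Count lowercase words across an iterable of lines; return the n most common."""
    counts = Counter()
    for line in lines:
        # Same tokenization as the script: split on single spaces and
        # drop the empty strings produced by consecutive spaces.
        counts.update(word.lower()
                      for word in line.strip().split(' ')
                      if word != '')
    return counts.most_common(n)


# Any iterable of lines works, including an open file object:
#     with open(filepath) as fp:
#         print(top_words(fp))
sample = ["The wrath of Achilles", "the will of Zeus", "the  end"]
top_two = top_words(sample, 2)
print(top_two)  # [('the', 3), ('of', 2)]
```

Because top_words() accepts any iterable of strings, passing it an open file keeps the line-by-line memory behavior of the original script.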

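If you'd like to read more about NLP, we have a series of guides on various tasks: Natural Language Processing in Python.

Conclusion

In this article, we explored several ways to read a file line by line in Python, and built a rudimentary bag-of-words model to count the frequency of words in a given file.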
