How to find all files in a directory in Python

Preliminary notes

  • Although there’s a clear differentiation between file and directory terms in the question text, some may argue that directories are actually special files

  • The statement: «all files of a directory» can be interpreted in two ways:

    1. All direct (or level 1) descendants only

    2. All descendants in the whole directory tree (including the ones in sub-directories)

  • When the question was asked, I imagine that Python 2 was the mainstream version; however, the code samples will be run by Python 3(.5) (I’ll keep them as Python 2 compliant as possible; also, any code belonging to Python that I’m going to post is from v3.5.4 — unless otherwise specified).
    That has consequences related to another keyword in the question: «add them into a list»:

    • In pre Python 2.2 versions, sequences (iterables) were mostly represented by lists (tuples, sets, …)

    • In Python 2.2, the concept of generator ([Python.Wiki]: Generators), courtesy of [Python.Docs]: Simple statements — The yield statement, was introduced. As time passed, generator counterparts started to appear for functions that returned / worked with lists

    • In Python 3, the generator (iterator) behavior is the default for many of these functions (e.g. map, filter, zip return iterators rather than lists)

    • Not sure whether returning a list is still mandatory (or whether a generator would do as well), but passing a generator to the list constructor will create a list out of it (and also consume it). The example below illustrates the differences on [Python.Docs]: Built-in functions — map(function, iterable, *iterables)

    >>> import sys
    >>>
    >>> sys.version
    '2.7.10 (default, Mar  8 2016, 15:02:46) [MSC v.1600 64 bit (AMD64)]'
    >>> m = map(lambda x: x, [1, 2, 3])  # Just a dummy lambda function
    >>> m, type(m)
    ([1, 2, 3], <type 'list'>)
    >>> len(m)
    3
    

    >>> import sys
    >>>
    >>> sys.version
    '3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)]'
    >>> m = map(lambda x: x, [1, 2, 3])
    >>> m, type(m)
    (<map object at 0x000001B4257342B0>, <class 'map'>)
    >>> len(m)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: object of type 'map' has no len()
    >>> lm0 = list(m)  # Build a list from the generator
    >>> lm0, type(lm0)
    ([1, 2, 3], <class 'list'>)
    >>>
    >>> lm1 = list(m)  # Build a list from the same generator
    >>> lm1, type(lm1)  # Empty list now - generator already exhausted
    ([], <class 'list'>)
    
  • The examples will be based on a directory called root_dir with the following structure (this example is for Win, but I’m using the same tree on Nix as well). Note that I’ll be reusing the console:

    [cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q003207219]> sopr.bat
    ### Set shorter prompt to better fit when pasted in StackOverflow (or other) pages ###
    
    [prompt]> 
    [prompt]> tree /f "root_dir"
    Folder PATH listing for volume Work
    Volume serial number is 00000029 3655:6FED
    E:\WORK\DEV\STACKOVERFLOW\Q003207219\ROOT_DIR
    |   file0
    |   file1
    |
    +---dir0
    |   +---dir00
    |   |   |   file000
    |   |   |
    |   |   +---dir000
    |   |           file0000
    |   |
    |   +---dir01
    |   |       file010
    |   |       file011
    |   |
    |   +---dir02
    |       +---dir020
    |           +---dir0200
    +---dir1
    |       file10
    |       file11
    |       file12
    |
    +---dir2
    |   |   file20
    |   |
    |   +---dir20
    |           file200
    |
    +---dir3
    

Solutions

Programmatic approaches

1. [Python.Docs]: os.listdir(path='.')

Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order, and does not include the special entries '.' and '..'

>>> import os
>>>
>>> root_dir = "root_dir"  # Path relative to current dir (os.getcwd())
>>>
>>> os.listdir(root_dir)  # List all the items in root_dir
['dir0', 'dir1', 'dir2', 'dir3', 'file0', 'file1']
>>>
>>> [item for item in os.listdir(root_dir) if os.path.isfile(os.path.join(root_dir, item))]  # Filter items and only keep files (strip out directories)
['file0', 'file1']

A more elaborate example (code_os_listdir.py):

#!/usr/bin/env python

import os
import sys
from pprint import pformat as pf


def _get_dir_content(path, include_folders, recursive):
    entries = os.listdir(path)
    for entry in entries:
        entry_with_path = os.path.join(path, entry)
        if os.path.isdir(entry_with_path):
            if include_folders:
                yield entry_with_path
            if recursive:
                for sub_entry in _get_dir_content(entry_with_path, include_folders, recursive):
                    yield sub_entry
        else:
            yield entry_with_path


def get_dir_content(path, include_folders=True, recursive=True, prepend_folder_name=True):
    path_len = len(path) + len(os.path.sep)
    for item in _get_dir_content(path, include_folders, recursive):
        yield item if prepend_folder_name else item[path_len:]


def _get_dir_content_old(path, include_folders, recursive):
    entries = os.listdir(path)
    ret = list()
    for entry in entries:
        entry_with_path = os.path.join(path, entry)
        if os.path.isdir(entry_with_path):
            if include_folders:
                ret.append(entry_with_path)
            if recursive:
                ret.extend(_get_dir_content_old(entry_with_path, include_folders, recursive))
        else:
            ret.append(entry_with_path)
    return ret


def get_dir_content_old(path, include_folders=True, recursive=True, prepend_folder_name=True):
    path_len = len(path) + len(os.path.sep)
    return [item if prepend_folder_name else item[path_len:] for item in _get_dir_content_old(path, include_folders, recursive)]


def main(*argv):
    root_dir = "root_dir"
    ret0 = get_dir_content(root_dir, include_folders=True, recursive=True, prepend_folder_name=True)
    lret0 = list(ret0)
    print("{:} {:d}\n{:s}".format(ret0, len(lret0), pf(lret0)))
    ret1 = get_dir_content_old(root_dir, include_folders=False, recursive=True, prepend_folder_name=False)
    print("\n{:d}\n{:s}".format(len(ret1), pf(ret1)))


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.\n")
    sys.exit(rc)

Notes:

  • There are two implementations:

    1. One that uses generators (of course here it seems useless, since I immediately convert the result to a list)

    2. The classic one (function names ending in _old)

  • Recursion is used (to get into subdirectories)

  • For each implementation there are two functions:

    • One that starts with an underscore (_): «private» (should not be called directly) — that does all the work

    • The public one (wrapper over previous): it just strips off the initial path (if required) from the returned entries. It’s an ugly implementation, but it’s the only idea that I could come with at this point

  • In terms of performance, generators are generally a little bit faster (considering both creation and iteration times), but I didn’t test them in recursive functions, and I am also iterating inside the function over inner generators — I don’t know how performance friendly that is

  • Play with the arguments to get different results

Output:

[prompt]> "e:\Work\Dev\VEnvs\py_pc064_03.05.04_test0\Scripts\python.exe" ".\code_os_listdir.py"
Python 3.5.4 (v3.5.4:3f56838, Aug  8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] 064bit on win32

<generator object get_dir_content at 0x000002C080418F68> 22
['root_dir\\dir0',
 'root_dir\\dir0\\dir00',
 'root_dir\\dir0\\dir00\\dir000',
 'root_dir\\dir0\\dir00\\dir000\\file0000',
 'root_dir\\dir0\\dir00\\file000',
 'root_dir\\dir0\\dir01',
 'root_dir\\dir0\\dir01\\file010',
 'root_dir\\dir0\\dir01\\file011',
 'root_dir\\dir0\\dir02',
 'root_dir\\dir0\\dir02\\dir020',
 'root_dir\\dir0\\dir02\\dir020\\dir0200',
 'root_dir\\dir1',
 'root_dir\\dir1\\file10',
 'root_dir\\dir1\\file11',
 'root_dir\\dir1\\file12',
 'root_dir\\dir2',
 'root_dir\\dir2\\dir20',
 'root_dir\\dir2\\dir20\\file200',
 'root_dir\\dir2\\file20',
 'root_dir\\dir3',
 'root_dir\\file0',
 'root_dir\\file1']

11
['dir0\\dir00\\dir000\\file0000',
 'dir0\\dir00\\file000',
 'dir0\\dir01\\file010',
 'dir0\\dir01\\file011',
 'dir1\\file10',
 'dir1\\file11',
 'dir1\\file12',
 'dir2\\dir20\\file200',
 'dir2\\file20',
 'file0',
 'file1']

Done.

2. [Python.Docs]: os.scandir(path='.')

Python 3.5+; for older versions there is a backport: [PyPI]: scandir:

Return an iterator of os.DirEntry objects corresponding to the entries in the directory given by path. The entries are yielded in arbitrary order, and the special entries '.' and '..' are not included.

Using scandir() instead of listdir() can significantly increase the performance of code that also needs file type or file attribute information, because os.DirEntry objects expose this information if the operating system provides it when scanning a directory. All os.DirEntry methods may perform a system call, but is_dir() and is_file() usually only require a system call for symbolic links; os.DirEntry.stat() always requires a system call on Unix but only requires one for symbolic links on Windows.

>>> import os
>>>
>>> root_dir = os.path.join(".", "root_dir")  # Explicitly prepending current directory
>>> root_dir
'.\\root_dir'
>>>
>>> scandir_iterator = os.scandir(root_dir)
>>> scandir_iterator
<nt.ScandirIterator object at 0x00000268CF4BC140>
>>> [item.path for item in scandir_iterator]
['.\\root_dir\\dir0', '.\\root_dir\\dir1', '.\\root_dir\\dir2', '.\\root_dir\\dir3', '.\\root_dir\\file0', '.\\root_dir\\file1']
>>>
>>> [item.path for item in scandir_iterator]  # Will yield an empty list as it was consumed by previous iteration (automatically performed by the list comprehension)
[]
>>>
>>> scandir_iterator = os.scandir(root_dir)  # Reinitialize the generator
>>> for item in scandir_iterator:
...     if os.path.isfile(item.path):
...             print(item.name)
...
file0
file1

Notes:

  • Similar to os.listdir

  • But it’s also more flexible (and offers more functionality), more Pythonic (and in some cases, faster)
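
That flexibility also makes a manual (recursive) walk short to write; a minimal sketch built on os.scandir (the function name scan_tree is mine, not part of the answer's code):

```python
import os


def scan_tree(path):
    # Recursively yield os.DirEntry objects for every descendant of path
    for entry in os.scandir(path):
        yield entry
        if entry.is_dir(follow_symlinks=False):
            # Recurse into subdirectories (symlinks skipped to avoid loops)
            yield from scan_tree(entry.path)
```

For example, [e.path for e in scan_tree("root_dir") if e.is_file()] would yield the file paths of the whole tree above.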

3. [Python.Docs]: os.walk(top, topdown=True, onerror=None, followlinks=False)

Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).

>>> import os
>>>
>>> root_dir = os.path.join(os.getcwd(), "root_dir")  # Specify the full path
>>> root_dir
'E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir'
>>>
>>> walk_generator = os.walk(root_dir)
>>> root_dir_entry = next(walk_generator)  # First entry corresponds to the root dir (passed as an argument)
>>> root_dir_entry
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir', ['dir0', 'dir1', 'dir2', 'dir3'], ['file0', 'file1'])
>>>
>>> root_dir_entry[1] + root_dir_entry[2]  # Display dirs and files (direct descendants) in a single list
['dir0', 'dir1', 'dir2', 'dir3', 'file0', 'file1']
>>>
>>> [os.path.join(root_dir_entry[0], item) for item in root_dir_entry[1] + root_dir_entry[2]]  # Display all the entries in the previous list by their full path
['E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir0', 'E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir1', 'E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir2', 'E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir3', 'E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\file0', 'E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\file1']
>>>
>>> for entry in walk_generator:  # Display the rest of the elements (corresponding to every subdir)
...     print(entry)
...
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir0', ['dir00', 'dir01', 'dir02'], [])
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir0\\dir00', ['dir000'], ['file000'])
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir0\\dir00\\dir000', [], ['file0000'])
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir0\\dir01', [], ['file010', 'file011'])
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir0\\dir02', ['dir020'], [])
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir0\\dir02\\dir020', ['dir0200'], [])
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir0\\dir02\\dir020\\dir0200', [], [])
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir1', [], ['file10', 'file11', 'file12'])
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir2', ['dir20'], ['file20'])
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir2\\dir20', [], ['file200'])
('E:\\Work\\Dev\\StackOverflow\\q003207219\\root_dir\\dir3', [], [])

Notes:

  • Under the hood, it uses os.scandir (os.listdir on older Python versions)

  • It does the heavy lifting by recursing into subfolders
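
Since the question ultimately asks for the files in a list, the yielded triples can be flattened in a single comprehension; a minimal sketch (the helper name is mine):

```python
import os


def get_all_files(root):
    # Flatten os.walk's (dirpath, dirnames, filenames) triples into one flat
    # list of file paths (directories themselves are not included)
    return [os.path.join(dirpath, name)
            for dirpath, _dirnames, filenames in os.walk(root)
            for name in filenames]
```

Called on the root_dir tree above, it returns the same 11 file paths as get_dir_content_old with include_folders=False (prefixed with root_dir).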

4. [Python.Docs]: glob.glob(pathname, *, root_dir=None, dir_fd=None, recursive=False, include_hidden=False)

Or glob.iglob:

Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification. pathname can be either absolute (like /usr/src/Python-1.5/Makefile) or relative (like ../../Tools/*/*.gif), and can contain shell-style wildcards. Broken symlinks are included in the results (as in the shell).

Changed in version 3.5: Support for recursive globs using “**”.

>>> import glob, os
>>>
>>> wildcard_pattern = "*"
>>> root_dir = os.path.join("root_dir", wildcard_pattern)  # Match every file/dir name
>>> root_dir
'root_dir\\*'
>>>
>>> glob_list = glob.glob(root_dir)
>>> glob_list
['root_dir\\dir0', 'root_dir\\dir1', 'root_dir\\dir2', 'root_dir\\dir3', 'root_dir\\file0', 'root_dir\\file1']
>>>
>>> [item.replace("root_dir" + os.path.sep, "") for item in glob_list]  # Strip the dir name and the path separator from the beginning
['dir0', 'dir1', 'dir2', 'dir3', 'file0', 'file1']
>>>
>>> for entry in glob.iglob(root_dir + "*", recursive=True):
...     print(entry)
...
root_dir
root_dir\dir0
root_dir\dir0\dir00
root_dir\dir0\dir00\dir000
root_dir\dir0\dir00\dir000\file0000
root_dir\dir0\dir00\file000
root_dir\dir0\dir01
root_dir\dir0\dir01\file010
root_dir\dir0\dir01\file011
root_dir\dir0\dir02
root_dir\dir0\dir02\dir020
root_dir\dir0\dir02\dir020\dir0200
root_dir\dir1
root_dir\dir1\file10
root_dir\dir1\file11
root_dir\dir1\file12
root_dir\dir2
root_dir\dir2\dir20
root_dir\dir2\dir20\file200
root_dir\dir2\file20
root_dir\dir3
root_dir\file0
root_dir\file1

Notes:

  • Uses os.listdir

  • For large trees (especially if recursive is on), iglob is preferred

  • Allows advanced filtering based on name (due to the wildcard)
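
Combining the recursive ** pattern with an os.path.isfile check keeps files only; a small sketch (the pattern and helper name are illustrations, not code from above):

```python
import glob
import os


def glob_files(root):
    # "**" (with recursive=True) matches any number of directory levels,
    # including zero, so top-level files are matched too; the isfile filter
    # drops the directories that the wildcard also matches
    pattern = os.path.join(root, "**", "*")
    return sorted(p for p in glob.iglob(pattern, recursive=True)
                  if os.path.isfile(p))
```

Note that glob skips entries starting with a dot unless the pattern does too.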

5. [Python.Docs]: class pathlib.Path(*pathsegments)

Python 3.4+, backport: [PyPI]: pathlib2.

>>> import os, pathlib
>>>
>>> root_dir = "root_dir"
>>> root_dir_instance = pathlib.Path(root_dir)
>>> root_dir_instance
WindowsPath('root_dir')
>>> root_dir_instance.name
'root_dir'
>>> root_dir_instance.is_dir()
True
>>>
>>> [item.name for item in root_dir_instance.glob("*")]  # Wildcard searching for all direct descendants
['dir0', 'dir1', 'dir2', 'dir3', 'file0', 'file1']
>>>
>>> [os.path.join(item.parent.name, item.name) for item in root_dir_instance.glob("*") if not item.is_dir()]  # Display paths (including parent) for files only
['root_dir\\file0', 'root_dir\\file1']

Notes:

  • This is one way of achieving our goal

  • It’s the OOP style of handling paths

  • Offers lots of functionalities
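
Among those functionalities, Path.rglob covers the recursive case without building a "**" pattern by hand; a short sketch (helper name is mine):

```python
import pathlib


def list_files(root):
    # rglob("*") is equivalent to glob("**/*"): it visits every descendant;
    # keep regular files only
    return sorted(str(p) for p in pathlib.Path(root).rglob("*") if p.is_file())
```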

6. [Python 2.Docs]: dircache.listdir(path)

  • Python 2 only

  • But, according to [GitHub]: python/cpython — (2.7) cpython/Lib/dircache.py, it’s just a (thin) wrapper over os.listdir with caching

def listdir(path):
    """List directory contents, using cache."""
    try:
        cached_mtime, list = cache[path]
        del cache[path]
    except KeyError:
        cached_mtime, list = -1, []
    mtime = os.stat(path).st_mtime
    if mtime != cached_mtime:
        list = os.listdir(path)
        list.sort()
    cache[path] = mtime, list
    return list
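
The same mtime-based invalidation can be reproduced on Python 3 in a few lines; a sketch of the mechanism (not a drop-in dircache replacement):

```python
import os

_cache = {}  # path -> (mtime, sorted entries)


def cached_listdir(path):
    # Re-list the directory only when its modification time changed
    mtime = os.stat(path).st_mtime
    cached = _cache.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    entries = sorted(os.listdir(path))
    _cache[path] = (mtime, entries)
    return entries
```

Same caveat as dircache itself: a change that doesn't bump the directory's mtime (rare, but possible depending on FS timestamp granularity) would serve stale results.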

7. Native OS APIs

POSIX specific:

  • [Man7]: OPENDIR(3)

  • [Man7]: READDIR(3)

  • [Man7]: CLOSEDIR(3)

Available via [Python.Docs]: ctypes — A foreign function library for Python:

ctypes is a foreign function library for Python. It provides C compatible data types, and allows calling functions in DLLs or shared libraries. It can be used to wrap these libraries in pure Python.

Not directly related, but check [SO]: C function called from Python via ctypes returns incorrect value (@CristiFati’s answer) out before working with CTypes.

code_ctypes.py:

#!/usr/bin/env python3

import ctypes as cts
import sys


DT_DIR = 4
DT_REG = 8


class NixDirent64(cts.Structure):
    _fields_ = (
        ("d_ino", cts.c_ulonglong),
        ("d_off", cts.c_longlong),
        ("d_reclen", cts.c_ushort),
        ("d_type", cts.c_ubyte),
        ("d_name", cts.c_char * 256),
    )

NixDirent64Ptr = cts.POINTER(NixDirent64)


libc = this_process = cts.CDLL(None, use_errno=True)

opendir = libc.opendir
opendir.argtypes = (cts.c_char_p,)
opendir.restype = cts.c_void_p
readdir = libc.readdir
readdir.argtypes = (cts.c_void_p,)
readdir.restype = NixDirent64Ptr
closedir = libc.closedir
closedir.argtypes = (cts.c_void_p,)


def get_dir_content(path):
    ret = [path, [], []]
    pdir = opendir(cts.create_string_buffer(path.encode()))
    if not pdir:
        print("opendir returned NULL (errno: {:d})".format(cts.get_errno()))
        return ret
    cts.set_errno(0)
    while True:
        pdirent = readdir(pdir)
        if not pdirent:
            break
        dirent = pdirent.contents
        name = dirent.d_name.decode()
        if dirent.d_type & DT_DIR:
            if name not in (".", ".."):
                ret[1].append(name)
        elif dirent.d_type & DT_REG:
            ret[2].append(name)
    if cts.get_errno():
        print("readdir returned NULL (errno: {:d})".format(cts.get_errno()))
    closedir(pdir)
    return ret


def main(*argv):
    root_dir = "root_dir"
    entries = get_dir_content(root_dir)
    print("Entries:\n{:}".format(entries))


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.\n")
    sys.exit(rc)

Notes:

  • It loads the three functions from LibC (libc.so — loaded in the current process) and calls them (for more details check [SO]: How do I check whether a file exists without exceptions? (@CristiFati’s answer) — last notes from item #4.). That would place this approach very close to the Python / C edge

  • NixDirent64 is the CTypes representation of struct dirent64 from [Man7]: dirent.h(0P) (so are the DT_ constants) from my Ubuntu OS. On other flavors / versions, the structure definition might differ, and if so, the CTypes alias should be updated, otherwise it will yield Undefined Behavior

  • It returns data in os.walk's format. I didn’t bother to make it recursive, but starting from the existing code, that would be a fairly trivial task

  • Everything is doable on Win as well, the data (libraries, functions, structs, constants, …) differ

Output:

[cfati@cfati-5510-0:/mnt/e/Work/Dev/StackOverflow/q003207219]> python3.5 ./code_ctypes.py
Python 3.5.10 (default, Jan 15 2022, 19:53:00) [GCC 9.3.0] 064bit on linux

Entries:
['root_dir', ['dir0', 'dir1', 'dir2', 'dir3'], ['file0', 'file1']]

Done.
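
As noted, making it recursive mostly means re-walking the returned triple; a sketch of such a wrapper. The backend parameter stands for any function with get_dir_content's (path, dirs, files) return shape; a portable os.listdir-based stand-in is used here so the sketch runs anywhere (both helper names are mine):

```python
import os


def simple_dir_content(path):
    # Portable stand-in with the same [path, dirs, files] return shape as the
    # ctypes-based get_dir_content above
    dirs, files = [], []
    for name in os.listdir(path):
        (dirs if os.path.isdir(os.path.join(path, name)) else files).append(name)
    return [path, dirs, files]


def walk_tree(path, backend=simple_dir_content):
    # Yield one triple per directory, recursing into every subdirectory
    entry = backend(path)
    yield entry
    for dir_name in entry[1]:
        yield from walk_tree(os.path.join(path, dir_name), backend)
```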

8. [TimGolden]: win32file.FindFilesW

Win specific:

Retrieves a list of matching filenames, using the Windows Unicode API. An interface to the API FindFirstFileW / FindNextFileW / FindClose functions.

>>> import os, win32file as wfile, win32con as wcon
>>>
>>> root_dir = "root_dir"
>>> root_dir_wildcard = os.path.join(root_dir, "*")
>>> entry_list = wfile.FindFilesW(root_dir_wildcard)
>>> len(entry_list)  # Don't display the whole content as it's too long
8
>>> [entry[-2] for entry in entry_list]  # Only display the entry names
['.', '..', 'dir0', 'dir1', 'dir2', 'dir3', 'file0', 'file1']
>>>
>>> [entry[-2] for entry in entry_list if entry[0] & wcon.FILE_ATTRIBUTE_DIRECTORY and entry[-2] not in (".", "..")]  # Filter entries and only display dir names (except self and parent)
['dir0', 'dir1', 'dir2', 'dir3']
>>>
>>> [os.path.join(root_dir, entry[-2]) for entry in entry_list if entry[0] & (wcon.FILE_ATTRIBUTE_NORMAL | wcon.FILE_ATTRIBUTE_ARCHIVE)]  # Only display file "full" names
['root_dir\\file0', 'root_dir\\file1']

Notes:

  • win32file.FindFilesW is part of [GitHub]: mhammond/pywin32 — Python for Windows (pywin32) Extensions, which is a Python wrapper over WinAPIs

9. Use some (other) 3rd-party package that does the trick

Most likely, it will rely on one (or more) of the above (maybe with slight customizations).

Notes:

  • Code is meant to be portable (except places that target a specific area — which are marked) or cross:

    • OS (Nix, Win, ...)

    • Python version (2, 3, ...)

  • Multiple path styles (absolute, relative) were used across the above variants, to illustrate the fact that the «tools» used are flexible in this regard

  • os.listdir and os.scandir use opendir / readdir / closedir on Nix, and [MS.Learn]: FindFirstFileW function (fileapi.h) / [MS.Learn]: FindNextFileW function (fileapi.h) / [MS.Learn]: FindClose function (fileapi.h) on Win (via [GitHub]: python/cpython — (main) cpython/Modules/posixmodule.c)

  • win32file.FindFilesW uses those (Win specific) functions as well (via [GitHub]: mhammond/pywin32 — (main) pywin32/win32/src/win32file.i)

  • _get_dir_content (from point #1.) can be implemented using any of these approaches (some will require more work and some less)

    • Some advanced filtering (instead of just file vs. dir) could be done: e.g. the include_folders argument could be replaced by another one (e.g. filter_func), which would be a function that takes a path as an argument: filter_func=lambda x: True (this doesn’t strip out anything), and inside _get_dir_content something like: if not filter_func(entry_with_path): continue (if the function fails for one entry, it will be skipped). However, the more complex the code becomes, the longer it will take to execute
  • Nota Bene! Since recursion is used, I must mention that I did some tests on my laptop (Win 10 pc064), totally unrelated to this problem, and when the recursion level was reaching values somewhere in the (990 .. 1000) range (recursionlimit — 1000 (default)), I got StackOverflow :). If the directory tree exceeds that limit (I am not an FS expert, so I don’t know if that is even possible), that could be a problem.
    I must also mention that I didn’t try to increase recursionlimit, but in theory there will always be the possibility for failure, if the dir depth is larger than the highest possible recursionlimit (on that machine).
    Check [SO]: _csv.Error: field larger than field limit (131072) (@CristiFati’s answer) for more details on the topic

  • Code samples are for demonstrative purposes only. That means that I didn’t take into account error handling (I don’t think there’s any try / except / else / finally block), so the code is not robust (the reason is: to keep it as simple and short as possible). For production, error handling should be added as well
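
The filter_func idea suggested in the notes above can be sketched like this (the body is my assumption of how it might look, not the answer's actual code, and it keeps the no-error-handling spirit of the other samples):

```python
import os


def get_dir_content(path, filter_func=lambda p: True, recursive=True):
    # Yield only the entries for which filter_func returns a truthy value;
    # recursion is independent of the filter, so filtered-out directories
    # are still descended into
    for entry in os.listdir(path):
        entry_with_path = os.path.join(path, entry)
        if filter_func(entry_with_path):
            yield entry_with_path
        if recursive and os.path.isdir(entry_with_path):
            yield from get_dir_content(entry_with_path, filter_func, recursive)
```

For example, get_dir_content("root_dir", filter_func=os.path.isfile) keeps files only, reproducing include_folders=False.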

Other approaches:

1. Use Python only as a wrapper

  • Everything is done using another technology

  • That technology is invoked from Python

  • The most famous flavor that I know is what I call the SysAdmin approach:

    • Use Python (or any programming language for that matter) in order to execute Shell commands (and parse their outputs)

    • Some consider this a neat hack

    • I consider it more like a lame workaround (gainarie), as the action per se is performed from Shell (Cmd in this case), and thus doesn’t have anything to do with Python

    • Filtering (grep / findstr) or output formatting could be done on both sides, but I’m not going to insist on it. Also, I deliberately used os.system instead of [Python.Docs]: subprocess — Subprocess management routines (run, check_output, …)

[prompt]> "e:\Work\Dev\VEnvs\py_pc064_03.05.04_test0\Scripts\python.exe" -c "import os;os.system(\"dir /b root_dir\")"
dir0
dir1
dir2
dir3
file0
file1

[cfati@cfati-5510-0:/mnt/e/Work/Dev/StackOverflow/q003207219]> python3.5 -c "import os;os.system(\"ls root_dir\")"
dir0  dir1  dir2  dir3  file0  file1

In general, this approach is to be avoided, since if some command output format slightly differs between OS versions / flavors, the parsing code should be adapted as well — not to mention differences between locales.
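
For completeness, here is what the subprocess counterpart might look like; a sketch only ("ls" assumed available, i.e. Nix; on Win one would go through cmd's dir builtin, which is not shown here):

```python
import subprocess


def shell_listing(path):
    # Unlike os.system, the output is captured instead of going to the console,
    # and a non-zero exit code raises CalledProcessError instead of being
    # silently ignored
    out = subprocess.check_output(["ls", path])
    # Naive parsing: split() breaks on file names containing whitespace,
    # which is exactly the kind of fragility argued against above
    return out.decode().split()
```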

I highly recommend this path module, written by Jason Orendorff:

http://pypi.python.org/pypi/path.py/2.2

Unfortunately, his website is down now, but you can still download from the above link (or through easy_install, if you prefer).

Using this path module, you can do various actions on paths, including the walking files you requested. Here’s an example:

from path import path

my_path = path('.')

for file in my_path.walkfiles():
    print file

for file in my_path.walkfiles('*.pdf'):
    print file

There are also convenience functions for many other things to do with paths:

In [1]: from path import path

In [2]: my_dir = path('my_dir')

In [3]: my_file = path('readme.txt')

In [5]: print my_dir / my_file
my_dir/readme.txt

In [6]: joined_path = my_dir / my_file

In [7]: print joined_path
my_dir/readme.txt

In [8]: print joined_path.parent
my_dir

In [9]: print joined_path.name
readme.txt

In [10]: print joined_path.namebase
readme

In [11]: print joined_path.ext
.txt

In [12]: joined_path.copy('some_output_path.txt')

In [13]: print path('some_output_path.txt').isfile()
True

In [14]: print path('some_output_path.txt').isdir()
False

There are more operations that can be done too, but these are some of the ones that I use most often. Notice that the path class inherits from string, so it can be used wherever a string is used. Also, notice that two or more path objects can easily be joined together by using the overridden / operator.

Hope this helps!

This post will discuss how to iterate over the files in a directory in Python.

1. Using the os.listdir() function

A simple solution to iterate over the files in a directory is to use the os.listdir() function. It returns a list of the files and subdirectories present in the specified directory. To get only the files, you can filter the list with the os.path.isfile() function:

import os

directory = 'path/to/dir'

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isfile(f):
        print(f)

 
To get files of a specific extension, say .txt, you can add a condition checking the file extension:

import os

directory = 'path/to/dir'

for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    if os.path.isfile(f) and filename.endswith('.txt'):
        print(f)

2. Using the os.scandir() function

Starting with Python 3.5, consider using the os.scandir() function when you need file type or file attribute information. It returns directory entries along with file attribute information, offering significantly better performance compared to os.listdir().

import os

directory = 'path/to/dir'

for entry in os.scandir(directory):
    if entry.is_file() and entry.name.endswith('.txt'):
        print(entry.path)

3. Using the pathlib module

Since Python 3.4, you can also use the pathlib module. To iterate over the files in a directory, use the Path.glob(pattern) function, which globs the given relative pattern in the specified directory and yields the matching files.

The following example shows how to filter and display the text files present in a directory:

from pathlib import Path

directory = 'path/to/dir'

pathlist = Path(directory).glob('*.txt')
for path in pathlist:
    print(path)

 
Alternatively, you can use the Path.iterdir() function, which returns path objects for the directory's contents. To get the file extension, use the suffix property:

from pathlib import Path

directory = 'path/to/dir'

for path in Path(directory).iterdir():
    if path.is_file() and path.suffix == '.txt':
        print(path)

4. Using the os.walk() function

If you also need to search subdirectories, consider using the os.walk() function. It yields a 3-tuple (dirpath, dirnames, filenames) for everything reachable from the specified directory, where dirpath is the path to the directory, dirnames is the list of the names of the subdirectories in dirpath, and filenames is the list of the names of the non-directory files in dirpath.

import os

directory = 'path/to/dir'

for root, dirs, files in os.walk(directory):
    for file in files:
        if file.endswith('.txt'):
            print(os.path.join(root, file))

Starting with Python 3.5, os.walk() calls os.scandir() instead of os.listdir(), which makes it faster by reducing the total number of calls to os.stat().

5. Using the glob module

Finally, you can use the glob.iglob function, which returns an iterator over the list of paths matching the specified pattern.

import glob

directory = 'path/to/dir'

for path in glob.iglob(f'{directory}/*.txt'):
    print(path)

 
Python 3.5 extended support for recursive globs using **, which allows you to search subdirectories and symbolic links to directories:

import glob

directory = 'path/to/dir'

for path in glob.iglob(f'{directory}/**/*.txt', recursive=True):
    print(path)

That's all about iterating over the files in a directory in Python.

Getting a list of all the files and folders in a directory is a natural first step for many file-related operations in Python. When looking into it, though, you may be surprised to find various ways to go about it.

When you’re faced with many ways of doing something, it can be a good indication that there’s no one-size-fits-all solution to your problems. Most likely, every solution will have its own advantages and trade-offs. This is the case when it comes to getting a list of the contents of a directory in Python.

In this tutorial, you’ll be focusing on the most general-purpose techniques in the pathlib module to list items in a directory, but you’ll also learn a bit about some alternative tools.

Before pathlib came out in Python 3.4, if you wanted to work with file paths, then you’d use the os module. While this was very efficient in terms of performance, you had to handle all the paths as strings.

Handling paths as strings may seem okay at first, but once you start bringing multiple operating systems into the mix, things get more tricky. You also end up with a bunch of code related to string manipulation, which can get very abstracted from what a file path is. Things can get cryptic pretty quickly.

That’s not to say that working with paths as strings isn’t feasible—after all, developers managed fine without pathlib for many years! The pathlib module just takes care of a lot of the tricky stuff and lets you focus on the main logic of your code.

It all begins with creating a Path object, which will be different depending on your operating system (OS). On Windows, you’ll get a WindowsPath object, while Linux and macOS will return PosixPath:

  • Windows
  • Linux + macOS


>>> import pathlib
>>> desktop = pathlib.Path("C:/Users/RealPython/Desktop")
>>> desktop
WindowsPath('C:/Users/RealPython/Desktop')


>>> import pathlib
>>> desktop = pathlib.Path("/home/RealPython/Desktop")
>>> desktop
PosixPath('/home/RealPython/Desktop')

With these OS-aware objects, you can take advantage of the many methods and properties available, such as ones to get a list of files and folders.

Now, it’s time to dive into listing folder contents. Be aware that there are several ways to do this, and picking the right one will depend on your specific use case.

Getting a List of All Files and Folders in a Directory in Python

Before getting started on listing, you’ll want a set of files that matches what you’ll encounter in this tutorial. In the supplementary materials, you’ll find a folder called Desktop. If you plan to follow along, download this folder, navigate to its parent folder, and start your Python REPL there.

You could also use your own desktop. Just start the Python REPL in the parent directory of your desktop, and the examples should work, but you’ll have your own files in the output instead.

If you only need to list the contents of a given directory, and you don’t need to get the contents of each subdirectory too, then you can use the Path object’s .iterdir() method. If your aim is to move through directories and subdirectories recursively, then you can jump ahead to the section on recursive listing.

The .iterdir() method, when called on a Path object, returns a generator that yields Path objects representing child items. If you wrap the generator in a list() constructor, then you can see your list of files and folders:

>>>

>>> import pathlib
>>> desktop = pathlib.Path("Desktop")

>>> # .iterdir() produces a generator
>>> desktop.iterdir()
<generator object Path.iterdir at 0x000001A8A5110740>

>>> # Which you can wrap in a list() constructor to materialize
>>> list(desktop.iterdir())
[WindowsPath('Desktop/Notes'),
 WindowsPath('Desktop/realpython'),
 WindowsPath('Desktop/scripts'),
 WindowsPath('Desktop/todo.txt')]

Passing the generator produced by .iterdir() to the list() constructor provides you with a list of Path objects representing all the items in the Desktop directory.

As with all generators, you can also use a for loop to iterate over each item that the generator yields. This gives you the chance to explore some of the properties of each object:

>>>

>>> desktop = pathlib.Path("Desktop")
>>> for item in desktop.iterdir():
...     print(f"{item} - {'dir' if item.is_dir() else 'file'}")
...
Desktop\Notes - dir
Desktop\realpython - dir
Desktop\scripts - dir
Desktop\todo.txt - file

Within the for loop body, you use an f-string to display some information about each item.

In the second set of curly braces ({}) in the f-string, you use a conditional expression to print dir if the item is a directory, or file if it isn’t. To get this information, you use the .is_dir() method.

Putting a Path object in an f-string automatically casts the object to a string, which is why you no longer have the WindowsPath or PosixPath annotation.
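As a quick illustration, which assumes nothing beyond the standard library, an f-string gives you the same text as calling str() on the Path object, while repr() keeps the class annotation:

```python
import pathlib

item = pathlib.Path("Desktop") / "todo.txt"

# f-string interpolation calls str() under the hood, dropping
# the WindowsPath/PosixPath wrapper from the output
assert f"{item}" == str(item)

# repr() keeps the class name, which is what the REPL displays
print(repr(item))
print(f"{item} - {'dir' if item.is_dir() else 'file'}")
```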

Iterating over the object deliberately with a for loop like this can be very handy for filtering by either files or directories, as in the following example:

>>>

>>> desktop = pathlib.Path("Desktop")
>>> for item in desktop.iterdir():
...     if item.is_file():
...         print(item)
...
Desktop\todo.txt

Here, you use a conditional statement and the .is_file() method to only print the item if it’s a file.

You can also place generators into comprehensions, which can make for very concise code:

>>>

>>> desktop = pathlib.Path("Desktop")
>>> [item for item in desktop.iterdir() if item.is_dir()]
[WindowsPath('Desktop/Notes'),
 WindowsPath('Desktop/realpython'),
 WindowsPath('Desktop/scripts')]

Here, you’re filtering the resulting list by using a conditional expression inside the comprehension to check if the item is a directory.

But what if you need all the files and directories in the subdirectories of your folder too? You can adapt .iterdir() as a recursive function, as you’ll do later in the tutorial, but you may be better off using .rglob(), which you’ll get into next.

Recursively Listing With .rglob()

Directories are often compared with trees because of their recursive nature. In trees, the main trunk splits off into various main branches. Each main branch splits off into further sub-branches. Each sub-branch branches off itself too, and so on. Likewise, directories contain subdirectories, which contain subdirectories, which contain more subdirectories, on and on.

To recursively list the items in a directory means to list not only the directory’s contents, but also the contents of the subdirectories, their subdirectories, and so on.

With pathlib, it’s surprisingly easy to recurse through a directory. You can use .rglob() to return absolutely everything:

>>>

>>> import pathlib
>>> desktop = pathlib.Path("Desktop")

>>> # .rglob() produces a generator too
>>> desktop.rglob("*")
<generator object Path.glob at 0x000001A8A50E2F00>

>>> # Which you can wrap in a list() constructor to materialize
>>> list(desktop.rglob("*"))
[WindowsPath('Desktop/Notes'),
 WindowsPath('Desktop/realpython'),
 WindowsPath('Desktop/scripts'),
 WindowsPath('Desktop/todo.txt'),
 WindowsPath('Desktop/Notes/hash-tables.md'),
 WindowsPath('Desktop/realpython/iterate-dict.md'),
 WindowsPath('Desktop/realpython/tictactoe.md'),
 WindowsPath('Desktop/scripts/rename_files.py'),
 WindowsPath('Desktop/scripts/request.py')]

The .rglob() method with "*" as an argument produces a generator that yields all the files and folders from the Path object recursively.

But what’s with the asterisk argument to .rglob()? In the next section, you’ll look into glob patterns and see how you can do more than just list all the items in a directory.

Using a Python Glob Pattern for Conditional Listing

Sometimes you don’t want all the files. There are times when you just want one type of file or directory, or perhaps all the items with a certain pattern of characters in their name.

A method related to .rglob() is the .glob() method. Both of these methods make use of glob patterns. A glob pattern represents a collection of paths. Glob patterns make use of wildcard characters to match on certain criteria. For example, the single asterisk * matches everything in the directory.

There are many different glob patterns that you can take advantage of. Check out the following selection of glob patterns for some ideas:

Glob Pattern Matches
* Every item
*.txt Every item ending in .txt, such as notes.txt or hello.txt
?????? Every item whose name is six characters long, such as 01.txt, A-01.c, or .zshrc
A* Every item that starts with the character A, such as Album, A.txt, or AppData
[abc][abc][abc] Every item whose name is three characters long but only composed of the characters a, b, and c, such as abc, aaa, or cba

With these patterns, you can flexibly match many different types of files. Check out the documentation on fnmatch, which is the underlying module governing the behavior of .glob(), to get a feel for the other patterns that you can use in Python.
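You can experiment with these patterns directly through fnmatch.fnmatchcase(), which applies fnmatch-style matching without the OS-dependent case folding. The file names below are made-up examples:

```python
import fnmatch

# "*.txt" matches anything ending in .txt
assert fnmatch.fnmatchcase("notes.txt", "*.txt")
assert not fnmatch.fnmatchcase("notes.md", "*.txt")

# "??????" matches names exactly six characters long
assert fnmatch.fnmatchcase("A-01.c", "??????")
assert fnmatch.fnmatchcase(".zshrc", "??????")
assert not fnmatch.fnmatchcase("hello.txt", "??????")

# "A*" matches names starting with a capital A
names = ["Album", "A.txt", "apple", "banana"]
assert [n for n in names if fnmatch.fnmatchcase(n, "A*")] == ["Album", "A.txt"]

# "[abc][abc][abc]" matches three characters drawn from a, b, and c
assert fnmatch.fnmatchcase("cba", "[abc][abc][abc]")
assert not fnmatch.fnmatchcase("abd", "[abc][abc][abc]")
```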

Note that on Windows, glob patterns are case-insensitive, because paths are case-insensitive in general. On Unix-like systems like Linux and macOS, glob patterns are case-sensitive.

Conditional Listing Using .glob()

The .glob() method of a Path object behaves in much the same way as .rglob(). If you pass the "*" argument, then you’ll get a list of items in the directory, but without recursion:

>>>

>>> import pathlib
>>> desktop = pathlib.Path("Desktop")

>>> # .glob() produces a generator too
>>> desktop.glob("*")
<generator object Path.glob at 0x000001A8A50E2F00>

>>> # Which you can wrap in a list() constructor to materialize
>>> list(desktop.glob("*"))
[WindowsPath('Desktop/Notes'),
 WindowsPath('Desktop/realpython'),
 WindowsPath('Desktop/scripts'),
 WindowsPath('Desktop/todo.txt')]

Using the .glob() method with the "*" glob pattern on a Path object produces a generator that yields all the items in the directory that the Path object represents, without going into the subdirectories. In this way, it produces the same result as .iterdir(), and you can use the resulting generator in a for loop or a comprehension, just as you would with .iterdir().

But as you already learned, what really sets the glob methods apart are the different patterns that you can use to match only certain paths. If you only wanted paths that ended with .txt, for example, then you could do the following:

>>>

>>> desktop = pathlib.Path("Desktop")
>>> list(desktop.glob("*.txt"))
[WindowsPath('Desktop/todo.txt')]

Since this directory only has one text file, you get a list with just one item. If you wanted to get only items that start with real, for example, then you could use the following glob pattern:

>>>

>>> list(desktop.glob("real*"))
[WindowsPath('Desktop/realpython')]

This example also only produces one item, because only one item’s name starts with the characters real. Remember that on Unix-like systems, glob patterns are case-sensitive.

You can also get the contents of a subdirectory by including its name, a forward slash (/), and an asterisk. This type of pattern will yield everything inside the target directory:

>>>

>>> list(desktop.glob("realpython/*"))
[WindowsPath('Desktop/realpython/iterate-dict.md'),
 WindowsPath('Desktop/realpython/tictactoe.md')]

In this example, using the "realpython/*" pattern yields all the files within the realpython directory. It’ll give you the same result as creating a path object representing the Desktop/realpython path and calling .glob("*") on it.

Next up, you’ll look a bit further into filtering with .rglob() and learn how it differs from .glob().

Conditional Listing Using .rglob()

Just the same as with the .glob() method, you can adjust the glob pattern of .rglob() to give you only a certain file extension, except that .rglob() will always search recursively:

>>>

>>> list(desktop.rglob("*.md"))
[WindowsPath('Desktop/Notes/hash-tables.md'),
 WindowsPath('Desktop/realpython/iterate-dict.md'),
 WindowsPath('Desktop/realpython/tictactoe.md')]

By adding .md to the glob pattern, now .rglob() produces only .md files across different directories and subdirectories.

You can actually use .glob() and get it to behave in the same way as .rglob() by adjusting the glob pattern passed as an argument:

>>>

>>> list(desktop.glob("**/*.md"))
[WindowsPath('Desktop/Notes/hash-tables.md'),
 WindowsPath('Desktop/realpython/iterate-dict.md'),
 WindowsPath('Desktop/realpython/tictactoe.md')]

In this example, you can see that the call to .glob("**/*.md") is equivalent to .rglob("*.md"). Likewise, a call to .glob("**/*") is equivalent to .rglob("*").
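As a sanity check, the sketch below builds a small throwaway tree (the names are arbitrary) and confirms that the two spellings produce identical results:

```python
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "notes").mkdir()
    (root / "notes" / "a.md").write_text("a")
    (root / "b.md").write_text("b")

    # "**" also matches the top-level directory itself, so both
    # calls see b.md as well as notes/a.md
    recursive_glob = sorted(root.glob("**/*.md"))
    recursive_rglob = sorted(root.rglob("*.md"))
    assert recursive_glob == recursive_rglob
    assert {p.name for p in recursive_glob} == {"a.md", "b.md"}
```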

The .rglob() method is a slightly more explicit version of calling .glob() with a recursive pattern, so it’s probably better practice to use the more explicit version instead of using recursive patterns with the normal .glob().

Advanced Matching With the Glob Methods

One of the potential drawbacks with the glob methods is that you can only select files based on glob patterns. If you want to do more advanced matching or filter on the attributes of the item, then you need to reach for something extra.

To run more complex matching and filtering, you can follow at least three strategies. You can use:

  1. A for loop with a conditional check
  2. A comprehension with a conditional expression
  3. The built-in filter() function

Here’s how:

>>>

>>> import pathlib
>>> desktop = pathlib.Path("Desktop")

>>> # Using a for loop
>>> for item in desktop.rglob("*"):
...     if item.is_file():
...         print(item)
...
Desktop\todo.txt
Desktop\Notes\hash-tables.md
Desktop\realpython\iterate-dict.md
Desktop\realpython\tictactoe.md
Desktop\scripts\rename_files.py
Desktop\scripts\request.py

>>> # Using a comprehension
>>> [item for item in desktop.rglob("*") if item.is_file()]
[WindowsPath('Desktop/todo.txt'),
 WindowsPath('Desktop/Notes/hash-tables.md'),
 WindowsPath('Desktop/realpython/iterate-dict.md'),
 WindowsPath('Desktop/realpython/tictactoe.md'),
 WindowsPath('Desktop/scripts/rename_files.py'),
 WindowsPath('Desktop/scripts/request.py')]

>>> # Using the filter() function
>>> list(filter(lambda item: item.is_file(), desktop.rglob("*")))
[WindowsPath('Desktop/todo.txt'),
 WindowsPath('Desktop/Notes/hash-tables.md'),
 WindowsPath('Desktop/realpython/iterate-dict.md'),
 WindowsPath('Desktop/realpython/tictactoe.md'),
 WindowsPath('Desktop/scripts/rename_files.py'),
 WindowsPath('Desktop/scripts/request.py')]

In these examples, you’ve first called the .rglob() method with the "*" pattern to get all the items recursively. This produces all the items in the directory and its subdirectories. Then you use the three different approaches listed above to filter out the items that aren’t files. Note that in the case of filter(), you’ve used a lambda function.

The glob methods are extremely versatile, but for large directory trees, they can be a bit slow. In the next section, you’ll be examining an example in which reaching for more controlled iteration with .iterdir() may be a better choice.

Opting Out of Listing Junk Directories

Say, for example, that you wanted to find all the files on your system, but you have various subdirectories that have lots and lots of subdirectories and files. Some of the largest subdirectories are temporary files that you aren’t interested in.

For example, examine this directory tree that has junk directories—lots of them! In reality, this full directory tree is 1,850 lines long. Wherever you see an ellipsis (...), that means that there are hundreds of junk files at that location:

large_dir/
├── documents/
│   ├── notes/
│   │   ├── temp/
│   │   │   ├── 2/
│   │   │   │   ├── 0.txt
│   │   │   │   ...
│   │   │   │
│   │   │   ├── 0.txt
│   │   │   ...
│   │   │
│   │   ├── 0.txt
│   │   └── find_me.txt
│   │
│   ├── tools/
│   │   ├── temporary_files/
│   │   │   ├── logs/
│   │   │   │   ├── 0.txt
│   │   │   │   ...
│   │   │   │
│   │   │   ├── temp/
│   │   │   │   ├── 0.txt
│   │   │   │   ...
│   │   │   │
│   │   │   ├── 0.txt
│   │   │   ...
│   │   │
│   │   ├── 33.txt
│   │   ├── 34.txt
│   │   ├── 36.txt
│   │   ├── 37.txt
│   │   └── real_python.txt
│   │
│   ├── 0.txt
│   ├── 1.txt
│   ├── 2.txt
│   ├── 3.txt
│   └── 4.txt
│
├── temp/
│   ├── 0.txt
│   ...
│
└── temporary_files/
    ├── 0.txt
    ...

The issue here is that you have junk directories. The junk directories are sometimes called temp, sometimes temporary files, and sometimes logs. What makes it worse is that they’re everywhere and can be at any level of nesting. The good news is that you don’t have to list them, as you’ll learn next.

Using .rglob() to Filter Whole Directories

If you use .rglob(), you can just filter out the items once they’re produced by .rglob(). To properly discard paths that are in a junk directory, you can check if any of the elements in the path match with any of the elements in a list of directories to skip:

>>>

>>> SKIP_DIRS = ["temp", "temporary_files", "logs"]

Here, you’re defining SKIP_DIRS as a list that contains the strings of the paths that you want to exclude.

A call to .rglob() with a bare asterisk as an argument will produce all the items, even those in the directories that you aren’t interested in. Because you have to go through all the items, there’s a potential issue if you only look at the name of a path:

large_dir/documents/notes/temp/2/0.txt

Since the name is just 0.txt, it wouldn’t match any items in SKIP_DIRS. You’d need to check the whole path for the blocked name.

You can get all the elements in the path with the .parts attribute, which contains a tuple of all the elements in the path:

>>>

>>> import pathlib
>>> temp_file = pathlib.Path("large_dir/documents/notes/temp/2/0.txt")
>>> temp_file.parts
('large_dir', 'documents', 'notes', 'temp', '2', '0.txt')

Then, all you need to do is to check if any element in the .parts tuple is in the list of directories to skip.

You can check if any two iterables have an item in common by taking advantage of sets. If you cast one of the iterables to a set, then you can use the .isdisjoint() method to determine whether they have any elements in common:

>>>

>>> {"documents", "notes", "find_me.txt"}.isdisjoint({"temp", "temporary"})
True

>>> {"documents", "temp", "find_me.txt"}.isdisjoint({"temp", "temporary"})
False

If the two sets have no elements in common, then .isdisjoint() returns True. If the two sets have at least one element in common, then .isdisjoint() returns False. You can incorporate this check into a for loop that goes over all the items returned by .rglob("*"):

>>>

>>> SKIP_DIRS = ["temp", "temporary_files", "logs"]
>>> large_dir = pathlib.Path("large_dir")

>>> # With a for loop
>>> for item in large_dir.rglob("*"):
...     if set(item.parts).isdisjoint(SKIP_DIRS):
...         print(item)
...
large_dir\documents
large_dir\documents\0.txt
large_dir\documents\1.txt
large_dir\documents\2.txt
large_dir\documents\3.txt
large_dir\documents\4.txt
large_dir\documents\notes
large_dir\documents\tools
large_dir\documents\notes\0.txt
large_dir\documents\notes\find_me.txt
large_dir\documents\tools\33.txt
large_dir\documents\tools\34.txt
large_dir\documents\tools\36.txt
large_dir\documents\tools\37.txt
large_dir\documents\tools\real_python.txt

In this example, you print all the items in large_dir that aren’t in any of the junk directories.

To check if the path is in one of the unwanted folders, you cast item.parts to a set and use .isdisjoint() to check if SKIP_DIRS and .parts don’t have any items in common. If that’s the case, then the item gets printed.

You can also accomplish the same effect with filter() and comprehensions, as below:

>>>

>>> # With a comprehension
>>> [
...     item
...     for item in large_dir.rglob("*")
...     if set(item.parts).isdisjoint(SKIP_DIRS)
... ]

>>> # With filter()
>>> list(
...     filter(
...         lambda item: set(item.parts).isdisjoint(SKIP_DIRS),
...         large_dir.rglob("*")
...     )
... )

These methods are already getting a bit cryptic and hard to follow, though. Not only that, but they aren’t very efficient, because the .rglob() generator has to produce every item so that the matching operation can then discard the unwanted ones.

You can definitely filter out whole folders with .rglob(), but you can’t get away from the fact that the resulting generator will yield all the items and then filter out the undesirable ones, one by one. This can make the glob methods very slow, depending on your use case. That’s why you might opt for a recursive .iterdir() function, which you’ll explore next.

Creating a Recursive .iterdir() Function

In the example of junk directories, you ideally want the ability to opt out of iterating over all the files in a given subdirectory if they match one of the names in SKIP_DIRS:

# skip_dirs.py

import pathlib

SKIP_DIRS = ["temp", "temporary_files", "logs"]

def get_all_items(root: pathlib.Path, exclude=SKIP_DIRS):
    for item in root.iterdir():
        if item.name in exclude:
            continue
        yield item
        if item.is_dir():
            yield from get_all_items(item, exclude)

In this module, you define a list of strings, SKIP_DIRS, that contains the names of directories that you’d like to ignore. Then you define a generator function that uses .iterdir() to go over each item.

The generator function uses the type annotation : pathlib.Path after the first argument to indicate that you can’t just pass in a string that represents a path. The argument needs to be a Path object.

If the item name is in the exclude list, then you just move on to the next item, skipping the whole subdirectory tree in one go.

If the item isn’t in the list, then you yield the item, and if it’s a directory, you invoke the function again on that directory. That is, within the function body, the function conditionally invokes the same function again. This is a hallmark of a recursive function.

This recursive function efficiently yields all the files and directories that you want, excluding all that you aren’t interested in:

>>>

>>> import pathlib
>>> import skip_dirs
>>> large_dir = pathlib.Path("large_dir")

>>> list(skip_dirs.get_all_items(large_dir))
[WindowsPath('large_dir/documents'),
 WindowsPath('large_dir/documents/0.txt'),
 WindowsPath('large_dir/documents/1.txt'),
 WindowsPath('large_dir/documents/2.txt'),
 WindowsPath('large_dir/documents/3.txt'),
 WindowsPath('large_dir/documents/4.txt'),
 WindowsPath('large_dir/documents/notes'),
 WindowsPath('large_dir/documents/notes/0.txt'),
 WindowsPath('large_dir/documents/notes/find_me.txt'),
 WindowsPath('large_dir/documents/tools'),
 WindowsPath('large_dir/documents/tools/33.txt'),
 WindowsPath('large_dir/documents/tools/34.txt'),
 WindowsPath('large_dir/documents/tools/36.txt'),
 WindowsPath('large_dir/documents/tools/37.txt'),
 WindowsPath('large_dir/documents/tools/real_python.txt')]

Crucially, you’ve managed to opt out of having to examine all the files in the undesired directories. Once your generator identifies that the directory is in the SKIP_DIRS list, it just skips the whole thing.

So, in this case, using .iterdir() is going to be far more efficient than the equivalent glob methods.

In fact, you’ll find that .iterdir() is generally more efficient than the glob methods if you need to filter on anything more complex than can be achieved with a glob pattern. However, if all you need to do is to get a list of all the .txt files recursively, then the glob methods will be faster.
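If you want to verify this on your own machine, here's a rough benchmarking sketch using the timeit module on a disposable directory. The tree size and repetition count are arbitrary, and the absolute numbers will vary by OS and file system:

```python
import pathlib
import tempfile
import timeit

def iterdir_txt(root):
    # Recursive .iterdir() equivalent of root.rglob("*.txt")
    for item in root.iterdir():
        if item.is_dir():
            yield from iterdir_txt(item)
        elif item.suffix == ".txt":
            yield item

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    sub = root / "sub"
    sub.mkdir()
    for i in range(100):
        (sub / f"{i}.txt").write_text("x")

    # Both strategies find the same files...
    assert sorted(iterdir_txt(root)) == sorted(root.rglob("*.txt"))

    # ...so only the timings differ
    rglob_time = timeit.timeit(lambda: list(root.rglob("*.txt")), number=50)
    iter_time = timeit.timeit(lambda: list(iterdir_txt(root)), number=50)
    print(f"rglob: {rglob_time:.4f}s  iterdir: {iter_time:.4f}s")
```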

Check out the downloadable materials for some tests that demonstrate the relative speed of different ways to list files in Python.

With that information under your belt, you’ll be ready to select the best way to list the files and folders that you need!

Conclusion

In this tutorial, you’ve explored the .glob(), .rglob(), and .iterdir() methods from the Python pathlib module to get all the files and folders in a given directory into a list. You’ve covered listing the files and folders that are direct descendants of the directory, and you’ve also looked at recursive listing.

In general, you’ve seen that if you just need a basic list of the items in the directory, without recursion, then .iterdir() is the cleanest method to use, thanks to its descriptive name. It’s also more efficient at this job. If, however, you need a recursive list, then you’re better off going with .rglob(), which will be faster than an equivalent recursive .iterdir() function.

You’ve also examined one example in which using .iterdir() to list recursively can produce a huge performance benefit—when you have junk folders that you want to opt out of iterating over.

In the downloadable materials, you’ll find various implementations of methods to get a basic list of files from both the pathlib and os modules, along with a couple of scripts that time them all against one another.

Check them out, modify them, and share your findings in the comments!

January 21, 2022 | Python

More and more programmers prefer working with Python these days, because it’s a very flexible language that makes it easy to interact with the operating system, and it ships with functions for working with the file system. The task of printing the list of files in a directory can be solved with several different modules: os, subprocess, fnmatch, and pathlib. The following solutions demonstrate how to use these modules successfully.

Using os.walk()

The os module contains a long list of methods that deal with the file system and the operating system. One of them is walk(), which generates the file names in a directory tree by walking the tree either top-down or bottom-up (top-down by default).

For each directory it visits, os.walk() yields a tuple of three elements: the path of the current directory, a list of the names of its subdirectories, and a list of the names of the files in it. It works equally well with Python 2 and 3 interpreters.

import os
for root, dirs, files in os.walk("."):  
    for filename in files:
        print(filename)
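The snippet above prints only the bare file names. A common variation, sketched below, joins each name with the root directory that os.walk() supplies, so that you collect usable full paths in a list:

```python
import os

# Collect full paths by joining each file name with the directory
# that os.walk() reports it in
all_files = []
for root, dirs, files in os.walk("."):
    for filename in files:
        all_files.append(os.path.join(root, filename))

print(len(all_files), "files found")
```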

Using the command line via subprocess

The subprocess module lets you run a system command and collect its output. In our case, the system command to invoke looks like this:

$ ls -p . | grep -v /$

The command ls -p . lists the files of the current directory, appending the / separator to the name of each subdirectory, which we’ll need in the next step. Its output is piped into the grep command, which filters the data as it arrives.
The arguments -v /$ exclude all entry names that end with the / separator. In fact, /$ is a regular expression that matches all strings whose very last character is /, as indicated by the $ anchor.

The subprocess module lets you build real pipelines and connect input and output streams, just as you would on the command line. A call to subprocess.Popen() opens the corresponding process and defines the two parameters stdin and stdout.

The first variable, ls, defines the ls -p process, capturing stdout in the pipeline, so its stdout stream is set to subprocess.PIPE. The second variable, grep, is also defined as a process, but it executes grep -v /$ instead.

To read the output of ls from the pipeline, grep’s stdin stream is assigned ls.stdout. Finally, the endOfPipe variable reads the output of grep from grep.stdout, which the for loop then prints to stdout.

import subprocess
# define the ls command
ls = subprocess.Popen(["ls", "-p", "."],
                      stdout=subprocess.PIPE,
                      )
# define the grep command
grep = subprocess.Popen(["grep", "-v", "/$"],
                        stdin=ls.stdout,
                        stdout=subprocess.PIPE,
                        )
# read the data from the stdout stream
endOfPipe = grep.stdout

# print the files line by line
for line in endOfPipe:
    print(line)

Running the script

$ python find-files3.py
find-files2.py  
find-files3.py  
find-files4.py  
...

This solution works fairly well with both Python 2 and 3, but it can be improved. Let’s look at other options.

Combining os and fnmatch

The subprocess-based solution is elegant but requires a lot of code. Instead, let’s combine methods from the two modules os and fnmatch. This variant also works with Python 2 and 3.

As a first step, we import the os and fnmatch modules. Then we define the directory whose files we want to list using os.listdir(), along with a pattern for filtering the files. The for loop iterates over the list of entries stored in the listOfFiles variable.

Finally, fnmatch filters out the entries we’re looking for and prints the matching ones to stdout.

import os, fnmatch
listOfFiles = os.listdir('.')
pattern = "*.py"
for entry in listOfFiles:
    if fnmatch.fnmatch(entry, pattern):
        print(entry)

The output:

$ python find-files.py
find-files.py  
find-files2.py  
find-files3.py  
...

Using os.listdir() and generators

The next variant combines the os.listdir() method with a generator function. The code works with both Python 2 and Python 3.

As mentioned earlier, listdir() returns the list of entries for a given directory. The os.path.isfile() method returns True if a given entry is a file. The yield statement hands a value back to the caller while preserving the function’s current state, so the function returns only the names of entries that are files.

import os
def files(path):
    for file in os.listdir(path):
        if os.path.isfile(os.path.join(path, file)):
            yield file

for file in files("."):  
    print(file)

Using pathlib

The pathlib module is designed for parsing, building, testing, and otherwise working with file names and paths through an object-oriented API instead of low-level string operations. It has been part of the standard library since Python 3.4.

The following listing defines the current directory with a dot ("."). The iterdir() method then returns an iterator that yields the names of all the files, and the for loop prints them one after another.

import pathlib
# define the path
currentDirectory = pathlib.Path('.')
for currentFile in currentDirectory.iterdir():
    print(currentFile)

Alternatively, you can filter the files by name with the glob method and get only the files you need. For example, the code below lists the Python files in the chosen directory by passing the pattern "*.py" to glob.

import pathlib
# define the path
currentDirectory = pathlib.Path('.')

# define the pattern
currentPattern = "*.py"
for currentFile in currentDirectory.glob(currentPattern):
    print(currentFile)

Using os.scandir()

Python 3.5 added the new scandir() method, available in the os module. As its name suggests, it greatly simplifies getting the list of files in a directory.

To determine the current working directory and store it, we initialize the path variable by importing the os module and calling the getcwd() function. Then, scandir() returns an iterator of entries for the chosen path, each of which is checked with the is_file() method to see whether it’s a file.

import os
# determine the current working directory
path = os.getcwd()
# read the entries
with os.scandir(path) as listOfEntries:
    for entry in listOfEntries:
        # print all entries that are files
        if entry.is_file():
            print(entry.name)

Conclusion

There’s some debate about which variant is the best, which is the most elegant, and which is the most “Pythonic”. I like the simplicity of os.walk(), as well as the fnmatch and pathlib modules.

The two versions using processes/pipelines and iterators require a deeper understanding of UNIX processes and more Python knowledge, so they may not be every programmer’s preference because of their extra (and excessive) complexity.

To settle the question, let’s pick the fastest of them using the handy timeit module, which measures the time elapsed between two events.

To compare all the solutions without modifying them, we’ll call the interpreter with the timeit module and the corresponding Python script. To automate the process, we’ll write a shell script:

#! /bin/bash

for filename in *.py; do
    echo "$filename:"
    # pass the whole script as the statement for timeit to run
    python3 -m timeit "$(cat "$filename")"
    echo " "
done

The tests were run with Python 3.6. Among all of them, os.walk() performed best. Running the tests with Python 2 returns different values, but os.walk() still comes out on top.
