Three lines of Python to permanently fix garbled file names when unzipping on Linux

Background

I wrote some code that unpacks a zip archive with Python's shutil standard library module:

import shutil


def unzip(self, src_path: str, dst_path: str):
    # shutil.unpack_archive("../README.md.zip", "../")
    # shutil.unpack_archive("../docxx.zip", "../")
    shutil.unpack_archive(src_path, dst_path)

After unpacking, the file names were garbled, although the file contents were intact.

The code runs on Ubuntu, and extracting with the unzip command shows the same problem:

$ unzip HEAP.zip 
Archive:  HEAP.zip
  inflating: 20230329-SOC│╡╘╞╥╗╠х╗у▒и-1.pptx  
  inflating: 20230330-EEA-╒√│╡╡ч╫╙╡ч╞°╝▄╣╣ v1.pptx  
  inflating: 20230412-╬╩╠т╨▐╕─.md

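On Windows the problem does not occur and the same code works correctly. The archive also happens to have been created on Windows, so the cause is almost certainly that the two operating systems use different default encodings.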

Solution

To unpack a zip archive, shutil calls into the zipfile standard library module. The relevant code is:

def _unpack_zipfile(filename, extract_dir):
    """Unpack zip `filename` to `extract_dir`
    """
    import zipfile  # late import for breaking circular dependency

    if not zipfile.is_zipfile(filename):
        raise ReadError("%s is not a zip file" % filename)

    zip = zipfile.ZipFile(filename)
    try:
        for info in zip.infolist():
            name = info.filename

            # don't extract absolute paths or ones with .. in them
            if name.startswith('/') or '..' in name:
                continue

            target = os.path.join(extract_dir, *name.split('/'))
            if not target:
                continue

            _ensure_directory(target)
            if not name.endswith('/'):
                # file
                data = zip.read(info.filename)
                f = open(target, 'wb')
                try:
                    f.write(data)
                finally:
                    f.close()
                    del data
    finally:
        zip.close()

The problem lies in this line:

            name = info.filename

It reads the file name, but the name has not been decoded with the correct encoding. The following change fixes it:

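            if info.flag_bits & 0x800:  # UTF-8 flag set: name is already correct
                name = info.filename
            else:
                try:
                    # without the UTF-8 flag, zipfile decodes names as cp437;
                    # recover the raw bytes and decode them as GBK instead
                    name = info.filename.encode('cp437').decode('gbk')  # GBK is ASCII-compatible
                except UnicodeError:
                    # not valid GBK (or not cp437-decoded): keep the name as read
                    name = info.filename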
The cause is simple: the file name was stored in the Windows encoding (GBK), but zipfile decoded it as cp437, which produces garbled text.
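To see the mechanism in isolation, here is a minimal sketch that reproduces the garbling and then reverses it. The helper names `garble` and `repair` and the sample file name are made up for illustration; they are not part of shutil or zipfile.

```python
def garble(name: str) -> str:
    """Simulate a GBK-encoded name being decoded as cp437."""
    return name.encode("gbk").decode("cp437")


def repair(garbled: str) -> str:
    """Recover the raw bytes via cp437, then decode them as GBK."""
    return garbled.encode("cp437").decode("gbk")


original = "问题修改.md"  # sample name, chosen for illustration
garbled = garble(original)

print(garbled != original)          # the cp437 view differs from the real name
print(repair(garbled) == original)  # the round trip restores it
```

The round trip works because cp437 assigns a character to every byte value, so no information is lost when zipfile decodes the raw bytes with it.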

info.flag_bits is a set of flag bits; one of them (0x800) indicates whether the name is UTF-8 encoded. See the root cause section below for details.
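As a quick way to check the flag on real archives, the sketch below lists each entry together with the state of that bit. The function name `utf8_flags` and the sample entry names are assumptions for the demo; the in-memory archive relies on the fact that Python's own zipfile sets the flag when it writes a name that is not pure ASCII.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag


def utf8_flags(zip_bytes: bytes) -> dict:
    """Map each entry name to True if its UTF-8 flag is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in zf.infolist()
        }


# Build a small archive in memory with one ASCII and one non-ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("readme.txt", b"ascii name")
    zf.writestr("报告.txt", b"non-ascii name")

flags = utf8_flags(buf.getvalue())
print(flags["readme.txt"], flags["报告.txt"])
```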

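Some tutorials show how to patch the Python standard library source to fix this. That is risky and I do not recommend it.

My approach is to use shutil.unregister_unpack_format() and shutil.register_unpack_format() to replace the function that unpacks zip archives at runtime.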
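The registration mechanism itself can be tried on a throwaway format first. In this sketch the format name `demo`, the extension `.demo`, and the stand-in unpacker are all invented for illustration; it also shows why the unregister step is needed, since an extension can belong to only one registered format.

```python
import shutil

calls = []


def _unpack_demo(filename, extract_dir):
    """Stand-in unpacker: records its arguments instead of extracting."""
    calls.append((filename, extract_dir))


# 'demo' and '.demo' are made-up names used only for this illustration.
shutil.register_unpack_format("demo", [".demo"], _unpack_demo, [], "demo format")

# unpack_archive picks the unpacker by file extension.
shutil.unpack_archive("example.demo", "out")

# Registering a second handler for '.zip' without unregistering the
# built-in one first is rejected.
try:
    shutil.register_unpack_format("zip2", [".zip"], _unpack_demo)
    already_taken = False
except shutil.RegistryError:
    already_taken = True

shutil.unregister_unpack_format("demo")  # clean up the demo format
```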

Show me the code

The complete code:

import os
import shutil


def _unpack_zipfile(filename, extract_dir):
    """Unpack zip `filename` to `extract_dir`
    """
    import zipfile  # late import for breaking circular dependency

    if not zipfile.is_zipfile(filename):
        raise shutil.ReadError("%s is not a zip file" % filename)

    zip = zipfile.ZipFile(filename)
    try:
        for info in zip.infolist():
            # name = info.filename

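            # Support zip files created on Windows without garbled names =====
            if info.flag_bits & 0x800:  # UTF-8 flag set: name is already correct
                name = info.filename
            else:
                try:
                    # without the UTF-8 flag, zipfile decodes names as cp437;
                    # recover the raw bytes and decode them as GBK instead
                    name = info.filename.encode('cp437').decode('gbk')  # GBK is ASCII-compatible
                except UnicodeError:
                    # not valid GBK (or not cp437-decoded): keep the name as read
                    name = info.filename
            # ========================================================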

            # don't extract absolute paths or ones with .. in them
            if name.startswith('/') or '..' in name:
                continue

            target = os.path.join(extract_dir, *name.split('/'))
            if not target:
                continue

            os.makedirs(os.path.dirname(target), exist_ok=True)  # ensure the parent directory exists
            if not name.endswith('/'):
                # file
                data = zip.read(info.filename)
                f = open(target, 'wb')
                try:
                    f.write(data)
                finally:
                    f.close()
                    del data
    finally:
        zip.close()


shutil.unregister_unpack_format('zip')
shutil.register_unpack_format('zip', ['.zip'], _unpack_zipfile, [], "ZIP file")
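For comparison, on Python 3.11 and later zipfile.ZipFile accepts a metadata_encoding argument (read mode only) that tells it how to decode names whose UTF-8 flag is not set. The sketch below is an alternative to the replacement unpacker above, not the approach used in this article; the function name `unpack_zip_with_encoding` is hypothetical, and it assumes you already know the archive uses GBK.

```python
import zipfile


def unpack_zip_with_encoding(src_path, dst_path, encoding="gbk"):
    """Extract a zip whose unflagged names are stored in `encoding`.

    Requires Python 3.11+, where ZipFile accepts metadata_encoding.
    """
    with zipfile.ZipFile(src_path, metadata_encoding=encoding) as zf:
        zf.extractall(dst_path)
```

Unlike the registered unpacker, this does not change what shutil.unpack_archive does; callers have to use the helper directly.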

Root cause analysis

It is worth knowing not only that the fix works, but also why.

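ZIP is an old archive format. It originated in the DOS era on IBM PC compatibles and is still one of the mainstream archive formats today. DOS at that time had no support for Unicode or UTF-8; computers in different countries installed different code pages and could only handle the local script. Like DOS itself, ZIP was not designed with a unified Unicode encoding in mind, so file names were stored in whatever the default encoding of the creating system happened to be.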

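Today, with Unicode and UTF-8 widespread, more and more systems support UTF-8, an encoding that can represent text in virtually every language. The ZIP format gained a flag bit that indicates whether the file names in an archive are UTF-8 encoded. However, the zip code that ships with mainstream operating systems is poorly maintained, much of it does not follow the latest ZIP specification, and the file systems of different operating systems do not handle encodings uniformly.

Linux, for example, does not use GBK by default. On Windows the default encoding for Chinese is GBK, and even Windows 10 still relies on code pages to determine the system language. Zip archives created on Windows therefore store names in the local encoding (GBK by default) without setting the UTF-8 flag, although they may use a special ZIP feature, the Unicode Path extra field, to carry a UTF-8 copy of the name. macOS uses UTF-8 as its default encoding, so names are stored as UTF-8, yet the UTF-8 flag is not set there either. Operating systems also differ in how they treat case in file names: Linux is case-sensitive, while macOS and Windows are case-insensitive by default.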
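When an archive does carry the Unicode Path extra field (header ID 0x7075 in the Info-ZIP convention), the UTF-8 name can be read from the entry's extra data. The parser below is a minimal sketch, assuming the version 1 layout of that field (one version byte, a CRC-32 of the original name bytes, then the UTF-8 name); the function name `unicode_path_from_extra` is made up for illustration.

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path extra field


def unicode_path_from_extra(extra: bytes, raw_name: bytes):
    """Return the UTF-8 name stored in the extra data, or None."""
    pos = 0
    while pos + 4 <= len(extra):
        # Each extra field starts with a 2-byte ID and a 2-byte size.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        body = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(body) < 5:
            continue
        version, name_crc = struct.unpack_from("<BL", body)
        # The CRC ties the field to the name stored in the main header.
        if version == 1 and name_crc == zlib.crc32(raw_name):
            return body[5:].decode("utf-8")
    return None
```

With zipfile, `extra` would be `info.extra`, and for an entry without the UTF-8 flag the raw name bytes can be recovered as `info.orig_filename.encode('cp437')`.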

Because the extracting program cannot know which system encoded a given zip file, it cannot reliably choose the matching decoding.
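In practice this leaves guessing. The sketch below is one possible heuristic, not a reliable detector: it tries strict UTF-8 first (which covers archives that store UTF-8 names without the flag), then GBK, and otherwise keeps the name as zipfile read it. The function name `guess_name` and the candidate order are assumptions; some byte sequences are valid in more than one encoding, so a wrong guess is possible.

```python
def guess_name(filename: str, flag_bits: int, candidates=("utf-8", "gbk")) -> str:
    """Best-effort decoding of a zip entry name as read by zipfile."""
    if flag_bits & 0x800:
        return filename  # UTF-8 flag set: already decoded correctly
    try:
        raw = filename.encode("cp437")  # undo zipfile's default decoding
    except UnicodeEncodeError:
        return filename  # was not decoded as cp437; leave it alone
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return filename
```

It would be called as `guess_name(info.filename, info.flag_bits)` inside the loop over `zip.infolist()`.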