Why do I need 'b' to encode a string with Base64?

2020/10/31 16:42 · python

Following this Python example, I encode a string as Base64 with:

>>> import base64
>>> encoded = base64.b64encode(b'data to be encoded')
>>> encoded
b'ZGF0YSB0byBiZSBlbmNvZGVk'

But if I leave out the leading b:

>>> encoded = base64.b64encode('data to be encoded')

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\base64.py", line 56, in b64encode
    raise TypeError("expected bytes, not %s" % s.__class__.__name__)
TypeError: expected bytes, not str

Why is this?

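base64 encoding takes 8-bit binary byte data and encodes it using only the characters A-Z, a-z, 0-9, + and /* so that it can be transmitted over channels that do not preserve all 8 bits of data, such as email.

Hence, it wants a string of 8-bit bytes. You create those in Python 3 with the b'' syntax.

If you remove the b, it becomes a string. A string is a sequence of Unicode characters. base64 has no idea what to do with Unicode data; it is not 8-bit. It's not really any bits, in fact. :-)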

In your second example:

>>> encoded = base64.b64encode('data to be encoded')

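All the characters fit neatly into the ASCII character set, so base64 encoding is actually a bit pointless here. You can convert the string to ASCII bytes instead, with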

>>> encoded = 'data to be encoded'.encode('ascii')

Or simpler:

>>> encoded = b'data to be encoded'

Which would be the same thing in this case.
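A small sketch of why the two forms agree for this particular string, and where .encode('ascii') stops working. The variable names are only for illustration:

```python
text = 'data to be encoded'

# For pure-ASCII text, encoding to ASCII yields the same bytes as the b'' literal.
as_bytes = text.encode('ascii')
same_as_literal = (as_bytes == b'data to be encoded')

# A non-ASCII character has no ASCII representation, so encode() raises.
try:
    'ñ'.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(same_as_literal, ascii_failed)
```

So the b'' shortcut only stands in for .encode('ascii') when every character is plain ASCII.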


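* Most base64 flavours may also include a = at the end as padding. In addition, some base64 variants may use characters other than + and /. See the "Variants summary table" on Wikipedia for an overview.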
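As a quick illustration of one such variant, the standard library also ships a URL-safe alphabet that swaps '+' and '/' for '-' and '_'. The two input bytes below were picked only because they happen to produce both special characters plus padding:

```python
import base64

raw = b'\xfb\xff'  # chosen so the output contains '+', '/' and '=' padding

standard = base64.b64encode(raw)          # standard alphabet
url_safe = base64.urlsafe_b64encode(raw)  # '-' and '_' replace '+' and '/'

print(standard)  # b'+/8='
print(url_safe)  # b'-_8='
```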

Short answer

You need to pass a bytes-like object (bytes, bytearray, etc.) to the base64.b64encode() function. Here are two ways:

>>> data = base64.b64encode(b'data to be encoded')
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'

Or with a variable:

>>> string = 'data to be encoded'
>>> data = base64.b64encode(string.encode())
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
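To make "bytes-like" a little more concrete, here is a short sketch showing that bytes, bytearray and memoryview inputs all give the same result, while a str is rejected. The variable names are illustrative only:

```python
import base64

payload = b'data to be encoded'

from_bytes = base64.b64encode(payload)
from_bytearray = base64.b64encode(bytearray(payload))
from_memoryview = base64.b64encode(memoryview(payload))

# A str is not bytes-like, so b64encode() raises TypeError.
try:
    base64.b64encode('data to be encoded')
    str_rejected = False
except TypeError:
    str_rejected = True

print(from_bytes == from_bytearray == from_memoryview, str_rejected)
```

The exact wording of the TypeError message differs between Python 3 versions, so the sketch only checks the exception type.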

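Why?

In Python 3, str objects are not C-style character arrays (so they are not byte arrays). Rather, they are data structures that do not have any inherent encoding. You can encode that string (or interpret it) in a variety of ways. The most common (and the default in Python 3) is utf-8, especially since it is backwards compatible with ASCII (although so are most widely used encodings). That is what happens when you take a string and call the .encode() method on it: Python interprets the string in utf-8 (the default encoding) and gives you the array of bytes that it corresponds to.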
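A minimal sketch of that point: the same one-character string turns into different bytes, and therefore different base64 output, depending on which encoding you pick. The choice of 'ñ' and latin-1 is just for illustration:

```python
import base64

text = 'ñ'

utf8_bytes = text.encode()             # default encoding is utf-8
latin1_bytes = text.encode('latin-1')  # a different, single-byte encoding

print(utf8_bytes)    # b'\xc3\xb1'
print(latin1_bytes)  # b'\xf1'

# base64 only ever sees the bytes, so the results differ too.
print(base64.b64encode(utf8_bytes))    # b'w7E='
print(base64.b64encode(latin1_bytes))  # b'8Q=='
```

Whoever decodes the base64 therefore also needs to know which text encoding was used to get the original string back.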

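Base-64 Encoding in Python 3

Originally the question title asked about Base-64 encoding. Read on for the Base-64 details.

base64 encoding takes 6-bit binary chunks and encodes them using the characters A-Z, a-z, 0-9, '+', '/', and '=' (some encodings use different characters in place of '+' and '/'). This is a character encoding that is based on the mathematical construct of a radix-64 or base-64 number system, but the two are very different. Base-64 in math is a number system like binary or decimal, and you do this change of radix on the entire number, or (if the radix you are converting from is a power of 2 less than 64) in chunks from right to left.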

In base64 encoding, the translation is done from left to right; those first 64 characters are why it is called base64 encoding. The 65th symbol, '=', is used for padding, since the encoding pulls 6-bit chunks but the data it is usually meant to encode are 8-bit bytes, so sometimes there are only 2 or 4 bits in the last chunk.

Example:

>>> data = b'test'
>>> for byte in data:
...     print(format(byte, '08b'), end=" ")
...
01110100 01100101 01110011 01110100
>>>

If you interpret that binary data as a single integer, then this is how you would convert it to base-10 and base-64 (table for base-64):

base-2:  01 110100 011001 010111 001101 110100 (base-64 grouping shown)
base-10:                            1952805748
base-64:  B      0      Z      X      N      0

base64 encoding, however, will re-group this data thusly:

base-2:  011101  000110  010101 110011 011101 00(0000) <- pad w/zeros to make a clean 6-bit chunk
base-10:     29       6      21     51     29      0
base-64:      d       G       V      z      d      A

So, 'B0ZXN0' is the base-64 version of our binary, mathematically speaking. However, base64 encoding has to do the encoding in the opposite direction (so the raw data is converted to 'dGVzdA') and also has a rule to tell other applications how much space is left off at the end. This is done by padding the end with '=' symbols. So, the base64 encoding of this data is 'dGVzdA==', with two '=' symbols to signify two pairs of bits will need to be removed from the end when this data gets decoded to make it match the original data.

Let's test this to see if I am being dishonest:

>>> encoded = base64.b64encode(data)
>>> print(encoded)
b'dGVzdA=='
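For anyone who wants to replay the regrouping by hand, here is a rough sketch of both conversions. manual_b64encode and radix64_digits are hypothetical helper names made up for this illustration, not part of the base64 module, and the code favours readability over speed:

```python
import base64

ALPHABET = (
    'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    'abcdefghijklmnopqrstuvwxyz'
    '0123456789+/'
)

def manual_b64encode(data: bytes) -> bytes:
    """Left-to-right 6-bit regrouping, as base64 encoding does it."""
    # Write every byte as 8 bits, in order.
    bits = ''.join(format(byte, '08b') for byte in data)
    # Pad with zero bits so the last chunk is a clean 6 bits.
    bits += '0' * (-len(bits) % 6)
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Pad with '=' so the output length is a multiple of 4.
    chars.append('=' * (-len(chars) % 4))
    return ''.join(chars).encode('ascii')

def radix64_digits(data: bytes) -> str:
    """Treat the bytes as one big integer and write it in radix 64."""
    number = int.from_bytes(data, 'big')
    digits = ''
    while number:
        number, remainder = divmod(number, 64)
        digits = ALPHABET[remainder] + digits  # least significant digit first
    return digits or ALPHABET[0]

data = b'test'
print(radix64_digits(data))    # B0ZXN0
print(manual_b64encode(data))  # b'dGVzdA=='
print(manual_b64encode(data) == base64.b64encode(data))  # True
```

The first helper reproduces the "mathematical" result, the second the actual base64 encoding, which is why the two strings differ for the same input.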

Why use base64 encoding?

Let's say I have to send some data to someone via email, like this data:

>>> data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'
>>> print(data.decode())

>>> print(data)
b'\x04msg\x08\x08\x08   '
>>>

There are two problems I planted:

  1. If I tried to send that email in Unix, the email would send as soon as the \x04 character was read, because that is ASCII for END-OF-TRANSMISSION (Ctrl-D), so the remaining data would be left out of the transmission.
  2. Also, while Python is smart enough to escape all of my evil control characters when I print the data directly, when that string is decoded as ASCII, you can see that the 'msg' is not there. That is because I used three BACKSPACE characters and three SPACE characters to erase the 'msg'. Thus, even if I didn't have the EOF character there the end user wouldn't be able to translate from the text on screen to the real, raw data.

This is just a demo to show you how hard it can be to simply send raw data. Encoding the data into base64 format gives you the exact same data but in a format that ensures it is safe for sending over electronic media such as email.
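As a rough check of that claim, the sketch below encodes the same awkward bytes, confirms the result uses only the base64 alphabet, and decodes it back to the original. The names allowed, only_safe_chars and round_trip_ok are illustrative:

```python
import base64
import string

data = b'\x04msg\x08\x08\x08   '

encoded = base64.b64encode(data)

# Every byte of the encoded form is a letter, digit, '+', '/' or '='.
allowed = set((string.ascii_letters + string.digits + '+/=').encode('ascii'))
only_safe_chars = all(byte in allowed for byte in encoded)

# Decoding gives back exactly the bytes we started with.
round_trip_ok = base64.b64decode(encoded) == data

print(encoded, only_safe_chars, round_trip_ok)
```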

If the data to be encoded contains "exotic" characters, I think you have to encode in "UTF-8"

import base64
encoded = base64.b64encode(bytes('data to be encoded', "utf-8"))

If the string is Unicode the easiest way is:

import base64                                                        

a = base64.b64encode(bytes(u'complex string: ñáéíóúÑ', "utf-8"))

# a: b'Y29tcGxleCBzdHJpbmc6IMOxw6HDqcOtw7PDusOR'

b = base64.b64decode(a).decode("utf-8", "ignore")                    

print(b)
# b :complex string: ñáéíóúÑ

There is all you need:

expected bytes, not str

The leading b makes your string binary.

What version of Python do you use? 2.x or 3.x?

Edit: See http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit for the gory details of strings in Python 3.x
