python3 – TOPICS of 「きまぐれほげほげひろば」

October 4, 2014

[python3] Tips for using the format method as an HTML template

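When writing CGI programs in python, it is common to assign an HTML template to a str (string) variable and fill in only the variable parts with string formatting. However, the format method, adopted in earnest from python3, marks its format fields with text enclosed in "{}" instead of "%", so it gets along very badly with the CSS and JavaScript embedded in HTML. (Even when there is a line break between { and }, the contents are still treated as a key.)
As with "%", a brace can be escaped by doubling it ("{" → "{{"), but doing this throughout an HTML template is very tedious and hurts readability.
So the plan is to exploit the difference in how people habitually write the two kinds of code, and replace "{" with "{{" after the fact.
Most people probably have the following habits. (This is purely my own impression.)
Format fields for python's format (no space between the braces and the name)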

 '<html><body>{body}</body></html>'.format(body="今日は晴天なり")

CSS/JavaScript (line breaks and spaces go inside {} for readability)

 <html>
  <head>
    <style type="text/css">
        <!--
            dt{ background: #bbb }
            .odd{ background: #ddd }
        -->
    </style>
    <script type="text/javascript">
       function hogehoge() {
          print("本日は晴天なり");
       }
     </script>
</head>

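With that in mind, doing the following means the HTML template no longer has to be escaped by hand every time. (The substitution that doubles the braces runs first, and format is applied afterwards.)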

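import re

htmltemplate = '''
 (HTML template goes here)
'''
# Double every "{" followed by a non-alphanumeric character, then every "}"
# preceded by one, so that only {name}-style fields are left for format()
escaped = re.sub(r'\{([^a-zA-Z0-9])', r'{{\1', htmltemplate)
escaped = re.sub(r'([^a-zA-Z0-9])\}', r'\1}}', escaped)
html = escaped.format(body="今日は晴天なり")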
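To make the effect of the two substitutions concrete, here is a minimal sketch that runs them on a small template containing both a CSS rule and a format field. The helper name escape_braces and the sample template are made up for illustration. The trick depends on the spacing habit described above, so compact CSS such as dt{color:red} would still be mistaken for a format field.

```python
import re

def escape_braces(template):
    """Double the braces that do not look like {name} format fields."""
    escaped = re.sub(r'\{([^a-zA-Z0-9])', r'{{\1', template)
    return re.sub(r'([^a-zA-Z0-9])\}', r'\1}}', escaped)

# Hypothetical template: the CSS braces have spaces inside, the field does not
template = '''<html>
<head>
<style type="text/css">
    dt{ background: #bbb }
</style>
</head>
<body>{body}</body>
</html>'''

html = escape_braces(template).format(body="It is fine today")
print(html)
```

The CSS rule comes out of format() with single braces again, and only {body} is replaced.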

October 4, 2014

[python3] Specifying the default character encoding (when run as CGI)

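In a previous post I wrote that "python3's default character encoding is UTF-8", but that is true only after logging in from a login prompt. Scripts that run without a login (CGI in particular) behave differently. The reason is that python takes its default encoding from the environment variable "LANG" and sets it automatically.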

$ echo $LANG
ja_JP.UTF-8

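Users that are likely to run CGI, such as "nobody", "apache", or "httpd", have no login shell (it is set to /bin/false, /sbin/nologin, or similar), so there is nowhere for them to set environment variables. LANG therefore falls back to its default "C" (= ascii), and python's encoding becomes ascii as well.
As a result, when a CGI script tries to output a string containing Japanese, it fails with the familiar error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-14: ordinal not in range(128)
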
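To check which encoding a given process actually ends up with, something like the following can be run once from a login shell and once as a CGI script. It is only a small diagnostic sketch, and the function name report_encodings is made up for illustration.

```python
import locale
import os
import sys

def report_encodings():
    """Collect the settings that decide python's default text encoding."""
    return {
        "LANG": os.environ.get("LANG"),
        "preferred": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
    }

# The header line matters only under CGI; in a terminal it is just printed
print("Content-Type: text/plain\n")
for name, value in report_encodings().items():
    print("{}: {}".format(name, value))
```

Only ASCII is printed, so the report itself cannot trigger the UnicodeEncodeError it is meant to diagnose.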

Solution

Add the following near the top of the source. It sets the encoding of standard output and standard error to UTF-8.

import io,sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')

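Note: if the code above is placed in both the importing source file and the imported source file, the program terminates abnormally without any message, so be careful! (No error message appears, and exception handling does not catch it.) A likely cause is that the second assignment discards the first wrapper, and a discarded TextIOWrapper closes the underlying buffer that the new wrapper also uses.

Addendum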
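One way to make the snippet safe to run more than once is to wrap a stream only when it is not already UTF-8. The following is a sketch under that assumption, not something taken from the original setup; the helper name ensure_utf8 is hypothetical.

```python
import codecs
import io
import sys

def ensure_utf8(stream):
    """Return a text stream that encodes as UTF-8, wrapping only when needed."""
    encoding = getattr(stream, "encoding", None)
    if encoding and codecs.lookup(encoding).name == "utf-8":
        return stream  # already UTF-8: do not wrap a second time
    if not hasattr(stream, "buffer"):
        return stream  # no underlying byte stream to wrap
    return io.TextIOWrapper(stream.buffer, encoding="utf-8")

sys.stdout = ensure_utf8(sys.stdout)
sys.stderr = ensure_utf8(sys.stderr)
```

Because the second call returns the stream it is given, the same lines can appear in both the importing and the imported module.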

When writing to a file, python's encoding is also ascii, but the method above does not correct that.
To get the right encoding for file output, specify the encoding when opening the file.

fh=open("hoge.txt","a",encoding='utf-8')
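As a slightly fuller sketch of the same idea, the file can be opened in a with block so that it is closed automatically, and then read back as bytes to confirm what was written. The temporary directory is used here only so that the example does not leave hoge.txt behind.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "hoge.txt")

# The explicit encoding makes the result independent of LANG
with open(path, "a", encoding="utf-8") as fh:
    fh.write("今日は晴天なり")

# Read the raw bytes back to see how the text was actually stored
with open(path, "rb") as fh:
    data = fh.read()

print(len(data))
```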

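September 25, 2014

[python3] Detecting the character encoding

In python3 the string type is Unicode, and a file opened in text mode is decoded with the default encoding (UTF-8 here). As a result, opening or downloading a file whose encoding is unknown makes read() fail with UnicodeDecodeError. (For example, EUC data cannot be stored unconverted in a variable that expects UTF-8.)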

>>> fhinput = open("eucsample.txt","r")
>>> fhinput.read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/python3/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 0: invalid start byte
>>> fhinput.close()

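If the file is opened in binary mode, read() returns a bytes object, so the file contents can be stored in a variable just as in python2, and you can then try decode (character encoding conversion) with one encoding after another. The sample below opens an EUC-encoded file in binary mode and tries several encodings in turn.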

>>> fhinput = open("eucsample.txt","rb")
>>> htmlbytes=fhinput.read()
>>> htmlbytes
b'\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\n\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\n\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa8\n\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\n\xc8\xf8\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa'
>>> htmlbytes.decode('shift_jisx0213')
'､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢､｢\n､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､､\n､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｦ､ｨ\n､ｨ､ｨ､ｨ､ｨ､ｨ､ｨ､ｨ､ｨ､ｨ､ｨ､ｨ､ｨ､ｨ､ｨ\nﾈ?ｪ､ｪ､ｪ､ｪ､ｪ､ｪ､ｪ､ｪ､ｪ､ｪ､ｪ､ｪ､ｪ､ｪ､ｪ'
>>> htmlbytes.decode('iso2022jp')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'iso2022_jp' codec can't decode byte 0xa4 in position 0: illegal multibyte sequence
>>> htmlbytes.decode('euc_jisx0213')
'ああああああああああああああああああああああああ\nいいいいいいいいいいいいいいいいいいいいいいいいいい\nうううううううううううううううううううううえ\nええええええええええええええ\n尾おおおおおおおおおおおおおおお'

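Even though the file is EUC, decoding as shift_jisx0213 also succeeds, albeit with garbled output...
Here is an example of code that tries the conversion by brute force.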

def conv_charset_file(inputfile,outputfile):
    # Read the whole file as bytes so that nothing is decoded yet
    try:
        with open(inputfile,"rb") as fhinput:
            htmlbytes = fhinput.read()
    except OSError:
        return None,"character encoding conversion failed"

    # Candidates are tried in this order; latin_1 accepts any byte sequence,
    # so 'ascii' after it is never reached
    codelst = ('utf_8','euc_jisx0213','shift_jisx0213','iso2022jp','iso2022_jp_ext','iso2022_kr','big5','big5hkscs','johab','euc_kr','utf_16','iso8859_15','latin_1','ascii')

    code = ""
    for encoding in codelst:
        try:
            htmlstr = htmlbytes.decode(encoding) # decode the bytes into a str with this encoding
            code=encoding
            break
        except UnicodeError:
            pass

    if code == "" :
        return None,"character encoding conversion failed"

    try:
        # Writing the str through a UTF-8 text file performs the conversion to utf-8
        with open(outputfile,"w",encoding='utf-8') as fhwrite:
            fhwrite.write(htmlstr)
    except OSError:
        return None,"character encoding conversion failed"

    return code,None
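Because shift_jisx0213 also accepted the EUC bytes in the session above, the result of a brute-force search depends on the order of the candidates. The sketch below shows this with a short in-memory sample instead of eucsample.txt; the helper name detect_encoding is made up for illustration, and the plain euc_jp and shift_jis codecs stand in for the jisx0213 variants.

```python
def detect_encoding(data, candidates):
    """Return the first candidate that decodes data without error, else None."""
    for encoding in candidates:
        try:
            data.decode(encoding)
        except UnicodeError:
            continue
        return encoding
    return None

# EUC bytes of the same kind as the sample file (0xa4 0xa2 and so on)
sample = "あいうえお".encode("euc_jp")

print(detect_encoding(sample, ("utf_8", "euc_jp", "shift_jis")))  # euc_jp
print(detect_encoding(sample, ("utf_8", "shift_jis", "euc_jp")))  # shift_jis
```

The second call reports shift_jis because every byte of the sample is also a valid half-width katakana code there, which is why the stricter or more likely encodings should come first in the list.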

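Incidentally, since an HTML file declares its encoding inside the file, you might think you could simply pick the encoding up from there, but it is not that easy. To search the contents they must be stored in a variable, and that only works as bytes. A bytes object cannot be matched against a string pattern, so it has to be converted to a string; but converting it with str() instead of decode yields the hexadecimal escape notation with each "\" escaped again, which is an entirely different string.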

>>> fhinput = open("eucsample.txt","rb")
>>> htmlbytes=fhinput.read()
>>> htmlbytes
b'\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\xa4\xa2\n\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\xa4\n\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa6\xa4\xa8\n\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\xa4\xa8\n\xc8\xf8\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa\xa4\xaa'
>>> import re
>>> regcheck=re.compile('content="text/html; *?charset="*(.+?)"',re.I|re.S|re.M)
>>> regcheck.search(htmlbytes)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't use a string pattern on a bytes-like object
>>> regcheck.search(str(htmlbytes))
>>>
>>> str(htmlbytes)
"b'\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\xa4\\xa2\\n\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\xa4\\n\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa6\\xa4\\xa8\\n\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\xa4\\xa8\\n\\xc8\\xf8\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa\\xa4\\xaa'"
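One possible workaround, which is not part of the original post, is to avoid the str conversion altogether: the re module also accepts a pattern compiled from a bytes literal, and such a pattern can search a bytes object directly. The sketch below assumes an ASCII-compatible encoding such as EUC, Shift_JIS, or UTF-8, where the charset declaration itself is plain ASCII. The pattern, the helper name declared_charset, and the sample page are made up for illustration.

```python
import re

# Bytes pattern: matches charset=NAME with optional quotes, case-insensitively
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.I)

def declared_charset(data):
    """Return the charset name declared in the bytes, or None if absent."""
    match = CHARSET_RE.search(data)
    return match.group(1).decode("ascii") if match else None

# Hypothetical EUC-encoded page with a meta declaration
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=EUC-JP"></head>'
        '<body>あいうえお</body></html>').encode("euc_jp")

name = declared_charset(page)
text = page.decode(name)
print(name)
```

A declared charset can still be wrong or missing, so this is better used as the first candidate to try than as a replacement for the brute-force search.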