Tuesday, 7 November 2017

UTF-8 in Python

I have been working a lot with Python lately, mostly building tools that convert Excel data to JSON and XML. The data has been translations and similar text, so encoding has been a real issue. Generally it works well in Python, but in a Windows environment it is easy to get lost among the different encodings, such as UTF-8 and CP1252. I won't pretend to be an expert on this at all; I have just tried to get things working without having to understand too much of it. So if you have any useful background information about the snippets below, please go ahead and share a comment.

Today this stole a couple of hours from me again, so I am writing it down as a note to self... once and for all!

When using openpyxl, all strings arrive as ordinary Python Unicode strings when reading the data from the workbook (the .xlsx format typically stores its text as UTF-8 XML), so there is no need to do any conversion there or to hunt for obscure encoding settings in Excel. So far all strings, whether Chinese, Swedish or Hungarian, have arrived correctly.
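A small sketch of why the encoding only starts to matter once text leaves Python: the same string becomes different bytes under UTF-8 and CP1252, and CP1252 cannot represent Chinese at all. The sample strings and the helper name encode_or_none are made up for illustration.

```python
# Made-up sample strings; in practice these would come from the workbook.
swedish = "Hallå världen"
chinese = "你好"


def encode_or_none(text, encoding):
    """Return text encoded with the given encoding, or None if it cannot be encoded."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


utf8_sv = encode_or_none(swedish, "utf-8")      # å and ä take two bytes each
cp1252_sv = encode_or_none(swedish, "cp1252")   # å and ä take one byte each
cp1252_zh = encode_or_none(chinese, "cp1252")   # not representable in CP1252

# Reading UTF-8 bytes as if they were CP1252 gives the classic garbled text.
mojibake = utf8_sv.decode("cp1252")
```

This garbling is what tends to show up when a file is written as UTF-8 and then opened by something that assumes the Windows default code page.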

To Terminal

Getting the strings to display correctly when writing them to the terminal was trickier, though. The trick seems to be to always write the text as bytes. For stdout, I had to do this:

import sys
sys.stdout.buffer.write(message.encode('utf-8'))
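The same idea wrapped in a small helper, as a sketch rather than a definitive recipe. The function name write_utf8 and its stream parameter are my own additions; the only assumption is that the stream exposes a .buffer attribute, as sys.stdout normally does.

```python
import sys


def write_utf8(message, stream=None):
    """Write message plus a newline to the stream's byte buffer as UTF-8.

    Returns the number of bytes written.
    """
    if stream is None:
        stream = sys.stdout
    # Flush any pending text output first so the ordering is preserved.
    stream.flush()
    data = (message + "\n").encode("utf-8")
    # Bypass the text layer (and its locale-dependent encoding) entirely.
    stream.buffer.write(data)
    stream.buffer.flush()
    return len(data)
```

Calling write_utf8("Hallå världen") then sends UTF-8 bytes regardless of the encoding Python picked for stdout. Whether the characters display correctly still depends on the terminal interpreting those bytes as UTF-8. On Python 3.7 or later, sys.stdout.reconfigure(encoding='utf-8') may be a simpler alternative.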


To JSON

When generating JSON files, it is important to set ensure_ascii=False. Otherwise the json module escapes every non-ASCII character as a \uXXXX sequence instead of keeping the character itself.

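Then open the file in binary mode and write the encoded bytes. Whatever I do, things seem to go wrong if I write the string as text instead. A likely reason is that open() in text mode without an explicit encoding falls back to the locale's encoding, which on Windows is typically CP1252 rather than UTF-8.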

import json

with open(fileName, mode="wb") as jsonFile:
    jsonFile.write(json.dumps(dict_to_write, ensure_ascii=False).encode('utf-8'))
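A self-contained round trip along the same lines, as a minimal sketch. The function names, the sample dictionary and the temporary file path are made up for illustration.

```python
import json
import os
import tempfile


def write_json_utf8(path, data):
    """Serialize data to JSON and write it to path as UTF-8 bytes."""
    text = json.dumps(data, ensure_ascii=False)
    with open(path, mode="wb") as json_file:
        json_file.write(text.encode("utf-8"))


def read_json_utf8(path):
    """Read a UTF-8 encoded JSON file back into Python objects."""
    with open(path, mode="r", encoding="utf-8") as json_file:
        return json.load(json_file)


# Made-up sample data standing in for the translations.
sample = {"sv": "Hallå världen", "hu": "Helló világ", "zh": "你好，世界"}
path = os.path.join(tempfile.mkdtemp(), "translations.json")

write_json_utf8(path, sample)
round_trip = read_json_utf8(path)

# The difference ensure_ascii makes to the generated text:
escaped = json.dumps({"sv": "å"})                       # default: \u00e5 escape
readable = json.dumps({"sv": "å"}, ensure_ascii=False)  # keeps the character
```

Opening the file with mode="w" and encoding="utf-8" and writing the string directly should give the same bytes. The binary variant just makes the encoding step explicit.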


To XML

Again, create a binary file and write an encoded byte array rather than a decoded string. I have used PyXB to create bindings from the XSD files. PyXB builds on the minidom libraries... I think.

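import os

import pyxb.utils.domutils
from pyxb.namespace.builtin import XMLSchema_instance as xsi

# The lines below are the body of a method on the class that holds the root binding.

# Check for a root element first; calling toDOM() on None would fail anyway.
if self._root is None:
    raise Exception("There is no root element to write to file.")

if self._fileName is None:
    self._fileName = os.path.join(os.getcwd(), 'xml/' + self._root.name + '.xml')

# Build the DOM and declare the xsi namespace with a schemaLocation attribute.
bds = pyxb.utils.domutils.BindingDOMSupport()
bds.declareNamespace(xsi)
dom = self._root.toDOM(bds)
bds.addAttribute(dom.documentElement, xsi.createExpandedName('schemaLocation'), "http://my_target_ns/ns1 ../xsd/my_schema.xsd")
bds.addXMLNSDeclaration(dom.documentElement, xsi)

# With encoding given, toprettyxml() returns bytes, hence the binary mode.
with open(self._fileName, 'wb') as xmlfile:
    xmlfile.write(dom.toprettyxml(indent='  ', encoding='utf-8'))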
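To try the writing step without PyXB, here is a sketch using only xml.dom.minidom from the standard library. The element names, the sample text and the temporary file path are made up. It shows that toprettyxml() returns a string without an encoding argument and bytes with one, which is why the file is opened in binary mode.

```python
import os
import tempfile
from xml.dom import minidom

# Build a tiny document by hand (hypothetical element names).
doc = minidom.Document()
root = doc.createElement("translations")
doc.appendChild(root)
entry = doc.createElement("text")
entry.setAttribute("lang", "sv")
entry.appendChild(doc.createTextNode("Hallå världen"))
root.appendChild(entry)

# Without an encoding the result is a str; with one it is bytes.
as_text = doc.toprettyxml(indent="  ")
as_bytes = doc.toprettyxml(indent="  ", encoding="utf-8")

# Write the bytes untouched, then parse the file again to check the round trip.
xml_path = os.path.join(tempfile.mkdtemp(), "translations.xml")
with open(xml_path, "wb") as xml_file:
    xml_file.write(as_bytes)

parsed = minidom.parse(xml_path)
parsed_text = parsed.getElementsByTagName("text")[0].firstChild.data.strip()
```

Passing the encoding also puts it into the XML declaration at the top of the file, so a parser reading the file later knows how to decode it.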

Good Luck!