Guido Camargo España


Fine-tuning Pandoc conversion (.tex -> .docx) using filters to number equations, tables, and figures

by Guido España

June 14, 2019

Motivation

In my previous post, I mentioned how I use pandoc to convert .tex files to .docx files. I prefer LaTeX to Microsoft Word for multiple reasons, one of them being that it’s better for reproducible research. However, some people prefer Microsoft Word, so it is important to be able to switch back and forth. As I mentioned in my previous post, this can be achieved with pandoc. What I did not mention is that the default pandoc conversion produces unsatisfying results if one wants to share a scientific manuscript with people who mainly use Microsoft Word. Editing the template definitely helps with the presentation of the document (you can find an example of my template in this GitHub repository). Using the flag -M reference-section-title=References also helps with the structure of the manuscript, since by default pandoc does not include a section title for the References section. These changes greatly improve the resulting .docx document, but the main issue with my workflow was that equations, tables, and figures were not being numbered by pandoc. In this post, I briefly describe how to solve this issue with pandoc filters.

Pandoc filters

Pandoc filters are programs that read pandoc’s intermediate representation of a document, modify it, and hand it back to pandoc, which then continues the conversion. For instance, a pandoc filter can be used to capitalize all the words in a document before converting the document to another format. Pandoc filters can be written in Python or Lua, among other languages; I prefer Python because I don’t know Lua. To write pandoc filters in Python, there are two modules: pandocfilters and panflute.
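To make the capitalization example concrete, here is a minimal sketch that uses only the Python standard library rather than pandocfilters or panflute. It assumes the document arrives in pandoc's JSON representation, where each element is an object with a "t" (type) key and an optional "c" (content) key. It also reads "capitalize" as upper-casing every word. The function names (walk, upper_str, filter_json) are made up for this illustration.

```python
import json


def walk(node, action):
    """Recursively apply `action` to every AST element (a dict with a "t" key)."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            node = action(node)
        return {key: walk(value, action) for key, value in node.items()}
    return node


def upper_str(elem):
    """Upper-case the text of Str elements; leave every other element alone."""
    if elem["t"] == "Str":
        return {"t": "Str", "c": elem["c"].upper()}
    return elem


def filter_json(text):
    """Take a pandoc JSON document as a string and return the modified JSON."""
    doc = json.loads(text)
    doc["blocks"] = walk(doc["blocks"], upper_str)
    return json.dumps(doc)


# A tiny hand-written document standing in for what pandoc would send.
example = {
    "pandoc-api-version": [1, 23],
    "meta": {},
    "blocks": [
        {"t": "Para", "c": [
            {"t": "Str", "c": "hello"},
            {"t": "Space"},
            {"t": "Str", "c": "world"},
        ]}
    ],
}
result = json.loads(filter_json(json.dumps(example)))
```

To use this as an actual filter, the script would read the JSON from standard input and write the result to standard output, for example with sys.stdout.write(filter_json(sys.stdin.read())), and be passed to pandoc with the --filter option. pandocfilters and panflute wrap this plumbing so that only the equivalent of upper_str has to be written.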

Before moving on, I need to say that I looked, looked, and looked for ways to keep the numbering when converting from .tex files to .docx files. I found a couple of suggestions about converting from .tex to .md and then from .md to .docx. I tried this, but it didn’t work the way I wanted it to. So, I decided to customize a pandoc filter for this task.

I actually modified an existing pandoc filter, pandoc-eqnos, which is used to number equations when converting from .md to other formats. This filter seems to work just fine, but it doesn’t have an option to use .tex files as input. For figures and tables, there are similar packages (pandoc-fignos and pandoc-tablenos), but neither of them works with .tex files as input. So, using pandoc-eqnos as a template, I wrote a simple filter for numbering equations, figures, and tables when converting from .tex to .docx. I did not have to change much in the original pandoc-eqnos, only the way it recognizes the label of an equation (\label) and the way it replaces the references to those equations (\ref). Other than that, the filter is pretty straightforward. First, it goes through the document looking for equations (or figures, or tables) and saves each label in a dictionary, where the value is the position at which the filter found that equation in the document. Then, the routine walks the document one more time to find references (\ref) whose labels exist in the dictionary and substitutes them with the corresponding number. I posted these filters on GitHub for myself and in case anyone is interested: pandoc_tex_nos.
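The two passes described above can be sketched in a few lines of Python. This is not the code from pandoc_tex_nos: it is a simplified illustration that works on raw LaTeX text with regular expressions instead of on pandoc's document tree, and the function names (collect_labels, replace_refs) and the prefix argument are assumptions made for the example.

```python
import re

# Patterns for \label{...} and \ref{...}; \eqref and \pageref are not matched.
LABEL_RE = re.compile(r"\\label\{([^}]*)\}")
REF_RE = re.compile(r"\\ref\{([^}]*)\}")


def collect_labels(tex, prefix=""):
    """First pass: map each label to the order in which it appears."""
    numbers = {}
    for match in LABEL_RE.finditer(tex):
        label = match.group(1)
        if label.startswith(prefix) and label not in numbers:
            numbers[label] = len(numbers) + 1
    return numbers


def replace_refs(tex, numbers):
    """Second pass: replace \\ref{label} with its number when the label is known."""
    def substitute(match):
        label = match.group(1)
        # Unknown labels are left untouched.
        return str(numbers[label]) if label in numbers else match.group(0)
    return REF_RE.sub(substitute, tex)


tex = (
    r"See Eq. \ref{eq:b} and Eq. \ref{eq:a}, and Fig. \ref{fig:x}. "
    r"\begin{equation}\label{eq:a} x = 1 \end{equation} "
    r"\begin{equation}\label{eq:b} y = 2 \end{equation}"
)
numbers = collect_labels(tex, prefix="eq:")
numbered = replace_refs(tex, numbers)
```

Here numbers maps eq:a to 1 and eq:b to 2, and the reference to fig:x is left as it is because only labels starting with eq: were collected. Two passes are needed because a reference can appear before the equation it points to, as in the first sentence of the example.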

I haven’t tested these filters on many document structures, so there are still some issues. For instance, the labels need to be inside the caption (\caption{\label{fig:test} This is figure test}). There are probably many other issues that I haven’t found, so please let me know if you find any other bugs.