Auto-align text for a bilingual Rmarkdown -> LaTeX document

Question

Updated. See below.

I'm working on a bilingual report in Arabic and English. With the xelatex engine and the YAML metadata mainfont: Arial and lang: ar, the document renders both languages smoothly (after some hassle).

How to auto-align text in Rmarkdown -> LaTeX document?

The problem is that the language set in the lang variable is right-to-left, so the whole document follows that direction and alignment. Whenever I want to insert a paragraph in English, I have to wrap it as [text]{dir="ltr"}. Is there a way to align paragraphs automatically based on the language they are written in? Is there a LaTeX package, a Pandoc / Rmarkdown trick, or some pure LaTeX for the preamble that does this?

Appendix - reprex (old)

The following document reproduces the problem.

---
output:
  pdf_document:
    latex_engine: xelatex
mainfont: Arial
lang: ar
---

بسم الله الرحمن الرحيم

This text is mis-aligned in rendered document.

[This text is well-aligned in rendered document.]{dir="ltr"}

Update

This update incorporates the Lua filter kindly offered by @tarleb.

The bottom line is:

Before using the Lua filter, Arabic text had the correct direction and alignment, while English text had the wrong direction (rtl) and alignment (right-aligned). See the rendered PDF without the filter here
The filter proposed by @tarleb aims to detect English paragraphs and automatically set their direction to left-to-right.
In the resulting document, all text, whether Arabic or English, runs left-to-right and is aligned to the left border of the page. See the resulting PDF here

I believe this happens because the Lua filter does not match Latin/English characters only: it does not distinguish Arabic from English (non-Latin from Latin) characters, so it sets the direction of every paragraph in the document to left-to-right.

As a result, the effect of the lang: ar attribute is completely reversed by the Lua filter, and we have the same problem, but now with Arabic instead of English.

Additionally, it appears that the alignment of a paragraph follows the direction of its text: if the document text direction is ltr, all paragraphs are aligned to the left border, and vice versa. I'm not sure this is true. My question is: how do we set the text direction and alignment for each paragraph separately? Can we use a Lua filter that detects whether the first character in a paragraph is Latin or non-Latin and sets the direction and alignment of that paragraph accordingly, e.g. ltr and left-aligned if Latin, rtl and right-aligned if non-Latin?

Many thanks in advance.

Updated reprex:

---
output:
  pdf_document:
    latex_engine: xelatex
    pandoc_args: '--lua-filter=ltr-paras.lua'
mainfont: Arial
lang: ar
---

بسم الله الرحمن الرحيم

Thanks to the Lua filter from **@tarleb**, the English text is well-aligned in rendered document without having to wrap it in {dir=ltr}. The text direction is left-to-right and the paragraph itself is aligned to the left border of the page. 

To get the Arabic text direction right, I have to wrap it inside {dir=rtl}:

[بسم الله الرحمن الرحيم]{dir="rtl"}

However, the Arabic paragraph is still aligned wrongfully to the left border of the page.

score 2 · Answer 1 · answered Dec 18 '21 at 20:05


That's a nice job for pandoc Lua filters. We use the filter to check if all characters in a paragraph are digits, Latin letters, punctuation, or whitespace. If that's the case, then we wrap the paragraph in a div with attribute dir='ltr' (one could also use lang='en').

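-- Wrap a paragraph in a div with dir='ltr' if it consists only of
-- alphanumeric characters (%w), punctuation (%p), and whitespace (%s).
function Para (para)
  -- Plain-text content of the paragraph, with formatting removed.
  local str = pandoc.utils.stringify(para)
  if str:match '^[%w%p%s]*$' then
    return pandoc.Div(para, pandoc.Attr('', {}, {dir='ltr'}))
  end
end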

Place the above into a file in your document directory, e.g. ltr-paras.lua, then add this to your YAML:

output:
  pdf_document:
    latex_engine: xelatex
    pandoc_args: '--lua-filter=ltr-paras.lua'
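
The follow-up comments ask for a filter that looks at the first character of each paragraph. What follows is a minimal sketch of that idea and is not part of the original answer. The helper name `first_strong_dir` and the code point ranges are assumptions chosen for illustration (basic Latin letters versus the Hebrew and Arabic Unicode blocks), and the code relies on the `utf8` library of Lua 5.3 or later. Digits, punctuation, and whitespace are skipped, so the first letter decides. Whether the resulting `dir` attribute also fixes the alignment in the rendered PDF has not been verified here.

```lua
-- Hypothetical helper: return 'ltr' or 'rtl' based on the first letter
-- found in a UTF-8 string, or nil if no Latin or Hebrew/Arabic letter occurs.
local function first_strong_dir(str)
  for _, cp in utf8.codes(str) do
    local is_latin = (cp >= 0x41 and cp <= 0x5A)      -- A-Z
                  or (cp >= 0x61 and cp <= 0x7A)      -- a-z
    local is_rtl = (cp >= 0x0590 and cp <= 0x08FF)    -- Hebrew, Arabic, and neighbouring blocks
                or (cp >= 0xFB1D and cp <= 0xFDFF)    -- Hebrew/Arabic presentation forms A
                or (cp >= 0xFE70 and cp <= 0xFEFC)    -- Arabic presentation forms B
    if is_latin then
      return 'ltr'
    elseif is_rtl then
      return 'rtl'
    end
  end
  return nil
end

-- Wrap each paragraph in a div carrying the detected direction.
-- Paragraphs without any recognised letter are left unchanged.
function Para (para)
  local dir = first_strong_dir(pandoc.utils.stringify(para))
  if dir then
    return pandoc.Div(para, pandoc.Attr('', {}, {dir = dir}))
  end
end
```

Saved under a name such as `dir-paras.lua` (a hypothetical file name), it would be passed through `pandoc_args` in the same way as `ltr-paras.lua`. The ranges are a rough approximation: they also include Arabic-Indic digits and Arabic punctuation, which is usually harmless for this purpose.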

answered Dec 18 '21 at 20:05

tarleb


It works! All paragraphs that should be ltr are aligned properly. However, any paragraph that doesn't include at least one Latin/English character isn't aligned rtl. It looks like the Lua filter doesn't distinguish between the two languages. Wrapping `[Arabic text]{dir="rtl"}` puts the words in the correct order, but the paragraph itself is still aligned to the left side of the page. If you could kindly add another level to the Lua filter function that aligns paragraphs to the right side of the page when the paragraph starts with a non-Latin character, it would be awesome. – Hossam Ghorab Dec 19 '21 at 21:33
Can we (== you) kindly add a Lua filter that detects if the 1st character in the paragraph is a Latin or non-Latin character and aligns the paragraph to the left or the right side of page, respectively? [This](https://stackoverflow.com/questions/150033/regular-expression-to-match-non-ascii-characters) may help. – Hossam Ghorab Dec 19 '21 at 21:47
Glad to hear it works! I'm not sure I fully understand what parts are not working yet, could you edit the question and give an example? (Or post a new question and tag it with `pandoc`, which is usually enough for me to see it.) – tarleb Dec 19 '21 at 22:23
I updated the question to discuss the effects of the Lua filter. I also uploaded the rendered documents and supplementary images for your convenience. – Hossam Ghorab Dec 22 '21 at 07:50
@HossamGhorab thanks for the update. The filter seems to give a different result on my system; maybe I should install the exact versions you are using, but I can't do that right now. Two ideas you could try: add `dir: rtl` to the YAML header, and/or in the filter replace `dir='ltr'` with `lang='en'`. – tarleb Dec 22 '21 at 11:19
I'm really grateful for your help here. Using `lang="en"` in the filter just overrides the `lang: ar` I set in the metadata; it doesn't help with text direction or alignment, but now Arabic is [badly rendered](https://drive.google.com/file/d/1fB6-Du7id62ta3Ip8sEv8uzA7Fo3zqE9/view?usp=sharing). Using `dir="rtl"` in the filter just [reverses the problem](https://drive.google.com/file/d/1cOzTaGfvndrxWsdQKusOdnKJvGUC8bSl/view?usp=sharing) we have with `dir="ltr"`. I bet the problem is in the regex used to detect the text that should be wrapped, not in what we wrap it in. Take your time responding, and thanks again – Hossam Ghorab Dec 22 '21 at 19:54
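
The thread does not settle why the filter behaved differently on the two systems. One possibility, which is an assumption and not confirmed by either participant: the Lua reference manual states that character classes such as `%w` depend on the current locale, so under some locales `%w` could match the bytes of UTF-8 encoded Arabic text. Below is a small sketch for checking this and for testing for plain ASCII without relying on the locale. `w_matches_non_ascii` and `is_ascii_only` are hypothetical helper names.

```lua
-- Hypothetical diagnostic: does %w match any byte above 127 under the
-- locale that is active when this code runs?
local function w_matches_non_ascii()
  for b = 128, 255 do
    if string.char(b):match('%w') then
      return true
    end
  end
  return false
end

-- Hypothetical helper: true if the string contains only ASCII bytes.
-- Uses an explicit byte range, so the result does not depend on the locale.
local function is_ascii_only(str)
  return not str:find('[\128-\255]')
end

-- Variant of the answer's filter using the locale-independent test.
function Para (para)
  if is_ascii_only(pandoc.utils.stringify(para)) then
    return pandoc.Div(para, pandoc.Attr('', {}, {dir = 'ltr'}))
  end
end
```

For the diagnostic to be meaningful, `w_matches_non_ascii()` would have to be called from inside the filter as run by pandoc, for example by writing its result to `io.stderr`. If it returns true there, the original pattern would treat Arabic paragraphs as Latin text.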
