I have two UTF-8 text files:

repr(file1.txt):

\nSTATEMENT OF WORK\n\n\nSTATEMENT OF WORK NO. 7\nEffective Date: February 15, 2015

repr(file2.txt):

RENEWAL/AMENDMENT\n\nTHIS agreement is entered as of July 25, 2014. b

Their respective brat annotation files contain the following annotations:

file1.ann:

T1  date 61 78  February 15, 2015

file2.ann:

T1  date 53 67   July 25, 2014.

But when I use Python to retrieve the characters from the .txt files using the above offsets, I get:

file1.read()[61:78]:

February 15, 2015

file2.read()[53:67]:

ly 25, 2014. b

Why does my offsetting work in the first case but not the second case?

GuSuku

1 Answer

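The problem is that line endings are not handled the same way on Windows and on Unix/Mac: Windows ends each line with a carriage return followed by a newline ('\r\n'), whereas Unix/Mac uses a newline alone ('\n'). If the .txt files were created or modified on a Windows system, they will contain some '\r\n' sequences, but brat (which was not designed for Windows) counts only the '\n' characters. Each line break before an annotation therefore makes the two counts differ by one character.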
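The shift can be reproduced with the file2.txt text from the question. This is a minimal sketch that assumes the file was saved with '\r\n' line endings (the question does not show the raw bytes) and that the variable names are illustrative only. It compares the raw text with what Python's default text mode returns after translating '\r\n' to '\n':

```python
# Assumed raw content of file2.txt, with Windows (CRLF) line endings.
crlf_text = "RENEWAL/AMENDMENT\r\n\r\nTHIS agreement is entered as of July 25, 2014. b"

# What open(...).read() yields by default: each "\r\n" becomes "\n".
lf_text = crlf_text.replace("\r\n", "\n")

start, end = 53, 67  # offsets taken from file2.ann

print(repr(crlf_text[start:end]))  # slice of the text that still contains '\r'
print(repr(lf_text[start:end]))    # slice of the newline-translated text
```

Two line breaks precede the date, so the two slices are two characters apart, which matches the 'ly 25, 2014. b' result in the question.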

In Python, you can convert from the Windows count to the brat count with a dict. Open the file with the argument newline='' so that no newline translation takes place and every '\r' is kept in the W_Content variable:

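with open('file.txt', newline='', encoding='utf-8') as f:
    W_Content = f.read()  # newline='' keeps every '\r' in the string

counter = -1
UfromW_dic = {}  # offset counting '\r' -> offset not counting '\r'
for n, char in enumerate(W_Content):
    if char != '\r':
        counter += 1
        UfromW_dic[n] = counter
# An exclusive end offset may equal len(W_Content); map that position too.
UfromW_dic[len(W_Content)] = counter + 1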

After that, the initial span [x, y] will be found at [UfromW_dic[x], UfromW_dic[y]].
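If only a few offsets need converting, the same mapping can be computed on demand instead of building a dict. The sketch below is an alternative, not part of the original approach; the helper name crlf_to_lf_offset is hypothetical, and it assumes every '\r' in the file belongs to a '\r\n' pair:

```python
def crlf_to_lf_offset(raw_text: str, offset: int) -> int:
    """Map an offset into text containing '\r' to the same text without '\r'."""
    # Subtract the number of '\r' characters that occur before the offset.
    return offset - raw_text.count("\r", 0, offset)


# Assumed raw content of file2.txt from the question (CRLF line endings).
raw = "RENEWAL/AMENDMENT\r\n\r\nTHIS agreement is entered as of July 25, 2014. b"
clean = raw.replace("\r", "")  # what the text looks like with '\n' only

start, end = (crlf_to_lf_offset(raw, o) for o in (53, 67))
print(start, end, repr(clean[start:end]))
```

For the span 53 to 67 this subtracts the two carriage returns that precede the date, so the slice of the '\n'-only text lines up with the annotated string again.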