I am preprocessing data for an NLP task and need to structure the data in the following way:
[tokenized_sentence] tab [tags_corresponding_to_tokens]
I have a text file with thousands of lines in this format, where the two lists are separated by a tab. Here is an example:
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'] ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']
and the piece of code I used to produce it is:
with open('data.txt', 'w') as foo:
    for i, j in zip(range(len(text)), range(len(tags))):
        # str() uses repr() on the lists, which is where the single quotes come from
        foo.write(str([item for item in text[i].split()]) + '\t' + str([tag for tag in tags[j]]) + '\n')
where text is a list of sentences (each sentence is a string) and tags is a list of lists, with each inner list holding the tags for the tokens of the corresponding sentence.
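To make the shapes concrete, text and tags look something like this (the second sentence here is just an illustrative stand-in consistent with my data):
text = ['EU rejects German call to boycott British lamb .',
        'Peter Blackburn']
tags = [['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O'],
        ['I-PER', 'I-PER']]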
I need to get the string elements in the lists to have double quotes instead of single quotes while maintaining this structure. The expected output should look like this:
["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."] ["I-ORG", "O", "I-MISC", "O", "O", "O", "I-MISC", "O", "O"]
I've tried using json.dump() and json.dumps() from the json module, but I didn't get the expected output; instead, I get the two lists as strings (I've put a rough reconstruction of that attempt at the end of this post). My best effort was to manually add the double quotes like this (for the tags):
for i in range(len(tags)):
    for token in tags[i]:
        tkn = "\"%s\"" % token
        print(tkn)
which gives the output:
"I-ORG"
"O"
"I-MISC"
"O"
"O"
"O"
"I-MISC"
"O"
"O"
"I-PER"
"I-PER"
...
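The fully manual route I was heading towards would look something like this (just a sketch, using text and tags as above):
with open('data.txt', 'w') as foo:
    for sentence, sentence_tags in zip(text, tags):
        # quote each token/tag by hand and rebuild the bracketed lists
        quoted_tokens = ', '.join('"%s"' % token for token in sentence.split())
        quoted_tags = ', '.join('"%s"' % tag for tag in sentence_tags)
        foo.write('[' + quoted_tokens + ']\t[' + quoted_tags + ']\n')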
However, this manual approach seems inefficient. I have seen these related questions:
- Convert single-quoted string to double-quoted string
- Converting a Text file to JSON format using Python
but they didn't address this directly.
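For reference, here is roughly what my json.dumps attempt looked like (a rough reconstruction; the exact code may have differed). I think I was passing it the str(...) pieces my loop builds:
import json

tokens_repr = str(['EU', 'rejects', 'German', 'call'])
print(json.dumps(tokens_repr))
# prints: "['EU', 'rejects', 'German', 'call']"
# i.e. the whole list comes back as one quoted string,
# not as a double-quoted list of tokens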
I'm using Python 3.8.