I am preprocessing data for an NLP task and need to structure the data in the following way:

[tokenized_sentence] tab [tags_corresponding_to_tokens]

I have a text file with thousands of lines in this format, where the two lists are separated by a tab. Here is an example

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']    ['I-ORG', 'O', 'I-MISC', 'O', 'O', 'O', 'I-MISC', 'O', 'O']

and the piece of code I used to get this is

with open('data.txt', 'w') as foo:
    for sentence, sentence_tags in zip(text, tags):
        foo.write(str(sentence.split()) + '\t' + str(list(sentence_tags)) + '\n')

where text is a list of sentences (each sentence is a string) and tags is a list of tag lists (one list per sentence, holding the tag for each word/token in that sentence).

I need to get the string elements in the lists to have double quotes instead of single quotes while maintaining this structure. The expected output should look like this

["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]    ["I-ORG", "O", "I-MISC", "O", "O", "O", "I-MISC", "O", "O"]

I've tried json.dump() and json.dumps() from Python's json module, but I didn't get the expected output; instead, I got the two lists as strings. My best effort was to add the double quotes manually, like this (for the tags)

for sentence_tags in tags:
    for token in sentence_tags:
        tkn = '"%s"' % token
        print(tkn)

which gives the output

"I-ORG"
"O"
"I-MISC"
"O"
"O"
"O"
"I-MISC"
"O"
"O"
"I-PER"
"I-PER"
.
.
.

However, this seems too inefficient. I have seen related questions, but they didn't address this directly.

I'm using Python 3.8

James Z
NewCoada
  • If the words don't contain quotes, you can simply use `replace("'", '"')` – deadshot Sep 21 '20 at 18:39
  • @deadshot I tried this `tags[0][0].replace("'",'"')` and the output did not change. I still get string elements with single quotes. – NewCoada Sep 21 '20 at 20:02

1 Answer

I'm pretty sure there is no way to make str() write the strings in a list with double quotes; Python's default representation of a string uses single quotes. As @deadshot commented, you can either replace the ' with " after you write the whole string to the file, or manually add the double quotes when you write each word. The answer of this post has many different ways to do it, the simplest being f'"{your_string_here}"'. You would need to write each string separately though, as writing a list automatically adds ' around every item, and that would be very spaghetti.
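For reference, the manual approach mentioned above might look roughly like the sketch below. The helper name quote_list and the sample data are made up for illustration, and it assumes no token itself contains a double quote (nothing is escaped).

```python
def quote_list(items):
    """Format items as a list literal with double-quoted strings.

    Hypothetical helper; assumes no item itself contains a double quote.
    """
    return "[" + ", ".join(f'"{item}"' for item in items) + "]"


# Sample data shaped like the question's `text` and `tags` (illustrative only).
text = ["EU rejects German call"]
tags = [["I-ORG", "O", "I-MISC", "O"]]

# Build one tab-separated line per sentence.
lines = [
    quote_list(sentence.split()) + "\t" + quote_list(sentence_tags)
    for sentence, sentence_tags in zip(text, tags)
]
print(lines[0])
```

Each entry in lines could then be written to the file followed by '\n', the same way the question's loop does.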

Just do find and replace ' with " after you write the string to the file.

You can even do it with python:

# after the string is written in 'data.txt'
with open('data.txt', "r") as f:
    text = f.read()

text = text.replace("'", '"')

with open('data.txt', "w") as f:
    f.write(text)

Edit according to OP's comment below

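Do this instead of the above. It should fix most of the problems, because it only replaces quotes that look like part of the list syntax: [' at the start, '] at the end, and ', ' between items. The string ', ' hopefully only appears at the end of one string and the start of the next.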

with open('data.txt', "r") as f:
    text = f.read()

# replace ' at the start of the list
text = text.replace("['", '["')

# replace ' at the end of the list
text = text.replace("']", '"]')

# replace ' at the boundary between two items inside the list
text = text.replace("', '", '", "')

with open('data.txt', "w") as f:
    f.write(text)

(Edit by OP) New edit based on my latest comment

Running this solves the problem I described in the comment and produces the expected output.

with open('data.txt', "r") as f:
    text = f.read()

# replace ' at the start of the list
text = text.replace("['", '["')

# replace ' at the end of the list
text = text.replace("']", '"]')

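# replace ' at the boundary between two items inside the list
text = text.replace("', '", '", "')

# replace ' where only one side of the boundary uses single quotes
# (Python already writes an item containing an apostrophe with double quotes)
text = text.replace("', ", '", ')

text = text.replace(", '", ', "')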

with open('data.txt', "w") as f:
    f.write(text)
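
A possible alternative to patching quotes after the fact is to serialize each list on its own with json.dumps (which the question mentions trying) and join the two results with a tab. This is only a sketch: format_line and the sample data are hypothetical, and it assumes text and tags are shaped as described in the question. Note that json.dumps escapes any double quote inside a token as \" and, by default, escapes non-ASCII characters unless ensure_ascii=False is passed.

```python
import json


def format_line(sentence, sentence_tags):
    """Return '<tokens as JSON list>\\t<tags as JSON list>' for one sentence.

    Hypothetical helper; json.dumps writes strings with double quotes.
    """
    tokens = sentence.split()
    return json.dumps(tokens) + "\t" + json.dumps(list(sentence_tags))


# Sample data shaped like the question's `text` and `tags` (illustrative only).
text = ["EU rejects German call", "Germany 's representative"]
tags = [["I-ORG", "O", "I-MISC", "O"], ["I-LOC", "O", "O"]]

lines = [format_line(s, t) for s, t in zip(text, tags)]
for line in lines:
    print(line)

# To write the file instead of printing:
# with open('data.txt', 'w') as f:
#     f.write("\n".join(lines) + "\n")
```

Because the quotes are produced during serialization rather than by string replacement, apostrophes inside tokens such as 's are left untouched.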
Alex Mandelias
  • This solves my problem, thank you! I realized though that it brings about another problem I did not anticipate, where apostrophes and phrases with quotes in the original text get replaced with `"`. But I will find a way to workaround this. – NewCoada Sep 21 '20 at 20:53
  • Ah my bad, I overlooked that. As long as any string inside the text doesn't contain the pattern `', '` (found at the end of one string and the start of the next) it should work fine. – Alex Mandelias Sep 22 '20 at 09:46
  • Your latest edit solved the apostrophe problem. In some cases though it removed double quotes, for example, `[ "Germany", ""s", "representative", ...]` became `["Germany', "'s", 'representative", ...]`. However, this was not a big problem and I quickly fixed it by adding two lines to your proposed solution. – NewCoada Sep 23 '20 at 20:31