I used the spaCy library to generate dependency parses and save them in CoNLL format using the code below.
import pandas as pd
import spacy

# Load the tweets and make sure every entry is a string
df1 = pd.read_csv('cleantweets', encoding='latin1')
df1['tweet'] = df1['tweet'].astype(str)
tweet_list = df1['tweet'].values.tolist()

nlp = spacy.load("en_core_web_sm")

for tweet in tweet_list:
    doc = nlp(tweet)
    for sent in doc.sents:
        # Blank lines separate one sentence block from the next
        print('\n')
        for i, word in enumerate(sent):
            if word.head is word:
                head_idx = 0
            else:
                head_idx = doc[i].head.i + 1
            # Columns: index, head text, head index, token text, dependency, POS
            print("%d\t%s\t%d\t%s\t%s\t%s" % (
                i+1,
                word.head,
                head_idx,
                word.text,
                word.dep_,
                word.pos_,
            ))
This works, but some sentences in my dataset get split into two by spaCy because they have two ROOTs. As a result, one sentence ends up as two separate blocks (fields) in the CoNLL output.
Example: a random sentence from my dataset is: "teanna trump probably cleaner twitter hoe but"
In CoNLL format it is saved as:
1 trump 2 teanna compound
2 cleaner 4 trump nsubj
3 cleaner 4 probably advmod
4 cleaner 4 cleaner ROOT
5 hoe 6 twitter amod
6 cleaner 4 hoe dobj
1 but 2 but ROOT
Is there a way to save it all in one block (field) instead of two, even though it has two ROOTs, so that 'but' becomes the 7th item in block number 1? That is, it would look like this instead:
1 trump 2 teanna compound
2 cleaner 4 trump nsubj
3 cleaner 4 probably advmod
4 cleaner 4 cleaner ROOT
5 hoe 6 twitter amod
6 cleaner 4 hoe dobj
7 but 2 but ROOT
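To make the desired numbering concrete, here is a minimal sketch, not a definitive solution, of numbering tokens once per tweet instead of once per sentence. It assumes only that each token exposes `i`, `text`, `dep_`, `pos_` and `head`, which spaCy tokens do; the function names `conll_rows` and `format_rows` are hypothetical. Unlike the code above, it detects a root by comparing indices rather than with `is`, so every ROOT gets head index 0, which differs from the head values shown in the output above.

```python
def conll_rows(tokens):
    """Return one list of CoNLL-style rows for all tokens, numbered 1..n.

    `tokens` is any iterable of objects exposing `i`, `text`, `dep_`,
    `pos_` and `head` (itself such an object), e.g. a spaCy Doc.
    """
    rows = []
    for token in tokens:
        # A root is its own head; compare positions, not object identity.
        if token.head.i == token.i:
            head_idx = 0
        else:
            head_idx = token.head.i + 1  # 1-based, relative to the whole tweet
        # Same column order as in the question:
        # index, head text, head index, token text, dependency, POS
        rows.append((
            token.i + 1,
            token.head.text,
            head_idx,
            token.text,
            token.dep_,
            token.pos_,
        ))
    return rows


def format_rows(rows):
    """Join rows into tab-separated lines, one token per line."""
    return "\n".join("\t".join(str(value) for value in row) for row in rows)
```

With the setup above, this would be called as `print(format_rows(conll_rows(nlp(tweet))))` for each tweet. Because the loop never iterates over `doc.sents`, a second ROOT such as 'but' simply continues the numbering as item 7.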