Exporting Tokenized SpaCy result into Excel or SQL tables

Question

I'm using SpaCy with Pandas to tokenise a sentence, tag each token with its part of speech (POS), and export the result to Excel. The code is as follows:

import spacy
import xlsxwriter
import pandas as pd
nlp = spacy.load('en_core_web_sm')
text ="""He is a good boy."""
doc = nlp(text)
for token in doc:
    x=[token.text, token.lemma_, token.pos_, token.tag_,token.dep_,token.shape_, token.is_alpha, token.is_stop]
    print(x)

When I print(x), I get the following:

['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False]
['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True]
['a', 'a', 'DET', 'DT', 'det', 'x', True, True]
['good', 'good', 'ADJ', 'JJ', 'amod', 'xxxx', True, False]
['boy', 'boy', 'NOUN', 'NN', 'attr', 'xxx', True, False]
['.', '.', 'PUNCT', '.', 'punct', '.', False, False]

I then created a DataFrame inside the token loop, as follows:

for token in doc:
    x=[token.text, token.lemma_, token.pos_, token.tag_,token.dep_,token.shape_, token.is_alpha, token.is_stop]
    df=pd.DataFrame(x)
    print(df)

Now I start to get the following format:

  0
0      He
1  -PRON-
2    PRON
3     PRP
4   nsubj
5      Xx
6    True
7   False   
........
........

However, when I try to export the output (df) to Excel using Pandas with the following code, the sheet only shows the last iteration of x in the column:

df=pd.DataFrame(x)
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
df.to_excel(writer,sheet_name='Sheet1')

Output (in Excel Sheet):

0
0      .
1      .
2  PUNCT
3      .
4  punct
5      .
6  False
7  False

How can I get all the iterations side by side, one new column per token, as follows?

0      He     is  ….
1  -PRON-     be  ….
2    PRON   VERB  ….
3     PRP    VBZ  ….
4   nsubj   ROOT  ….
5      Xx     xx  ….
6    True   True  ….
7   False   True  ….

@EvgenyPogrebnyak, how? Can you please tell me how to change it using df.append? – Fasa, Jun 16 '18 at 10:17
Try going through https://pandas.pydata.org/pandas-docs/stable/10min.html#object-creation; there is nothing especially difficult there. Write back if you are not successful. I had a problem installing `SpaCy` because I lack a compiler, so I cannot give you quick ready-made code. – Evgeny, Jun 16 '18 at 10:53

Evgeny · Accepted Answer · 2018-06-19T04:51:30.453

1

Some shorter code:

import spacy
import pandas as pd
nlp = spacy.load('en_core_web_sm')
text ="""He is a good boy."""
param = [[token.text, token.lemma_, token.pos_, 
          token.tag_,token.dep_,token.shape_, 
          token.is_alpha, token.is_stop] for token in nlp(text)]
df=pd.DataFrame(param)
headers = ['text', 'lemma', 'pos', 'tag', 'dep', 
           'shape', 'is_alpha', 'is_stop']
df.columns = headers
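
Once the rows are collected into a single DataFrame like this, persisting it is a separate step. Below is a minimal sketch of the SQL side of the question, assuming an in-memory SQLite database and a hypothetical table name `tokens`; the rows are copied from the question's printed output so the snippet runs without SpaCy installed.

```python
import sqlite3

import pandas as pd

# Rows copied from the question's printed output (stand-in for the spaCy loop).
rows = [
    ['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False],
    ['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True],
    ['a', 'a', 'DET', 'DT', 'det', 'x', True, True],
    ['good', 'good', 'ADJ', 'JJ', 'amod', 'xxxx', True, False],
    ['boy', 'boy', 'NOUN', 'NN', 'attr', 'xxx', True, False],
    ['.', '.', 'PUNCT', '.', 'punct', '.', False, False],
]
headers = ['text', 'lemma', 'pos', 'tag', 'dep',
           'shape', 'is_alpha', 'is_stop']

# One row per token, one column per attribute.
df = pd.DataFrame(rows, columns=headers)

# In-memory database for illustration; pass a file path to keep the data.
conn = sqlite3.connect(':memory:')

# 'tokens' is a hypothetical table name; index=False skips the 0..5 row index.
df.to_sql('tokens', conn, index=False)

# Read two columns back to confirm the table was written.
restored = pd.read_sql_query('SELECT text, pos FROM tokens', conn)
conn.close()

print(restored)
```

For Excel, `df.to_excel('tokens.xlsx', index=False)` (file name chosen here for illustration) should work the same way on the full DataFrame, provided an Excel engine such as `xlsxwriter` or `openpyxl` is installed.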

edited Jun 19 '18 at 04:51

answered Jun 18 '18 at 21:13

Evgeny


Thanks. Just got `ValueError: Length mismatch: Expected axis has 6 elements, new values have 8 elements`. So I tried `df` WITHOUT `.transpose()`, which works perfectly and fits the purpose. (As you mentioned, it is better to have text snippets in rows than in columns.) – Fasa Jun 19 '18 at 02:15
Edited to remove `.transpose()`; it was there in error. – Evgeny Jun 19 '18 at 04:52

score 0 · Answer 2 · answered Jun 17 '18 at 21:54

In case you don't have your version yet:

import pandas as pd

rows =[
    ['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False],
    ['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True],
    ['a', 'a', 'DET', 'DT', 'det', 'x', True, True],
    ['good', 'good', 'ADJ', 'JJ', 'amod', 'xxxx', True, False],
    ['boy', 'boy', 'NOUN', 'NN', 'attr', 'xxx', True, False],
    ['.', '.', 'PUNCT', '.', 'punct', '.', False, False],
    ]

headers = ['text', 'lemma', 'pos', 'tag', 'dep', 
           'shape', 'is_alpha', 'is_stop']

# example 1: list of dicts, one dict per row
# following https://stackoverflow.com/a/28058264/1758363
d = []
for row in rows:
    dict_ = {k:v for k, v in zip(headers, row)}
    d.append(dict_)
df = pd.DataFrame(d)[headers] 

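# example 2: appending dicts
# note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0,
# so this loop needs pandas < 2.0
df2 = pd.DataFrame(columns=headers)
for row in rows:
    dict_ = {k:v for k, v in zip(headers, row)}
    df2 = df2.append(dict_, ignore_index=True)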

# example 3: list of dicts created with the map() function
def as_dict(row):
    return {k:v for k, v in zip(headers, row)}

df3 = pd.DataFrame(list(map(as_dict, rows)))[headers]     

def is_equal(df_a, df_b):
    """Substitute for pd.DataFrame.equals()"""
    return (df_a == df_b).all().all()

assert is_equal(df, df2)
assert is_equal(df2, df3)
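
On pandas 2.0 and later, `DataFrame.append` no longer exists, so the row-by-row loop in example 2 needs another form. A rough equivalent, assuming `pd.concat` is an acceptable substitute for appending, is to build one single-row frame per token and concatenate them once at the end (only two rows are used here to keep it short):

```python
import pandas as pd

# Two rows from the question's output are enough to show the idea.
rows = [
    ['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False],
    ['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True],
]
headers = ['text', 'lemma', 'pos', 'tag', 'dep',
           'shape', 'is_alpha', 'is_stop']

# One single-row DataFrame per token, built from a dict as in example 2.
frames = [pd.DataFrame([dict(zip(headers, row))]) for row in rows]

# A single concat replaces the repeated df.append(...) calls.
df2 = pd.concat(frames, ignore_index=True)[headers]

print(df2)
```

Concatenating once also avoids copying the growing DataFrame on every iteration, which is what the append loop did.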

answered Jun 17 '18 at 21:54

Evgeny


Thanks @Evgeny. I will try this code to see how it fits this scenario. BTW, after working on it, I ended up using an alternative library (the csv library) to write the output as a dictionary. It works, but the output is CSV rather than Excel, so I need a second conversion step from CSV to Excel. I do feel the Pandas DataFrame is somehow not fully compatible with what SpaCy provides, since doing it via CSV takes only a few lines. – Fasa Jun 18 '18 at 13:55
You can use `csv` if you are more comfortable with it; it just did not appear from your question that you wanted to persist/save data, it was rather about the use of `pandas`. I hope you are not fully dissatisfied with the transformation idea: SpaCy does not have any special type of output, just namedtuples I guess. In any case, avoid Excel at all costs as a format for saving intermediate data, and make sure words are in rows, not columns. – Evgeny Jun 18 '18 at 14:22
@Evgeny, just some quick feedback on your code. All three of your methods work perfectly when the data (rows) is a static array. Nevertheless, once the rows come from SpaCy tokens, various errors pop up. That is why I feel the way SpaCy's nlp object handles the loop does not fit properly into a Pandas DataFrame. Primitive libraries like csv (csv.DictWriter) do the job within a few lines, and then Pandas can take it further to Excel. Thanks anyway for the input, and please share your thoughts if you have a different opinion. – Fasa Jun 18 '18 at 14:22
Glad you have it sorted out for your project! How do you compile SpaCy on Windows to install it? Is there a command to test the compiler? I installed Visual Studio, but `pip install spacy` stops with `error: command 'cl.exe' failed: No such file or directory`. – Evgeny Jun 18 '18 at 14:26
Try it in a Conda environment (Anaconda). I think SpaCy has its own requirements, which might be better installed in isolation (from the root Python). – Fasa Jun 18 '18 at 14:29
Worked in a separate environment as you mentioned! Will replicate your code and post. – Evgeny Jun 18 '18 at 14:41
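
For reference, the `csv.DictWriter` route mentioned in the comments might look roughly like the sketch below. The asker's actual code was not posted, so this is only an assumption about its shape; it writes to an in-memory buffer instead of a file, and uses two rows from the question's output in place of live SpaCy tokens.

```python
import csv
import io

headers = ['text', 'lemma', 'pos', 'tag', 'dep',
           'shape', 'is_alpha', 'is_stop']

# Stand-in for the values read off each spaCy token in the loop.
rows = [
    ['He', '-PRON-', 'PRON', 'PRP', 'nsubj', 'Xx', True, False],
    ['is', 'be', 'VERB', 'VBZ', 'ROOT', 'xx', True, True],
]

# In-memory buffer for illustration; a real script would use
# open('tokens.csv', 'w', newline='') with a file name of its choosing.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()

# One dict per token, keyed by column name.
for row in rows:
    writer.writerow(dict(zip(headers, row)))

csv_text = buffer.getvalue()
print(csv_text)
```

Either way, the key point is the same as in the answers: each token becomes one row, and the file is written once after all rows exist.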
