I am using Jupyter Notebook to run some basic natural language processing on multiple text files. I am using two .ipynb files. One, which I am calling the "shell", reads in the files. It calls the second .ipynb (the core program), which runs the NLP.

(As you can tell, I am very much a beginner at this. I recognize that Jupyter Notebook is not ideal for this, but it is the current setup I'm using.)

The core file results in this:

return {'Cor':numCor, 'Sub':numSub, 'Ins':numIns, 'Del':numDel}

I have ten txt files I want to run the core NLP program on, and I want to end up with a dataframe with columns: 1) Filename (extracted from the name of the txt file), 2) Cor, 3) Sub, 4) Ins, and 5) Del. The integer results will populate the rows.

Each time I run the core:

z=wer(y,x)

it produces this:

{'Cor': 8, 'Sub': 0, 'Ins': 0, 'Del': 52}

But when I put it into a dataframe, it comes out in this form:

    0
Cor 8
Sub 0
Ins 0
Del 52

I need to try to transpose it, so I did this:

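import pandas as pd
df2 = pd.Series(z).to_frame()  # one column named 0, indexed by Cor/Sub/Ins/Del
df2.reset_index()              # has no effect here: the result is not assigned
df = df2.T                     # transpose to one row with columns Cor/Sub/Ins/Del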

Which produces this:

    Cor Sub Ins Del
0   8   0   0   52
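
For comparison, the same one-row layout can be produced without the Series/transpose round trip. This is only a minimal sketch: `z` is hard-coded here as a stand-in for the dict returned by `wer(y, x)`, using the values shown above.

```python
import pandas as pd

# Stand-in for the dict returned by wer(y, x); values copied from the example above.
z = {'Cor': 8, 'Sub': 0, 'Ins': 0, 'Del': 52}

# A list holding one dict becomes one row, with the dict keys as column names.
df = pd.DataFrame([z])
print(df)
```

Passing a list of dicts to `pd.DataFrame` treats each dict as a row, which is why no transpose is needed.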

So far so good (I think). I want to use this sort of command to append the results in a loop, where it adds a row for each of the 10 text files:

 orf += [{'Cor': df.Cor, 'Sub': df.Sub, 'Ins': df.Ins}]

'orf' is capturing from the dataframe, and I think that is part of my problem. Here are the results from the first two text files -- when it appends from the dataframe it's also taking the metadata (not sure that's the correct term) such as data type:

[{'filename': '/Users/jeannehsinclair/COVFEFE/miscues_ORF/anton/716_Anton_test.txt',
  'Cor': 0    52
  Name: Cor, dtype: int64,
  'Sub': 0    3
  Name: Sub, dtype: int64,
  'Ins': 0    0
  Name: Ins, dtype: int64,
  'Del': 0    5
  Name: Del, dtype: int64},
 {'filename': '/Users/jeannehsinclair/COVFEFE/miscues_ORF/anton/936_Anton.txt',
  'Cor': 0    60
  Name: Cor, dtype: int64,
  'Sub': 0    0
  Name: Sub, dtype: int64,
  'Ins': 0    0
  Name: Ins, dtype: int64,
  'Del': 0    0
  Name: Del, dtype: int64},
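
The `Name: Cor, dtype: int64` text in that output is most likely there because `df.Cor` is a whole pandas Series (one column of the one-row dataframe), not a single integer. A small sketch of the difference, assuming `z` holds the counts from the first file above:

```python
import pandas as pd

# Stand-in for the wer() result of the first file shown above.
z = {'Cor': 52, 'Sub': 3, 'Ins': 0, 'Del': 5}
df = pd.Series(z).to_frame().T

col = df.Cor                 # a pandas Series (index 0, name 'Cor'), not an int
value = int(df.Cor.iloc[0])  # the plain integer stored in that Series

# Building the row from plain integers keeps the Series metadata out of the list.
row = {key: int(df[key].iloc[0]) for key in ['Cor', 'Sub', 'Ins', 'Del']}
print(type(col).__name__, value, row)
```

Since `z` already holds plain integers, appending `z` itself (rather than columns of `df`) would avoid the extraction step entirely.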

I want to convert it back to a dataframe. The problem is that when I convert to a dataframe, I get this (only included 3 variables here for ease of formatting):

    Cor                             Ins                         Sub
0   0 52 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 3 Name: Sub, dtype: int64
1   0 60 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
2   0 60 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
3   0 59 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 1 Name: Sub, dtype: int64
4   0 60 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
5   0 59 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64

I don't want all the strings that are printed there. I just want the second integer in each cell. For example, for the first row, I just want the cells to hold 52 (Cor), 3 (Sub), 0 (Ins), and 5 (Del).

What I am looking for is help streamlining the appending process. I imagine there is a good way to do this without converting to a dataframe twice.

Ultimately I need a dataframe that looks like this

    Cor Sub Ins Del Filename
1   8   0   1   52  File1
2   6   0   0   52  File2
3   2   2   1   52  File3
4   1   3   0   52  File4
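
One way the target table might be assembled is to collect one plain dict per file and build the dataframe once at the end. This is a sketch under stated assumptions: the `wer` function below is a hypothetical stand-in that returns fixed counts, the paths are placeholders, and reading each file into `y` and `x` is left out.

```python
from pathlib import Path

import pandas as pd


# Hypothetical stand-in for the real wer(); returns fixed counts for illustration.
def wer(reference, hypothesis):
    return {'Cor': 8, 'Sub': 0, 'Ins': 0, 'Del': 52}


# Placeholder paths; the real notebook would list the ten txt files here.
paths = ['miscues/716_Anton_test.txt', 'miscues/936_Anton.txt']

rows = []
for path in paths:
    # In the real notebook, y and x would be read from the file at this point.
    z = wer('reference text', 'transcript text')
    # Copy the counts and add the file name without directory or extension.
    rows.append({**z, 'Filename': Path(path).stem})

# Build the dataframe once, from the list of plain dicts.
result = pd.DataFrame(rows, columns=['Cor', 'Sub', 'Ins', 'Del', 'Filename'])
print(result)
```

Because each row is an ordinary dict of integers plus a string, no Series metadata ends up in the cells and only one conversion to a dataframe is needed.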

Thank you in advance for any advice you could offer!

  • Hello Jeanne, welcome to StackOverflow! Please take a look at the [help]. Generally, when asking for debugging help, you are required to provide a [mcve]. I know pandas examples can sometimes be a bit hard to create good reproducible examples of, so check out this [question for some tips](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). In general, try to provide example data and code as *formatted text*. Do not provide images of data/code. To format code, copy-and-paste from the source, highlight in the stack overflow editor and press ctrl-K – juanpa.arrivillaga Oct 18 '18 at 03:11
  • Thank you, I appreciate the feedback! I will work on posting an example. – Jeanne Sinclair Oct 18 '18 at 13:17
