I am using Jupyter Notebook to run some basic natural language processing on multiple text files. I am using two .ipynb files. One, which I am calling the "shell", reads in the files. It calls the second .ipynb (the core program), which runs the NLP.
(As you can tell, I am very much a beginner at this. I recognize that Jupyter Notebook is not ideal for this, but it is the current setup I'm using.)
The core function ends by returning this:
return {'Cor':numCor, 'Sub':numSub, 'Ins':numIns, 'Del':numDel}
I have ten txt files I want to run the core NLP program on, and I want to end up with a dataframe with columns: 1) Filename (extracted from the name of the txt file), 2) Cor, 3) Sub, 4) Ins, and 5) Del. The integer results will populate the rows.
Each time I run the core:
z=wer(y,x)
it produces this:
{'Cor': 8, 'Sub': 0, 'Ins': 0, 'Del': 52}
But when it is displayed as a dataframe, it comes out in this form (one column, with the keys as the index):
       0
Cor    8
Sub    0
Ins    0
Del   52
I need to transpose it, so I did this:
df2=pd.Series(z).to_frame()
df2.reset_index()  # the result is not assigned, so this line has no effect on df2
df = df2.T
Which produces this:
   Cor  Sub  Ins  Del
0    8    0    0   52
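For reference, a one-row frame with the keys as columns can apparently also be built straight from the dict, without going through a Series and transposing. A minimal sketch, assuming `z` is the plain dict shown above (the literal values are just the example output copied in):

```python
import pandas as pd

# Example output of wer(y, x), copied from above
z = {'Cor': 8, 'Sub': 0, 'Ins': 0, 'Del': 52}

# A list containing one dict becomes a frame with one row;
# the dict keys become the column names.
row = pd.DataFrame([z])
print(row)
```

Wrapping the dict in a list is what makes pandas treat it as a single record rather than as column data.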
So far so good (I think). I want to use this sort of command to append the results in a loop, where it adds a row for each of the 10 text files:
orf += [{'Cor': df.Cor, 'Sub': df.Sub, 'Ins': df.Ins}]
'orf' is capturing from the dataframe, and I think that is part of my problem. Here are the results from the first two text files -- when it appends from the dataframe it's also taking the metadata (not sure that's the correct term) such as data type:
[{'filename': '/Users/jeannehsinclair/COVFEFE/miscues_ORF/anton/716_Anton_test.txt',
'Cor': 0 52
Name: Cor, dtype: int64,
'Sub': 0 3
Name: Sub, dtype: int64,
'Ins': 0 0
Name: Ins, dtype: int64,
'Del': 0 5
Name: Del, dtype: int64},
{'filename': '/Users/jeannehsinclair/COVFEFE/miscues_ORF/anton/936_Anton.txt',
'Cor': 0 60
Name: Cor, dtype: int64,
'Sub': 0 0
Name: Sub, dtype: int64,
'Ins': 0 0
Name: Ins, dtype: int64,
'Del': 0 0
Name: Del, dtype: int64},
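As far as I can tell, the extra text appears because `df.Cor` is an entire column (a pandas Series, which carries its own index, name, and dtype) rather than a single number. A minimal sketch of the difference, assuming `z` is the dict from above and using a made-up filename `example.txt`:

```python
import pandas as pd

z = {'Cor': 8, 'Sub': 0, 'Ins': 0, 'Del': 52}
df = pd.Series(z).to_frame().T  # the one-row frame built earlier

# Attribute access returns the whole column, not a number.
col = df.Cor
print(type(col).__name__)

# .iloc[0] pulls the single value out of that column.
value = int(df['Cor'].iloc[0])

# Storing plain integers keeps the records clean.
# 'example.txt' is a placeholder filename for illustration.
orf = [{'filename': 'example.txt',
        **{name: int(df[name].iloc[0]) for name in df.columns}}]

# Using the original dict directly gives the same record with less work.
orf_simple = [{'filename': 'example.txt', **z}]

result = pd.DataFrame(orf)
print(result)
```

Since `z` already holds plain integers, appending from the dict instead of from `df` would avoid the problem entirely.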
I want to convert it back to a dataframe. The problem is that when I do, I get this (I only included three variables here for ease of formatting):
Cor Ins Sub
0 0 52 Name: Cor, dtype: int64 0 0 Name: Ins, dtype: int64 0 3 Name: Sub, dtype: int64
1 0 60 Name: Cor, dtype: int64 0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
2 0 60 Name: Cor, dtype: int64 0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
3 0 59 Name: Cor, dtype: int64 0 0 Name: Ins, dtype: int64 0 1 Name: Sub, dtype: int64
4 0 60 Name: Cor, dtype: int64 0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
5 0 59 Name: Cor, dtype: int64 0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
I don't want all the strings that are printed there. I just want the second integer in each cell. For example, for the first file, I just want the cells to hold 52, 5, 0, 3 (Cor, Del, Ins, Sub).
What I am looking for is help with streamlining the appending process. I imagine there is a good way to do this without converting to a dataframe twice.
Ultimately I need a dataframe that looks like this:
   Cor  Sub  Ins  Del  Filename
1    8    0    1   52  File1
2    6    0    0   52  File2
3    2    2    1   52  File3
4    1    3    0   52  File4
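One possible shape for the whole loop, as a sketch only: collect one plain dict per file in a list and build the dataframe once at the end. The `wer` function below is a hypothetical stand-in (simple position-wise matching), not the real core program, and the argument order, file paths, and texts are invented for illustration. In the real setup the texts would be read from the ten .txt files.

```python
from pathlib import Path

import pandas as pd


def wer(reference, hypothesis):
    """Hypothetical stand-in for the core function; returns the same dict shape."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    num_cor = sum(r == h for r, h in zip(ref_words, hyp_words))
    num_sub = min(len(ref_words), len(hyp_words)) - num_cor
    num_ins = max(len(hyp_words) - len(ref_words), 0)
    num_del = max(len(ref_words) - len(hyp_words), 0)
    return {'Cor': num_cor, 'Sub': num_sub, 'Ins': num_ins, 'Del': num_del}


def build_results(reference, transcripts):
    """Run wer on each transcript and return one dataframe row per file."""
    rows = []
    for path, text in transcripts.items():
        counts = wer(reference, text)
        # Path(...).stem drops the directories and the .txt extension.
        rows.append({**counts, 'Filename': Path(path).stem})
    return pd.DataFrame(rows, columns=['Cor', 'Sub', 'Ins', 'Del', 'Filename'])


# Invented example data standing in for the real files.
reference = "the cat sat on the mat"
transcripts = {
    "data/File1.txt": "the cat sat on the mat",
    "data/File2.txt": "the dog sat on",
}

results = build_results(reference, transcripts)
print(results)
```

Building the list first and converting once avoids both the transpose step and the repeated conversion to a dataframe inside the loop.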
Thank you in advance for any advice you could offer!