I can read a single .ann file into a pandas dataframe as follows:

df = pd.read_csv('something/something.ann', sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
df.head()

But I don't know how to read multiple .ann files into one pandas dataframe. I tried to use concat, but the result was not what I expected.

How can I read many ann files into one pandas dataframe?

Hildee
Hi @Irene and welcome to SO. There are a couple of details you should add to your question so we can answer it: 1) an example of your data, typically the output of `df.head()` for a couple of files (as text, see [how to make a pandas reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples)), 2) the result you get with `concat` (and how you got it), and 3) what you expected to get. – Cimbali Aug 16 '21 at 22:46

1 Answer

It sounds like you need glob to find all the .ann files in a folder, read each one into a dataframe, and collect those dataframes in a list. After that you can join, merge, or concat them as required.

I don't know your exact requirements, but the code below should get you close. As written, the script assumes that the directory you run it from has a subfolder called files, and it reads only the .ann files in that subfolder (it ignores everything else). Each step is commented, so review and change it as required.

import pandas as pd
import glob

path = r'./files' # use your path
all_files = glob.glob(path + "/*.ann")

# create empty list to hold dataframes from files found
dfs = []

# for each file in the path above ending .ann
for file in all_files:
    #open the file
    df = pd.read_csv(file, sep=r'^([^\s]*)\s', engine='python', header=None).drop(0, axis=1)
    #add this new (temp during the looping) frame to the end of the list
    dfs.append(df)

#at this point you have a list of frames with each list item as one .ann file.  Like [annFile1, annFile2, etc.] - just not those names.

#handle a list that is empty
if len(dfs) == 0:
    print('No files found.')
    #create a dummy frame
    df = pd.DataFrame()
#or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
#or concatenate more than one frame together
else: #modify this join as required.
    #ignore_index=True already renumbers the rows 0..n-1, so no reset_index is needed
    df = pd.concat(dfs, ignore_index=True)

#check what you've got
print(df.head())
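
If you also want to know which file each row came from, the same loop can be wrapped in a small function. This is only a sketch under a few assumptions: `read_ann_folder` and the `source_file` column are names I made up for illustration, the separator is the one from your question, and every .ann file is assumed to be non-empty (pandas raises an error when asked to parse an empty file).

```python
from pathlib import Path

import pandas as pd

# Same separator as in the question, written as a raw string so that
# "\s" is not interpreted as a string escape.
ANN_SEP = r'^([^\s]*)\s'


def read_ann_folder(folder):
    """Read every .ann file in `folder` into one dataframe (hypothetical helper)."""
    frames = []
    # sorted() keeps the file order stable between runs
    for ann_path in sorted(Path(folder).glob('*.ann')):
        frame = pd.read_csv(ann_path, sep=ANN_SEP, engine='python', header=None).drop(0, axis=1)
        # 'source_file' is an assumed column name; it records where each row came from
        frame['source_file'] = ann_path.name
        frames.append(frame)
    # pd.concat raises ValueError on an empty list, so return an empty frame instead
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Calling `read_ann_folder('./files')` should then give one frame with a continuous index, and `df['source_file']` tells you which rows belong to which file if the combined result looks wrong.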
MDR