
I have a DataFrame that has been reduced to a single, already sorted column called Filename, which contains filenames that may or may not repeat.

For example

Filename
/dir1/dir2/abc.jpg
/dir1/dir2/abc.jpg
/dir1/dir2/def.jpg
/dir1/dir2/hij.jpg
/dir1/dir2/hij.jpg
/dir1/dir2/hij.jpg
/dir1/dir2/hij.jpg
/dir1/dir2/hij.jpg
/dir1/dir2/klm.jpg
/dir1/dir2/klm.jpg

Using Python 3.6 and pandas, I’m trying to obtain the number of occurrences of each filename. The output should be a DataFrame; an example is shown below.

Filename        Instances
/dir1/dir2/abc.jpg  2
/dir1/dir2/def.jpg  1
/dir1/dir2/hij.jpg  5
/dir1/dir2/klm.jpg  2

I’ve worked out a way to do this by converting to a list and then counting. However, I’d like to keep this as a DataFrame, as it’s going to be fed back into some machine learning, and converting to a list and back again appears to be a poor route to take.

I’ve tried code like

df = df.groupby('FileName')
df.groupby(['FileName']).count()
df = df.groupby('FileName').nunique()

but none appear to work. The DataFrame was originally defined with 15 columns, and the unwanted ones have been deleted with code like

df = df.drop(['Column1Name', 'Column2Name',], axis=1)

The above example deletes only 2 columns (for simplicity), but in real life 14 are listed. I’m wondering whether this, or the fact that I have not defined a new column called Quantity (to store the counts), has anything to do with it.

Any help would be much appreciated

hygull
AndrewW

1 Answer


You can try it like this.

Using pandas.DataFrame.groupby()

>>> import pandas as pd
>>>
>>> s = """/dir1/dir2/abc.jpg
... /dir1/dir2/abc.jpg
... /dir1/dir2/def.jpg
... /dir1/dir2/hij.jpg
... /dir1/dir2/hij.jpg
... /dir1/dir2/hij.jpg
... /dir1/dir2/hij.jpg
... /dir1/dir2/hij.jpg
... /dir1/dir2/klm.jpg
... /dir1/dir2/klm.jpg"""
>>>
>>> filenames = s.split('\n')
>>> filenames
['/dir1/dir2/abc.jpg', '/dir1/dir2/abc.jpg', '/dir1/dir2/def.jpg', '/dir1/dir2/hij.jpg', '/dir1/dir2/hij.jpg', '/dir1/dir2/hij.jpg', '/dir1/dir2/hij.jpg', '/dir1/dir2/hij.jpg', '/dir1/dir2/klm.jpg', '/dir1/dir2/klm.jpg']
>>>
>>> # Build the DataFrame from a dict mapping the column name to its values
>>> d = {"Filename": filenames}
>>> df = pd.DataFrame(d)
>>>
>>> df
             Filename
0  /dir1/dir2/abc.jpg
1  /dir1/dir2/abc.jpg
2  /dir1/dir2/def.jpg
3  /dir1/dir2/hij.jpg
4  /dir1/dir2/hij.jpg
5  /dir1/dir2/hij.jpg
6  /dir1/dir2/hij.jpg
7  /dir1/dir2/hij.jpg
8  /dir1/dir2/klm.jpg
9  /dir1/dir2/klm.jpg
>>>
>>> groups = df.groupby('Filename').groups
>>> groups
{'/dir1/dir2/abc.jpg': Int64Index([0, 1], dtype='int64'), '/dir1/dir2/def.jpg': Int64Index([2], dtype='int64'), '/dir1/dir2/hij.jpg': Int64Index([3, 4, 5, 6, 7], dtype='int64'), '/dir1/dir2/klm.jpg': Int64Index([8, 9], dtype='int64')}
>>>
>>> instances = []
>>> filenames = []
>>>
>>> for group in groups:
...     instances.append(len(groups[group]))
...     filenames.append(group)
...
>>> df = pd.DataFrame({"Filename": filenames, "Instances": instances})
>>> df
             Filename  Instances
0  /dir1/dir2/abc.jpg          2
1  /dir1/dir2/def.jpg          1
2  /dir1/dir2/hij.jpg          5
3  /dir1/dir2/klm.jpg          2
>>>
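
A shorter route to the same table, offered as a sketch rather than the only way to do it: `groupby(...).size()` counts the rows in each group, and `reset_index(name=...)` turns the resulting Series back into a two-column DataFrame. The column name `Instances` is taken from the question, and the sample data is the same as above.

```python
import pandas as pd

filenames = [
    "/dir1/dir2/abc.jpg", "/dir1/dir2/abc.jpg",
    "/dir1/dir2/def.jpg",
    "/dir1/dir2/hij.jpg", "/dir1/dir2/hij.jpg", "/dir1/dir2/hij.jpg",
    "/dir1/dir2/hij.jpg", "/dir1/dir2/hij.jpg",
    "/dir1/dir2/klm.jpg", "/dir1/dir2/klm.jpg",
]
df = pd.DataFrame({"Filename": filenames})

# size() counts the rows in each group and returns a Series indexed by Filename;
# reset_index(name=...) converts it into a DataFrame with a named count column.
counts = df.groupby("Filename").size().reset_index(name="Instances")
print(counts)
```

This avoids the explicit loop over `groups` and keeps everything in pandas, so nothing is converted to a list and back.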
hygull
  • I've tried all of these and they do not work (well, not for me anyway). All I get is the filenames for both df.groupby('FileName').size() and df.groupby('FileName').count() – AndrewW Dec 31 '18 at 10:53
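
A possible explanation for seeing only the filenames, sketched on a tiny made-up frame (the `a.jpg`/`b.jpg` values are illustrative, not from the question). Two things may be at play. `count()` counts non-null values in the columns other than the grouping column, so on a single-column DataFrame there is nothing left to count. Column labels are also case-sensitive, so `'FileName'` does not match a column named `'Filename'`.

```python
import pandas as pd

# Hypothetical miniature of the single-column frame from the question.
df = pd.DataFrame({"Filename": ["a.jpg", "a.jpg", "b.jpg"]})

# count() works on the non-grouping columns. With only the grouping column
# present, the result keeps the filenames as its index but has no columns.
only_index = df.groupby("Filename").count()
print(only_index.shape)

# Column labels are case-sensitive: 'FileName' is not 'Filename'.
try:
    df.groupby("FileName")
    mismatch_raises = False
except KeyError:
    mismatch_raises = True
print(mismatch_raises)

# size() counts rows per group, so it does not need any other column.
sizes = df.groupby("Filename").size()
print(sizes)
```

If the real column is spelled `Filename`, checking `df.columns` before grouping would confirm which label to pass.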