1

I would like to select specifics rows when reading a csv with pandas but I also would like to keep the last 5 to 8 columns as a one column because they all represent "genres" in my case.

I have tried to put the flag usecols=[0,1,2,np.arange(5,8)] when using pd.read_csv bubt it does not work.

If I use the flag usecols=[0,1,2,5], I just get one genre in the last column and the others (6, 7, 8) are lost.

I have tried the following but without succeeding:


items = pd.read_csv(filename_item,
                    sep='|',
                    engine='python',
                    encoding='latin-1',
                    usecols=[0,1,2,np.arange(5,23)],
                    names=['movie_id', 'title', 'date','genres'])

My CSV looks like:

2|Scream of Stone (Schrei aus Stein)|(1991)|08-Mar-1996|dd|xx|drama|comedia|fun|romantic

And I would like to get:

2 - Scream of Stone (Schrei aus Stein) - (1991) - 08-Mar-1996 - drama|comedia|fun|romantic

, where what I drew separated by "-" should be a column of the dataframe.

Thank you

2 Answers2

1

You may need to do this in 2-passes. Firstly read the csv in as is:

In[56]:
import pandas as pd
import io
t="""2|Scream of Stone (Schrei aus Stein)|(1991)|08-Mar-1996|dd|xx|drama|comedia|fun|romantic"""
df = pd.read_csv(io.StringIO(t), sep='|', usecols=[0,1,2,3,*np.arange(6,10)], header=None)
df
Out[56]: 
   0                                   1       2            3      6        7  \
0  2  Scream of Stone (Schrei aus Stein)  (1991)  08-Mar-1996  drama  comedia   

     8         9  
0  fun  romantic  

Then we can join all the genres together using apply:

In[57]:
df['genres'] = df.iloc[:,4:].apply('|'.join,axis=1)
df
Out[57]: 
   0                                   1       2            3      6        7  \
0  2  Scream of Stone (Schrei aus Stein)  (1991)  08-Mar-1996  drama  comedia   

     8         9                      genres  
0  fun  romantic  drama|comedia|fun|romantic  
EdChum
  • 376,765
  • 198
  • 813
  • 562
0

My solution is based on a piece of code proposed at: How to pre-process data before pandas.read_csv()

The idea is to write a "file wrapper" class, which can be passed to read_csv.

class InFile(object):
    def __init__(self, infile):
    self.infile = open(infile)

    def __next__(self):
        return self.next()

    def __iter__(self):
        return self

    def read(self, *args, **kwargs):
            return self.__next__()

    def next(self):
        try:
            line = self.infile.readline()
            return re.sub('\|', ',', line, count=6)
        except:
            self.infile.close()
            raise StopIteration

Reformatting of each source line is performed by:

re.sub('\|', ',', line, count=6)

which changes first 6 | chars into commas, so you can read it without sep='|'.

To read your CSV file, run:

df = pd.read_csv(InFile('Films.csv'), usecols=[0, 1, 2, 3, 6],
    names=['movie_id', 'title', 'prod', 'date', 'genres'])
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41