
I have a newbie question!

I have a pandas DataFrame whose source is a comma-delimited CSV file. The file doesn't have a header.

I need to know, for each row, how many columns it has, and then I need to remove the rows whose length is greater than some value, for example 5.

What I have:

1,2,3,4,5,6

1,2,3

9,6,8

1,2,3,5,6

Desired Output:

1,2,3

9,6,8

I've searched some questions and answers like :

Delete rows from a pandas DataFrame based on a conditional expression involving len(string) giving KeyError

Select row using the length of list in pandas cell

How to remove a row from pandas dataframe based on the length of the column values?

But from what I understood, those solutions always use some column name to do the filtering, and because the file has no header and the number of columns differs from row to row, I don't understand how it can be done.

Can you please help?

Thanks in advance!

Kirtash7

2 Answers


I see three possibilities to do that.

  1. Read the file twice (first to count the fields, then to read it into pandas using skiprows)
  2. Read it into memory, filtering out the invalid lines, then pass it to pandas using StringIO
  3. Read it into pandas with all columns (or the number of desired columns + 1), then only keep the rows where the excess columns contain NaN

The following examples use the variable len_threshold, which should be set to the number of columns allowed for a row, and your_file_name, which should contain the name of the CSV text file.
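For concreteness, the snippets below assume a setup along these lines (the threshold 5 and the file name "file.csv" are only examples taken from the question):

import io
import pandas as pd

len_threshold = 5             # maximum number of columns allowed per row
your_file_name = "file.csv"   # name of the header-less csv file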

Method 1: Read the file twice

You can do that with pandas for convenience: the first pass only counts the separators per line, and the second pass is a normal pandas read that skips the offending rows. Like this:

# first pass: count the field separators in each line
with open(your_file_name, 'rt') as fp:
    counts = pd.Series([line.count(',') for line in fp])
# rows with at least len_threshold separators have more than len_threshold fields and are skipped
rows_to_skip = counts[counts >= len_threshold].index.to_list()
# second pass: read the file with pandas, skipping the offending rows
df = pd.read_csv(your_file_name, names=list(range(len_threshold)), index_col=False, skiprows=rows_to_skip)

Note that to apply this method you should be sure that your fields do not contain the separator, as it does not check whether the commas are inside quoted text.
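If your data can contain quoted commas, one possible workaround (just a sketch, not part of the answer above, and assuming quoted fields never contain embedded newlines) is to count the fields with the csv module instead of counting raw commas:

import csv

# first pass: count the parsed fields per line, honouring quoting
with open(your_file_name, 'rt', newline='') as fp:
    field_counts = [len(record) for record in csv.reader(fp)]
rows_to_skip = [i for i, n in enumerate(field_counts) if n > len_threshold]
# second pass as before
df = pd.read_csv(your_file_name, names=list(range(len_threshold)), index_col=False, skiprows=rows_to_skip)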

Method 2: reading into memory / variant: reading into pandas line by line

string_buffer = io.StringIO()
with open(your_file_name, 'rt') as fp:
    for line in fp:
        # keep only lines with at most len_threshold fields
        # (a line with n fields contains n - 1 separators)
        if line.count(',') < len_threshold:
            string_buffer.write(line)
# "rewind" the string_buffer in order to read it from its start
string_buffer.seek(0)
df = pd.read_csv(string_buffer, names=list(range(len_threshold)), index_col=False)

Note that, as above, to apply this method you should be sure that your fields do not contain the separator, as it does not check whether the commas are inside quoted text. It also needs more memory, so it is not suitable for very large files. As a variant, instead of writing the valid lines to a string buffer, you could read each of them into pandas directly with read_csv. This way you also don't need to worry about type conversions, but pandas may have trouble guessing the types correctly from just one line at a time. If you already know the correct column types, you can of course pass them. The variant would look like this:

df = pd.DataFrame([], columns=range(len_threshold))
df_len = 0
with open(your_file_name, 'rt') as fp:
    for line in fp:
        # keep only lines with at most len_threshold fields
        if line.count(',') < len_threshold:
            # parse the single line with pandas and append it to the result
            tmp_df = pd.read_csv(io.StringIO(line), names=range(len_threshold), index_col=False)
            df.loc[df_len] = tmp_df.iloc[0]
            df_len += 1
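If the column types are known in advance, you can pass them to the per-line read_csv call. A minimal sketch, assuming (as in the question's data) that every field is an integer and using the nullable "Int64" dtype:

with open(your_file_name, 'rt') as fp:
    frames = [pd.read_csv(io.StringIO(line), names=range(len_threshold),
                          index_col=False, dtype="Int64")   # assumed: all columns are integers
              for line in fp
              if line.count(',') < len_threshold]
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=range(len_threshold))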

Method 3: reading into a dataframe, then filtering out the incorrect lines

This is the simplest method of all.

# read the whole dataframe with all columns
df = pd.read_csv(your_file_name, header=None, index_col=False)
# define an indexer that considers a row good if it contains
# nothing but `NaN` in the excess columns
if len(df.columns) > len_threshold:
    good_rows = df.iloc[:, len_threshold:].isna().all(axis='columns')
    df.drop(df[~good_rows].index, inplace=True)
    df.drop(df.columns[len_threshold:], axis='columns', inplace=True)

Note that this method also allows rows to have excess field separators, as long as the excess fields are empty. In the version above it also allows rows with fewer than, say, 3 columns. If, for example, your third column always contains something in valid rows, it is easy to exclude rows that are too short; you would only have to change the "good_rows" line to:

    good_rows= df.iloc[:, len_threshold:].isna().all(axis='columns') & ~df.iloc[:, 2].isna()
jottbe

If you pass the parameter header=None into pandas.read_csv(), the column names are integers starting from 0. Thus, if you have the following "file.csv":

1,2,3,4,5,6
1,2,3
9,6,8
1,2,3,5,6

you can read it into a DataFrame using the following code:

import pandas as pd

df = pd.read_csv("file.csv", header=None, dtype="Int64")

If you were to execute print(df) your result would be:

   0  1  2    3    4    5
0  1  2  3    4    5    6
1  1  2  3  NaN  NaN  NaN
2  9  6  8  NaN  NaN  NaN
3  1  2  3    5    6  NaN

Now if you want to remove all rows that have greater than or equal to five non-NaN values, the following code should do the trick:

# drop every row that has 5 or more non-null values
for index, row in df.iterrows():
    if row.notnull().sum() >= 5:
        df.drop(index, inplace=True)

# drop the columns that are now entirely empty
df.dropna(axis=1, how="all", inplace=True)
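Equivalently (a sketch, not part of the answer above), the same filtering can be written with a boolean mask, which avoids dropping rows while iterating:

# df is the DataFrame read above
df = df[df.notnull().sum(axis=1) < 5]   # keep rows with fewer than 5 non-null values
df = df.dropna(axis=1, how="all")       # drop the columns that are now entirely empty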

If you were to execute print(df) your new result would be:

   0  1  2
1  1  2  3
2  9  6  8

Now if you want to overwrite file.csv with the longer rows removed, it's as easy as:

df.to_csv("file.csv", header=False, index=False)
Clade