
I am reading in a large csv in pandas with:

features = pd.read_csv(filename, header=None, names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort','DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'], usecols=['Duration','SrcDevice', 'DstDevice', 'Protocol', 'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'])

I get:

sys:1: DtypeWarning: Columns (6) have mixed types. Specify dtype option on import or set low_memory=False.

How can I find the first line in the input which is causing this warning? I need to do this to debug the problem with the input file, which shouldn't have mixed types.

Mad Physicist
Simd

2 Answers


Once Pandas has finished reading the file, you cannot tell which lines were problematic (see this answer for why).

This means you have to detect the problem while reading the file: read it row by row, check the type of each value, and stop as soon as one doesn't match the expected type.

To achieve this with Pandas, you can pass chunksize=1 to pd.read_csv() to read the file in chunks (DataFrames of N rows each, 1 in this case). See the documentation if you want to know more about this.

The code goes something like this:

import pandas as pd

# Read the file in chunks of size 1. This returns a reader rather than a DataFrame.
reader = pd.read_csv(filename, chunksize=1)

# get the first chunk (DataFrame), to calculate the "true" expected types
first_row_df = reader.get_chunk()
expected_types = [type(val) for val in first_row_df.iloc[0]] # a list of the expected types.

i = 1 # the current index. Start from 1 because we've already read the first row.
for row_df in reader:
    row_types = [type(val) for val in row_df.iloc[0]]
    if row_types != expected_types:
        print(i) # this row is the wanted one
        break
    i += 1

Note that this code assumes the first row has the "true" types. It is also quite slow, so I recommend checking only the columns you suspect are problematic (though this does not buy much performance, since the file is still read row by row).
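As a sketch of that single-column idea, here is a self-contained toy version using an in-memory CSV (the column names and the bad value are made up for illustration; substitute your real filename and suspect column):

```python
import io
import pandas as pd

# Toy CSV standing in for the real file: row 2 (0-based, counting from
# after the first row) has a non-numeric value in the "value" column.
data = "a,1\nb,2\nc,oops\nd,4\n"

reader = pd.read_csv(io.StringIO(data), header=None,
                     names=["key", "value"], chunksize=1)

# Use the first row to establish the expected type of the suspect column.
first = reader.get_chunk()
expected = type(first["value"].iloc[0])

# Iterate over the remaining one-row chunks; stop at the first mismatch.
bad_row = None
for i, chunk in enumerate(reader, start=1):
    if type(chunk["value"].iloc[0]) is not expected:
        bad_row = i
        break

print(bad_row)  # 2
```

The same loop works on the real file by passing the filename instead of the StringIO object and checking whichever column the warning names.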

Qusai Alothman
import pandas as pd

for endrow in range(1000, 4000000, 1000):
    startrow = endrow - 1000
    try:
        # Forcing DstPort to int makes any mixed-type value raise a ValueError
        pd.read_csv(filename, dtype={"DstPort": int}, skiprows=startrow, nrows=1000,
                    header=None,
                    names=['Time','Duration','SrcDevice','DstDevice','Protocol','SrcPort',
                           'DstPort','SrcPackets','DstPackets','SrcBytes','DstBytes'],
                    usecols=['Duration','SrcDevice', 'DstDevice', 'Protocol', 'DstPort',
                             'SrcPackets','DstPackets','SrcBytes','DstBytes'])
    except ValueError:
        print(f"Error is from row {startrow} to row {endrow}")

This splits the file into chunks of 1000 rows each, so you can see which range of rows contains the mixed-type value that causes the warning.
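Once a bad 1000-row window is known, the same trick can narrow it down to the exact line by re-reading that window one row at a time. A minimal sketch with a toy in-memory file (the column names and failing row are invented for illustration):

```python
import io
import pandas as pd

# Toy file standing in for the real CSV: the "value" field in row 5
# is not an int, so a strict dtype will reject exactly that row.
lines = ["{},{}\n".format(i, i * 10) for i in range(10)]
lines[5] = "5,not_a_number\n"
text = "".join(lines)

# Suppose the coarse scan flagged rows 0-10; re-read one row at a time
# with the strict dtype to pinpoint the offending line.
bad = None
for row in range(0, 10):
    try:
        pd.read_csv(io.StringIO(text), dtype={"value": int}, header=None,
                    names=["key", "value"], skiprows=row, nrows=1)
    except ValueError:
        bad = row
        break

print(bad)  # 5
```

On the real file you would replace the StringIO object with the filename and loop over the flagged startrow..endrow range.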

kkkyy