
I am trying to evaluate some data using Python. A sample of the data is as follows:

****************************************************************
*                       SAMPLE DATA                            *
****************************************************************
* Date      Times    Severity  Source  Machine     State   
18-May-2019 16:28:18    I   StatesLog   Off-Line   States: IALT-1
18-May-2019 16:28:19    I   StatesLog   Off-Line   States: TdALclr-0
18-May-2019 16:28:19    I   StatesLog   Off-Line   States: S722a1-0, S722a2-0, S722ascon-0
18-May-2019 16:28:19    I   StatesLog   Off-Line   States: !S722a1-1, S722(OFF)-0, !S722a2-1

What I'm after (eventually) is

Time                    Data
18-May-2019 16:28:18    IALT-1
18-May-2019 16:28:19    TdALclr-0
18-May-2019 16:28:19    S722a1-0, 
18-May-2019 16:28:19    S722a2-0, 
18-May-2019 16:28:19    S722ascon-0
18-May-2019 16:28:19    !S722a1-1, 
18-May-2019 16:28:19    S722(OFF)-0, 
18-May-2019 16:28:19    !S722a2-1

With data this short I can manually adjust the number of columns I need, but as some of the data files are over 100 MB in size I have no idea how many columns I'll need to put the data into a DataFrame.

I have tried the code below to remove the large header

import pandas as pd

with open('test.txt') as oldfile, open('newtest.txt', 'w') as newfile:
    newfile.write('Date      Times    Severity  Source  Machine State  Data Data1 Data2\n')
    for line in oldfile:
        if '*' not in line:
            newfile.write(line)

df = pd.read_table('newtest.txt', sep=r'\s+', engine='python')
df[['Date', 'Times', 'Data', 'Data1', 'Data2']].to_csv('trial.csv')

Which has worked up to a point, but after a while with the real data I get a parse error from the read_table command: "Expected X fields in line Z, saw Y". I presume this is because the number of columns is taken from the top line?

I need a way to read the file and work out the maximum number of columns, so I can somehow pass that to pandas and alleviate the error. The column names don't matter for now, as I can always adjust them later in the code.
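One way to sidestep the tokenizer error is to scan the file once for the widest row and hand pandas that many column names up front via names=, so short rows are padded with NaN instead of raising. This is a sketch under my own assumptions (the helper name read_ragged and whitespace-separated fields are mine, matching the sample data):

```python
import pandas as pd

def read_ragged(path):
    # First pass: find the widest whitespace-separated row.
    with open(path) as fh:
        max_cols = max(len(line.split()) for line in fh if line.strip())
    # Second pass: give pandas that many column names so shorter rows
    # are padded with NaN instead of raising "Expected X fields ...".
    return pd.read_csv(path, sep=r'\s+', header=None,
                       names=range(max_cols), engine='python')
```

The resulting frame has integer column labels 0..max_cols-1, which you can rename later as the question suggests.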

Then hopefully the bottom part of my code will give me the results I'm after

df['Time'] = df['Date'].astype(str) + ' ' +df['Times']
a = df.set_index('Time').stack()
df = a[a !=0].reset_index(drop=True, level=1).reset_index(name='Data').to_csv('output.csv')
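For what it's worth, the stack step behaves like this on a toy frame (the column names here are made up for illustration). Note that stack() already drops NaN cells, so the a[a != 0] filter only matters if missing cells really contain literal zeros:

```python
import pandas as pd

df = pd.DataFrame({'Time': ['18-May-2019 16:28:19'],
                   'Data': ['S722a1-0'],
                   'Data1': ['S722a2-0']})
a = df.set_index('Time').stack()   # one (Time, value) pair per cell; NaNs dropped
out = a.reset_index(drop=True, level=1).reset_index(name='Data')
print(out)
```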
PM77

3 Answers


First use str.replace to remove States: from your column. Then use this function to unnest your values to rows:

df['State'] = df['State'].str.replace('States:', '')

df = explode_str(df, 'State', ',').reset_index(drop=True)

          Date     Times Severity     Source   Machine         State
0  18-May-2019  16:28:18        I  StatesLog  Off-Line        IALT-1
1  18-May-2019  16:28:19        I  StatesLog  Off-Line     TdALclr-0
2  18-May-2019  16:28:19        I  StatesLog  Off-Line      S722a1-0
3  18-May-2019  16:28:19        I  StatesLog  Off-Line      S722a2-0
4  18-May-2019  16:28:19        I  StatesLog  Off-Line   S722ascon-0
5  18-May-2019  16:28:19        I  StatesLog  Off-Line     !S722a1-1
6  18-May-2019  16:28:19        I  StatesLog  Off-Line   S722(OFF)-0
7  18-May-2019  16:28:19        I  StatesLog  Off-Line     !S722a2-1

If you only want to keep the Date and State columns:

explode_str(df, 'State', ',')[['Date', 'State']].reset_index(drop=True)

          Date         State
0  18-May-2019        IALT-1
1  18-May-2019     TdALclr-0
2  18-May-2019      S722a1-0
3  18-May-2019      S722a2-0
4  18-May-2019   S722ascon-0
5  18-May-2019     !S722a1-1
6  18-May-2019   S722(OFF)-0
7  18-May-2019     !S722a2-1

Function used from other answer:

import numpy as np

def explode_str(df, col, sep):
    # repeat each row once per value in its separated string, then
    # overwrite the column with the flattened split values
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
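A quick sanity check of explode_str on a two-row frame, plus an equivalent using the built-in Series.explode (available in pandas 0.25+), in case that reads better. The str.strip call drops the space left after each comma:

```python
import numpy as np
import pandas as pd

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

df = pd.DataFrame({'Date': ['18-May-2019'] * 2,
                   'State': ['IALT-1', 'S722a1-0, S722a2-0']})
out = explode_str(df, 'State', ',').reset_index(drop=True)

# Same result with the built-in explode: split into lists, one row per
# list element, then strip the stray spaces.
alt = (df.assign(State=df['State'].str.split(','))
         .explode('State')
         .assign(State=lambda d: d['State'].str.strip())
         .reset_index(drop=True))
```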
Erfan

I managed to work out the columns part by trial and error. I'm fairly new to Python, so although this works it probably isn't the best or cleanest way of doing things. It does take a very long time to work out which row has the most columns for large data.

This does stop Erfan's code from working, though.

import csv
import os

import numpy as np
import pandas as pd


with open('test.txt', 'r') as oldfile, open('newtest.txt', 'w') as newfile:
    for line in oldfile:
        newfile.write(line)  # Leave original file untouched and save a copy to modify

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()                       # remove the "," from the data part and replace with ' '
    f.write(content.replace(',', ' ')) # all info now has ' ' separator

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(' ', ',')) # replace separator with ','

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,,,', ',')) # try to remove extra ,'s

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,,', ',')) # try to remove extra ,'s

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,', ',')) # try to remove extra ,'s

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace(',,', ',')) # try to remove extra ,'s  Still left one column with ,, not sure why?

with open('newtest.txt', 'r+') as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace('States:', '')) # remove 'States:'



num_lines = sum(1 for line in open('newtest.txt')) # find how many lines are in the data
x = num_lines - 10 # subtract 10 as we don't need the header
y = 10  # 10 lines in the header
max_col = 0


while x > 1:
    a = pd.read_csv('newtest.txt', header=None, skiprows=y, nrows=1)
    max_col_ln = a.shape[1]


#    print(x) # --- used for testing to see how many lines left

    if max_col_ln > max_col:    # read entire file and find what is the largest column number needed
        max_col = max_col_ln    # as it probably won't be on line 1
    x = x - 1
    y = y + 1



z = 0

with open('newtest2.txt', 'w') as tempfile:
    while max_col > 0:
        tempfile.write('Column' + str(z) +',') # Create ColumnX, ColumnY etc for the maximum number of columns
        max_col = max_col-1
        z = z + 1

with open('newtest2.txt', 'r') as temphead:
    headers = temphead.read().replace('\n', '') #Load headers as an index for columns

with open('newtest.txt', 'r+') as oldfile, open('newtest3.txt', 'w') as tempdata:
    tempdata.write(headers) # write headers at the top of the new temp file
    for line in oldfile:
        if '*' not in line:
            tempdata.write(line) # write all the data but drop the lines with * at the start


newdata = pd.read_table('newtest3.txt') # read the txt as a table
newdata.to_csv('data.csv', quoting=csv.QUOTE_NONE, escapechar='*', index=False) #write to csv using * as escape char and no index 

df = pd.read_csv('data.csv')
df['Time'] = df["Column0*"] + ' ' + df['Column1*'] # combine first 2 columns to make a "Time" column
cols= list(df.columns)
cols = [cols[-1]] + cols[:-1] 
df = df[cols] # swap "time" and Column0 around
df = df.drop(['Column0*', 'Column1*', 'Column2*', 'Column3*', 'Column4*', 'Column5*'], axis=1).to_csv('data.csv', index=False) #remove columns I don't require from the data

with open('data.csv', 'r+')as f:
    content = f.read()
    f.seek(0)
    f.truncate()
    f.write(content.replace('*', '')) # remove the * escape char from earlier and write back to the csv.



os.remove('newtest.txt')
os.remove('newtest2.txt')
os.remove('newtest3.txt') # bit of house keeping after all the changes
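As an aside, the slow per-line read_csv loop in the middle of this answer can be collapsed into a single csv.reader pass. This is a sketch under stated assumptions: widest_row is my own name, and it expects the same comma-separated newtest.txt with a 10-line header:

```python
import csv

def widest_row(path, skip=10):
    # One pass over the file, counting fields per row with csv.reader
    # instead of calling pd.read_csv once per line.
    with open(path, newline='') as fh:
        reader = csv.reader(fh)
        for _ in range(skip):        # skip the header lines
            next(reader, None)
        return max((len(row) for row in reader), default=0)
```

With that, max_col = widest_row('newtest.txt') and the header-generation part of the answer can run unchanged.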

PM77

I am new to Python, but you can call shell commands to get what you want.

import subprocess

filename = "test.txt"
# os.system only returns the exit status, so use subprocess to capture the output
cmd = "awk '{print NF}' " + filename + " | sort -unk 1,1 | tail -n 1"
max_fields = int(subprocess.check_output(cmd, shell=True))
cmd = "awk '{print NF}' " + filename + " | sort -unk 1,1 | head -n 1"
min_fields = int(subprocess.check_output(cmd, shell=True))
print(min_fields, max_fields)

If the separator in your file is not space/tab, you can add -F ',' to the awk command (for a CSV file).
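If you'd rather stay in Python, the same min/max field counts can be computed without awk. A sketch, assuming whitespace-separated fields like the sample (the helper name field_width_range is mine):

```python
def field_width_range(path):
    # Pure-Python stand-in for the awk pipeline: min and max
    # whitespace-separated field counts across all non-blank lines.
    with open(path) as fh:
        widths = [len(line.split()) for line in fh if line.strip()]
    return min(widths), max(widths)
```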

Leon