-1

I have a CSV file whose columns are frequency counts of words, and whose rows are time periods. I want to sum for each column the total frequencies. Then I want to write to a CSV file for sums greater than or equal to 30, the column and row values, thus dropping columns whose sums are less than 30.

Just learning python and pandas. I know it is a simple question, but my knowledge is at that level. Your help is most appreciated.

I can read in the CSV file and compute the column sums.

df = pd.read_csv('data.csv')

Except of data file containing 3,874 columns and 100 rows

df.sum(axis = 0, skipna = True)

Excerpt of sums for columns

I am stuck on how to create the output file so that it looks like the original file but no longer has columns whose sums were less than 30.

I am stuck on how to write to a CSV file each row for each column whose sums are greater than or equal to 30. The layout of the output file would be the same as for the input file. The sums would not be included in the output.

Thanks very much for your help.

So, here is a link showing an excerpt of a file containing 100 rows and 3,857 columns:

bbiasi
  • 1,549
  • 2
  • 15
  • 31
  • Does this help? https://stackoverflow.com/questions/31614804/how-to-delete-a-column-in-pandas-dataframe-based-on-a-condition – Tim May 18 '19 at 00:12

1 Answers1

2

It's easiest to do this in two steps:

1. Filter the DataFrame to just the columns you want to save

df_to_save = df.loc[:, (df.sum(axis=0, skipna=True) >= 30)]

.loc is for picking rows/columns based either on labels or conditions; the syntax is .loc[rows, columns], so : means "take all the rows", and then the second part is the condition on our columns - I've taken the sum you'd given in your question and set it greater than or equal to 30.

2. Save the filtered DataFrame to CSV

df_to_save.to_csv('path/to/write_file.csv', header=True, index=False)

Just put your filepath in as the first argument. header=True means the header labels from the table will be written back out to the file, and index=False means the numbered row labels Pandas automatically created when you read in the CSV won't be included in the export.


See this answer here: How to delete a column in pandas dataframe based on a condition? . Note, the solution for your question doesn't need isnull() before the sum(), as that is specific to their question for counting NaN values.

Tim
  • 1,839
  • 10
  • 18
  • Just corrected this, as `headers=True` in step 2. should be `header=True` (ie singular) – Tim May 18 '19 at 00:46