
I have a big .csv file that I want to break into many small files. Each small file should contain the rows that share a common value in the BB column and have 1 in the HH column, plus all the rows that have an empty BB column and 0 in the HH column.

In other words, every output file will contain the rows for one common BB value (HH = 1), followed by all the rows with no value in the BB column (HH = 0). Each file should be named after its common BB value (column 2). Please take a look at the scenario below.

bigFile.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      53     115       10     3       3     186     1
   12      53     01e       23     3       2             1
   12      53     0ce       65     1       3             1
   12      53     173       73     4       2             1
   12      59     115       0      3       3     186     1
   12      59     125       0      3       3     186     1
   12      61     01e       23     3       2             1
   12      61     b6f       0      1       1             1
   12      61     b1b       0      6       5     960     1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

My expected result files would be as follows:

53.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      53     115       10     3       3     186     1
   12      53     01e       23     3       2             1
   12      53     0ce       65     1       3             1
   12      53     173       73     4       2             1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

59.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      59     115       0      3       3     186     1
   12      59     125       0      3       3     186     1
   12             68b       95     3       5     334     0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4     215     0
   12             a10       36     5       1             0      

61.csv :

   AA      BB      CC       DD     EE      FF    GG      HH
   12      61     01e       23     3       2             1
   12      61     b6f       0      1       1             1
   12      61     b1b       0      6       5    960      1
   12             68b       95     3       5    334      0
   12             31a       31     2       2             0
   12             221       0      4       5             0
   12             12b       25     5       4    215      0
   12             a10       36     5       1             0      
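
The split described above can be sketched with the csv module (my sketch, not from the original post; `split_by_bb` is a made-up helper name, and the file is assumed to be comma-separated with the eight columns shown):

```python
import csv
import os
from collections import defaultdict

def split_by_bb(in_path, out_dir='.'):
    """Split in_path into one CSV per distinct BB value (column 2).

    Rows with HH == '1' go to the file named after their BB value;
    rows with HH == '0' (empty BB) are appended to every output file.
    """
    common = defaultdict(list)   # BB value -> its HH == '1' rows
    shared = []                  # the HH == '0' tail shared by all files
    with open(in_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            if row[7].strip() == '0':
                shared.append(row)
            else:
                common[row[1].strip()].append(row)
    for bb, rows in common.items():
        with open(os.path.join(out_dir, bb + '.csv'), 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
            writer.writerows(shared)
```

Running `split_by_bb('bigFile.csv')` on the data above would produce 53.csv, 59.csv and 61.csv, each ending with the shared HH = 0 rows.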
Tofazzal
  • excuse my ignorance: isn't `if str(row[1]) in row:` always true? – Pynchia Nov 17 '15 at 13:04
  • Please edit your question, it's really hard to understand what you want do. What is the criteria for removing the red rows? – Pynchia Nov 17 '15 at 13:09
  • 1
    Why just the red rows and not the following ones? BTW, do they lack a value for the second column? – Pynchia Nov 17 '15 at 13:30
  • 1
    Please specify what are *uncommon* lines. Currently your question is *unclear*. – Serge Ballesta Nov 17 '15 at 13:37
  • Hello friends Pynchia and Serge Ballesta, the previous problem has been solved, but I am facing a big problem, explained above. Please take a look and let me know if you have any idea. Thank you very much for your time. – Tofazzal Nov 18 '15 at 13:44

2 Answers


The following script will produce your expected outcome for a single output file. It filters the file based on the AA and BB values of the first data row, keeping rows that share its BB value, plus all rows with 0 in the HH column:

import csv

with open('123.csv', 'r') as f_input, open('output.csv', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input)
    csv_output = csv.writer(f_output)

    header = next(csv_input)
    first_row = next(csv_input)

    csv_output.writerow(header)
    csv_output.writerow(first_row)

    for row in csv_input:
        if row[0] == first_row[0]:
            if row[1] == first_row[1] or row[7] == '0':
                csv_output.writerow(row)

To do this operation on all CSV files in the same folder, the following could be used:

import glob             
import csv
import os

file_list = glob.glob('*.csv')

output_folder = 'temp'  # make sure this folder exists

for file in file_list:
    file_name = os.path.join(output_folder, os.path.basename(file))
    with open(file, 'r') as f_input, open(file_name, 'w', newline='') as f_output:
        csv_input = csv.reader(f_input)
        csv_output = csv.writer(f_output)

        header = next(csv_input)
        first_row = next(csv_input)

        csv_output.writerow(header)
        csv_output.writerow(first_row)

        for row in csv_input:
            if row[0] == first_row[0]:
                if row[1] == first_row[1] or row[7] == '0':
                    csv_output.writerow(row)

This will write all of the output files to a temp folder.
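
If the export tool writes the literal string NULL for empty cells (as a Hive export can), those can be blanked while copying. This is a sketch, not part of the original answer; `blank_nulls` and the file paths are made-up names:

```python
import csv

def blank_nulls(in_path, out_path):
    # Copy a CSV, replacing any cell that is exactly "NULL" with an
    # empty string, leaving every other cell untouched.
    with open(in_path, newline='') as f_in, \
            open(out_path, 'w', newline='') as f_out:
        writer = csv.writer(f_out)
        for row in csv.reader(f_in):
            writer.writerow(['' if cell.strip() == 'NULL' else cell
                             for cell in row])
```

The same list comprehension could be dropped into the loop of the script above, just before each `csv_output.writerow(row)` call.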

Martin Evans
  • Glad it worked, I have added a multi-file version of the script to the solution. – Martin Evans Nov 18 '15 at 09:01
  • Evans, thank you very much. Is it possible to keep the same name, i.e., work in the same files? – Tofazzal Nov 18 '15 at 09:11
  • You would need to specify a different folder. It is not possible to directly write to the file whilst it is being read, and it would not be wise to do that anyway as any problem with the script would cause you to lose your original files. – Martin Evans Nov 18 '15 at 09:15
  • I have updated the script to allow an output folder to be specified. – Martin Evans Nov 18 '15 at 09:20
  • Hello Evans and others, I cannot submit another question within 3 days, therefore I have revised the previous question. This is a very big problem that I cannot manage due to very long execution time using a for loop. Could you please take a look and tell me if you have any idea? – Tofazzal Nov 18 '15 at 13:35
  • I am not sure why you have a memory problem, the code I gave works a line at a time so in theory it should work on any sized file. It is not possible to update an existing file like this. – Martin Evans Nov 19 '15 at 16:41
  • Hello Evans, could you please tell me how I can remove the header from the code you wrote? Because I want to write into the existing file. – Tofazzal Nov 20 '15 at 02:04
  • Remove the line `csv_output.writerow(header)` – Martin Evans Nov 20 '15 at 06:37
  • Hello Evans, thank you very much. I have a final request. I can take the data that are common to the first record, but for my research purposes I also need to take the data of the other common records. I have revised my explanation above accordingly. I sincerely expect your help and suggestions in this regard. I have posted this in a new thread that can be found at

    http://stackoverflow.com/questions/33839540/how-can-i-break-down-a-large-csv-file-into-small-files-based-on-common-records-b
    – Tofazzal Nov 21 '15 at 10:54
  • Evans, thank you very much. I am writing here because I cannot comment, due to the 50 reputation requirement, on the other post where you corrected it. One point I forgot to tell you: I need to do the same for many big files, and store the small files in the "temp" folder, like you did previously (above in this post). Could you please revise the code from the following thread: http://stackoverflow.com/questions/33839540/how-can-i-break-down-a-large-csv-file-into-small-files-based-on-common-records-b – Tofazzal Nov 24 '15 at 05:04
  • Evans, thank you. Again, my explanation may not be clear to you. I have many big csv files, like bigFile_01.csv, bigFile_02.csv, bigFile_03.csv and so on, and I want to create many small csv files from each of them according to the condition I mentioned. Could you please take a look again? Also, the GG column produces NULL where there is no data, starting where the 0 values start in the HH column. We do not need NULL; it should be blank. Could you please take a look at this? Thank you again. – Tofazzal Nov 24 '15 at 13:34
  • Evans, thank you. The program is working fine. One point: when I use a Linux server, empty cells in the GG column are automatically filled with the value "NULL". This is not due to your code; NULL is inserted automatically when I select/import data from the server with a Hive query command. Could you please take a look at whether it is possible to delete NULL in this program? Thank you. – Tofazzal Nov 25 '15 at 08:44
  • Hello Evans, thank you very much. The program is working fine. A simple request: I find that the columns "AA" and "BB" are not required, so I would like to delete these two columns from all the files. Could you please take a look at this matter? Thank you very much for your support. – Tofazzal Nov 25 '15 at 14:35
  • Hello Evans, I have posted a new question at the following link. Please take a look; I would like your suggestions and advice to solve the problem. All the previous programs work fine. Thank you very much. – Tofazzal Dec 03 '15 at 13:54
  • Link of the new question: http://stackoverflow.com/questions/34066548/delete-or-remove-unexpected-records-and-strings-based-on-multiple-criteria-by-py – Tofazzal Dec 04 '15 at 02:04

It seems like you're trying to remove the rows whose conv_id is 50983 or 51026.

You can easily achieve that using pandas.


Import sample data

Here StringIO simulates a real file, so that read_csv can be demonstrated inline:

import pandas as pd
from io import StringIO # only for this example

data = """
adv_id,conv_id,user_id,dwell,sessions
123,50381,4e06,0,3
123,50381,454b,0,239
123,50983,4e06,0,3
123,50983,454b,0,239
123,51026,454b,0,239
123,maf98,45eb,0,1
"""

# Start here
df = pd.read_csv(StringIO(data), delimiter=',')

This results in df being:

   adv_id conv_id user_id  dwell  sessions
0     123   50381    4e06      0         3
1     123   50381    454b      0       239
2     123   50983    4e06      0         3
3     123   50983    454b      0       239
4     123   51026    454b      0       239
5     123   maf98    45eb      0         1

Filter out values

To remove the unwanted rows we can extract the rows that contain those values using isin:

idx = df['conv_id'].isin(['50983','51026'])
df = df[~idx]
# the ~ means to exclude those values

Which results in:

   adv_id conv_id user_id  dwell  sessions
0     123   50381    4e06      0         3
1     123   50381    454b      0       239
5     123   maf98    45eb      0         1
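
If the filtered frame then needs to go back to disk, pandas can write it directly; `index=False` drops the integer index pandas added on read (the file name here is a made-up example):

```python
import pandas as pd
from io import StringIO  # only to simulate a file inline

data = """adv_id,conv_id,user_id,dwell,sessions
123,50381,4e06,0,3
123,50983,454b,0,239
123,maf98,45eb,0,1
"""
df = pd.read_csv(StringIO(data))
df = df[~df['conv_id'].isin(['50983'])]  # same isin filter as above
df.to_csv('filtered.csv', index=False)   # index=False drops the row index
```

Note that conv_id is only read as strings because 'maf98' forces an object column; on a purely numeric column, `isin` would need integer values instead.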
Leb