
I have to copy all the rows that contain a specific word into another CSV file.

My file is a .csv and I want to copy every row that contains the word "Canada" in one of its cells. I have tried various methods found on the internet, but I am unable to copy the rows. My data contains more than 15,000 lines.

An example of my dataset:

tweets      date      area
dbcjhbc     12:4:19   us
cbhjc       3:3:18    germany
cwecewc     5:6:19    canada
cwec        23:4:19   us
wncwjwk     9:8:18    canada

My code is:

import csv

with open('twitter-1.csv', "r" ,encoding="utf8") as f:
    reader = csv.DictReader(f, delimiter=',')
    with open('output.csv', "w") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames, delimiter=",")
        writer.writeheader()
        for row in reader:
            if row == 'Canada':
                writer.writerow(row)

But this code is not working, and I am getting the error:

Error: field larger than field limit (131072)

  • Please see [ask]. If you've already tried some code, your best approach is to post the code you tried, and describe exactly what problem you saw when you ran it. Questions without code tend to get closed as "too broad". – Tim Williams Jul 15 '19 at 23:38
  • The error is _probably_ that your input data isn't proper CSV, and so when the CSV module tries to parse it, it encounters a lone double quote, and believes the rest of the file contains one huge field with newlines in it. Either clean up your data, or find out exactly how it's formatted. Perhaps see also https://stackoverflow.com/questions/15063936/csv-error-field-larger-than-field-limit-131072 – tripleee Feb 19 '21 at 06:10
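
(As the linked question explains, the csv module's default field size limit of 131072 characters can also be raised before reading; a minimal sketch, assuming the oversized field is actually legitimate data:)

import csv
import sys

# raise the csv module's field size limit above the default of 131072 characters
# (on some platforms sys.maxsize is too large for the underlying C long; use a
# smaller explicit value such as 10**7 if this call raises OverflowError)
csv.field_size_limit(sys.maxsize)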

4 Answers


I know the question asks for a solution in Python, but I believe this task can be solved more easily with command-line tools.

One-Liner using Bash:

grep 'canada' myFile.csv > outputfile.csv
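
Note that grep is case-sensitive by default, so this only matches the lowercase "canada" shown in the sample data; adding the -i flag makes it match "Canada" as well. The header row is not copied, because it does not contain the word.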
LoMaPh

You can do this even without the csv module.

# read file and split by newlines (get list of rows)
with open('input.csv', 'r') as f:
    rows = f.read().split('\n')

# loop over rows and append to list if they contain 'canada'
rows_containing_keyword = [row for row in rows if 'canada' in row]

# create and write lines to output file
with open('output.csv', 'w+') as f:
    f.write('\n'.join(rows_containing_keyword))
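
One caveat: the output will not include the header row, since it does not contain "canada". If you want to keep it, a minimal variation (assuming the first line of the file is the header) is:

# keep the header line and append only the matching data rows
with open('input.csv', 'r') as f:
    rows = f.read().split('\n')

with open('output.csv', 'w') as f:
    f.write('\n'.join([rows[0]] + [row for row in rows[1:] if 'canada' in row]))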
lol cubes

Assuming your .csv data (twitter-1.csv) looks like this:

tweets,date,area
dbcjhbc,12:4:19,us 
cbhjc,3:3:18,germany
cwecewc,5:6:19,canada
cwec,23:4:19,us
wncwjwk,9:8:18,canada

Using numpy:

import numpy as np

# import .csv data (skipping header)
data = np.genfromtxt('twitter-1.csv', delimiter=',', dtype=str, skip_header=1)

# select only rows where the 'area' column is 'canada'
data_canada = data[np.where(data[:,2]=='canada')]

# export the resulting data
np.savetxt("foo.csv", data_canada, delimiter=',', fmt='%s')

foo.csv will contain:

cwecewc,5:6:19,canada
wncwjwk,9:8:18,canada

If you want to search every entry (every column) for canada, then you could use a list comprehension. Assume twitter-1.csv contains an occurrence of canada in the tweets column:

tweets,date,area
dbcjhbc,12:4:19,us 
cbhjc,3:3:18,germany
cwecewc,5:6:19,canada
canada,23:4:19,us
wncwjwk,9:8:18,canada

This collects the indices of all rows in which one of the fields is canada, and exports those rows:

out = [i for i, v in enumerate(data) if 'canada' in v]
data_canada = data[out]
np.savetxt("foo.csv", data_canada, delimiter=',', fmt='%s')

Now, foo.csv will contain:

cwecewc,5:6:19,canada
canada,23:4:19,us
wncwjwk,9:8:18,canada
pjw

All solutions except the grep one (which is probably the fastest if grep is available) load the entire .csv file into memory. Don't do that! You can stream the file and keep only one line in memory at a time.

with open('input.csv', 'r') as f_in, open('output.csv', 'w') as f_out:
    for line in f_in:
        if 'canada' in line:
            f_out.write(line)

NOTE: I don't actually have Python 3 on this computer, so there might be a typo in this code. But I'm confident it's more efficient on sufficiently large files than loading the entire file into memory before manipulating it. It would be interesting to see benchmarks.
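
If you want to keep using the csv module from the question (so the search is restricted to real fields and the header is preserved) while still streaming the file, a sketch along these lines should work, assuming the field size limit has been raised as discussed in the comments above if the input really contains oversized fields:

import csv

with open('twitter-1.csv', 'r', encoding='utf8', newline='') as f_in, \
        open('output.csv', 'w', encoding='utf8', newline='') as f_out:
    reader = csv.DictReader(f_in)
    writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
    writer.writeheader()
    # rows are read one at a time, so only one line is held in memory
    for row in reader:
        # keep the row if any field contains "canada", ignoring case
        if any('canada' in str(value).lower() for value in row.values()):
            writer.writerow(row)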

Ben