How to extract specific data from a CSV file with given parameters?

Question

I want to extract the neutral words from the given CSV file into a separate .txt file, but I'm fairly new to Python and don't know much about file handling. I could not find a dedicated neutral-words dataset; after some searching, this is the closest thing I was able to find.

Here is the GitHub project the data comes from, in case anyone needs it: hoffman-prezioso-projects/Amazon_Review_Sentiment_Analysis

Neutral Words
Word     Sentiment Score
a        0.0125160264947
the      0.00423728459134
it      -0.0294755274737
and      0.0810574365028
an       0.0318918766949
or      -0.274298468178
normal  -0.0270787859177

So basically I want to extract only those words (text) from the CSV where the numeric value is 0.something.

May we use libraries like Pandas, or should answers be limited to the standard vanilla libraries? — bendl, Apr 02 '18 at 16:21
@eagle here you go, https://github.com/hoffman-prezioso-projects/Amazon_Review_Sentiment_Analysis/raw/master/results/sentiment_dictionary.csv — ANiK3T, Apr 02 '18 at 16:25
just the list of words, where the corresponding numeric value is between 1 and -1, basically 0.x — ANiK3T, Apr 02 '18 at 16:47

bendl · Accepted Answer · 2018-04-03T17:39:54.080

2

Even without using any libraries, this is fairly easy with the csv you're using.

First open the file (I'm going to assume you have the path saved in the variable filename), then read the file with the readlines() function, and then filter out according to the condition you give.

with open(filename, 'r') as f:                           # Open the file for reading
    rows = [line.split(',') for line in f.readlines()]   # Read the file line by line and split each line on commas
    words = [row[0] for row in rows if abs(float(row[1])) < 1]
                                                         # Keep only the words whose score is strictly between -1 and 1
    # Note: a header line (non-numeric second field) would make float() raise ValueError

This is now the accepted answer, so I'm adding a disclaimer. There are numerous reasons why this code should not be applied to other CSVs without thought.

It reads the entire CSV in memory
It does not account for e.g. quoting

It is acceptable for very simple CSVs but the other answers here are better if you cannot be certain that the CSV won't break this code.
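
The two caveats above (memory use and quoting) can both be avoided with the standard csv module. The following is a minimal sketch, not a drop-in solution: the function name neutral_words, the limit parameter, and the sample rows are all illustrative, and it assumes a comma-separated layout with the word in the first column and the score in the second. Rows whose second field is not a number (such as a header line) are skipped.

```python
import csv
import io

def neutral_words(lines, limit=1.0):
    """Yield the first field of each row whose score lies strictly between -limit and limit.

    `lines` is any iterable of text lines, for example an open file object.
    """
    for row in csv.reader(lines):      # csv.reader copes with quoted fields
        if len(row) < 2:
            continue                   # skip blank or malformed rows
        try:
            score = float(row[1])
        except ValueError:
            continue                   # skip rows such as a header line
        if abs(score) < limit:
            yield row[0]

# Hypothetical sample data standing in for the real file
sample = io.StringIO(
    'Word,Sentiment Score\n'
    'a,0.0125160264947\n'
    '"well, okay",-0.5\n'
    'great,2.75\n'
)
result = list(neutral_words(sample))
print(result)
```

With a real file, the same function can be given the file object from open(filename, newline=''), so rows are processed one at a time instead of being loaded all at once.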

edited Apr 03 '18 at 17:39

answered Apr 02 '18 at 16:30


this is a bad idea, loading the whole file in memory, what if it's large? also this only selects those that have a neutral sentiment – eagle Apr 02 '18 at 16:31
@eagle OP requested only rows where the value is 0, the file is of a known size and format and he did not specify if he had access to pandas. This was meant as instruction on how this is done without libraries more than a general solution. Other answers have been provided that give a general solution. – bendl Apr 02 '18 at 16:35
Reading a CSV file is not so easy, what about quoting for example? – ChatterOne Apr 02 '18 at 16:35
I'm sorry I wasn't able to convey properly what I wanted. I just want those text values whose corresponding numeric value is between 1 and -1, i.e. anything which is 0.xx or -0.xx – ANiK3T Apr 02 '18 at 16:50
@ANiK3T I have changed my answer to reflect this. That being said, the other comments here should be warning enough if you try to apply this code to any random csv. Hopefully this helps you understand what goes into csv parsing though :) – bendl Apr 02 '18 at 17:05

score 1 · Answer 2 · answered Apr 02 '18 at 17:47

Here is one way to do it using only the standard library, without holding the whole file in memory:

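import csv

def get_vals(filename):
    # Python 3: open in text mode; newline='' lets the csv module handle line endings
    with open(filename, newline='') as fin:
        reader = csv.reader(fin)
        for line in reader:
            # csv.reader yields strings, so convert the score before comparing.
            # For the range asked about in the question, use abs(float(line[-1])) < 1 instead.
            if float(line[-1]) <= 0:
                yield line[0]

words = get_vals(filename)

for word in words:
    print(word)  # do stuff...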
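
The question also asks for the words to end up in a separate .txt file, which none of the snippets here show. A sketch of that step follows, assuming one word per line is the desired layout; write_words, the output file name, and the temporary directory are hypothetical and only serve the demonstration.

```python
import os
import tempfile

def write_words(words, out_path):
    """Write each word from an iterable on its own line; return how many were written."""
    count = 0
    with open(out_path, 'w', encoding='utf-8') as out:
        for word in words:             # works with a generator: one word in memory at a time
            out.write(word + '\n')
            count += 1
    return count

# Demonstration with a generator standing in for get_vals(filename)
demo_words = (w for w in ['a', 'the', 'it'])
out_path = os.path.join(tempfile.mkdtemp(), 'neutral_words.txt')
written = write_words(demo_words, out_path)

# Read the file back to show what was written
with open(out_path, encoding='utf-8') as check:
    contents = check.read()
print(written, repr(contents))
```

Because write_words only iterates over its argument, it can consume a generator such as the one returned by get_vals without building a list first.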

score 0 · Answer 3 · answered Apr 02 '18 at 16:28

Use pandas like so:

import pandas
df = pandas.read_csv("yourfile.csv")
df.columns = ['word', 'sentiment']

To choose words by sentiment:

positive = df[df['sentiment'] > 0]['word']
negative = df[df['sentiment'] < 0]['word']
neutral = df[df['sentiment'] == 0]['word']
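
The three masks above split on the sign of the score, so neutral only matches a score of exactly 0. For the range clarified in the comments (between -1 and 1), a sketch using Series.abs is shown here. The in-memory sample is made up, and header=0 assumes the file's first line is a header, which names= then replaces; a file without a header line would need header=None instead.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real CSV; assumes a header line
sample = io.StringIO(
    'Word,Sentiment Score\n'
    'a,0.0125160264947\n'
    'or,-0.274298468178\n'
    'great,2.75\n'
    'awful,-3.1\n'
)
# header=0 together with names= replaces the header row with our own column names
df = pd.read_csv(sample, header=0, names=['word', 'sentiment'])

# Keep rows whose score is strictly between -1 and 1
neutral = df[df['sentiment'].abs() < 1]['word']
neutral_list = neutral.tolist()
print(neutral_list)
```

To save the result, neutral.to_csv('neutral_words.txt', index=False, header=False) writes one word per line.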

score 0 · Answer 4 · answered Apr 02 '18 at 16:33

If you don't want to use any additional libraries, you can use the csv module from the standard library. Note that delimiter='\t' may need to be changed to match your file.

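import csv

# Use a with-block so the file is closed automatically
with open('name.txt', 'r', newline='') as f:
    reader = csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        # Prints words with a positive score; change the condition to select other rows
        if float(row[1]) > 0.0:
            print(row[0] + ' ' + row[1])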
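
If the delimiter is not known in advance, the standard library's csv.Sniffer can try to guess it from a sample of the text. It is a heuristic and can fail on unusual files, so treat this as a sketch; detect_delimiter, the candidate characters ',\t;', and the sample strings are assumptions for illustration.

```python
import csv

def detect_delimiter(sample_text, candidates=',\t;'):
    """Guess the delimiter of CSV text, restricted to the given candidate characters."""
    # sniff() raises csv.Error if it cannot settle on a delimiter
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter

# Hypothetical samples: the same rows, tab-separated and comma-separated
tab_sample = 'a\t0.0125\nthe\t0.0042\nor\t-0.2743\n'
comma_sample = 'a,0.0125\nthe,0.0042\nor,-0.2743\n'

tab_delim = detect_delimiter(tab_sample)
comma_delim = detect_delimiter(comma_sample)
print(repr(tab_delim), repr(comma_delim))
```

The detected character can then be passed as the delimiter argument to csv.reader.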
