
I've run into an issue extracting values (in order to count them) from a .csv file while looping through a list of words to find the matching rows.

The .csv file is structured as follows:

word,pleasantness,activation,imagery
a,2.0000,1.3846,1.0
abandon,1.0000,2.3750,2.4
abandoned,1.1429,2.1000,3.0
abandonment,1.0000,2.0000,1.4
etc...

The first column contains a list of ~9,000 words and the other three columns contain values that are of linguistic relevance to that specific word.

I used pandas to create a dataframe:

df = pd.read_csv("dictionary.csv", sep=',')

I've also got a text file which I've turned into a list:

read_file = open(textfile)
data = read_file.read().split()

Now, my goal is to have the program go through each word in the list; every time one of those words is found in the first column of the .csv file, its values should be added to the running totals, and so on until the end of the list is reached.

count = 0
pleasantness = 0
activation = 0
imagery = 0

for w in data:
    count = count + 1
    if w in df.word:
        pleasantness = pleasantness + df.pleasantness
        activation = activation + df.activation
        imagery = imagery + df.imagery

print(count, pleasantness, activation, imagery)

This is the best I've been able to come up with and it clearly doesn't work; by the end of it the variables are all still 0.

Does anyone have a clue as to how to do this? It naturally doesn't have to be done using something similar to this approach; I merely care about getting the results.


1 Answer


IIUC, given you have a .csv such as:

from io import StringIO
import pandas as pd

z = StringIO("""word,pleasantness,activation,imagery
a,2.0000,1.3846,1.0
abandon,1.0000,2.3750,2.4
abandoned,1.1429,2.1000,3.0
abandonment,1.0000,2.0000,1.4""")

df = pd.read_csv(z)

which yields

>>> df
    word        pleasantness    activation  imagery
0   a           2.0000          1.3846      1.0
1   abandon     1.0000          2.3750      2.4
2   abandoned   1.1429          2.1000      3.0
3   abandonment 1.0000          2.0000      1.4

and a text such as

text = ("Lorem abandon ipsum dolor sit amet abandonment , consectetur adipiscing elit. abandon Maecenas consequat accumsan lacus. Duis justo nunc, mattis non ante a, convallis luctus eros. Sed sed urna sed magna auctor sagittis eu id magna. Maecenas leo nunc, tincidunt ut sagittis quis, porttitor sit amet ligula. Nunc faucibus ante ac blandit porta")

import numpy as np

data = np.array(text.split())

which yields

>>> data

['Lorem' 'abandon' 'ipsum' 'dolor' 'sit' 'amet' 'abandonment' ','
 'consectetur' 'adipiscing' 'elit.' 'abandon' 'Maecenas' 'consequat'
 'accumsan' 'lacus.' 'Duis' 'justo' 'nunc,' 'mattis' 'non' 'ante' 'a,'
 'convallis' 'luctus' 'eros.' 'Sed' 'sed' 'urna' 'sed' 'magna' 'auctor'
 'sagittis' 'eu' 'id' 'magna.' 'Maecenas' 'leo' 'nunc,' 'tincidunt' 'ut'
 'sagittis' 'quis,' 'porttitor' 'sit' 'amet' 'ligula.' 'Nunc' 'faucibus'
 'ante' 'ac' 'blandit' 'porta']

You can use numpy.isin and collections.Counter (imported with `from collections import Counter`) as helpers in the processing:

>>> d = Counter(data[np.isin(data, df.word)])
>>> d
Counter({'abandon': 2, 'abandonment': 1})

and run through the counted values

pleasantness, activation, imagery = (0,0,0)
for k,v in d.items():
    values = df.loc[df.word == k]
    pleasantness += values["pleasantness"].item()*v
    activation   += values["activation"].item()*v
    imagery      += values["imagery"].item()*v
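If the word column is unique, the repeated boolean mask `df.word == k` inside the loop can be avoided by indexing the frame by word once. A small sketch, using hypothetical stand-in data mirroring the example above:

```python
import pandas as pd
from collections import Counter

# hypothetical stand-in data mirroring the example frame above
df = pd.DataFrame({
    "word": ["a", "abandon", "abandoned", "abandonment"],
    "pleasantness": [2.0, 1.0, 1.1429, 1.0],
    "activation": [1.3846, 2.375, 2.1, 2.0],
    "imagery": [1.0, 2.4, 3.0, 1.4],
})
d = Counter({"abandon": 2, "abandonment": 1})

# index by word once so each lookup is a direct .at access
lookup = df.set_index("word")
pleasantness = sum(lookup.at[k, "pleasantness"] * v for k, v in d.items())
activation = sum(lookup.at[k, "activation"] * v for k, v in d.items())
imagery = sum(lookup.at[k, "imagery"] * v for k, v in d.items())
print(pleasantness, activation, round(imagery, 2))  # 3.0 6.75 6.2
```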

Which would yield, for this text,

print(pleasantness, activation, imagery)
3.0   6.75   6.2

Your total count of matched words would simply be

print(sum(d.values()))
3
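As an aside, this is also why the loop in the question left all the totals at zero: `in` on a pandas Series tests the index labels, not the values. A minimal sketch with a hypothetical two-row frame:

```python
import pandas as pd

# hypothetical two-row frame, just to illustrate the membership pitfall
df = pd.DataFrame({"word": ["a", "abandon"], "pleasantness": [2.0, 1.0]})

# `in` on a Series checks the index labels (0, 1, ...), not the values,
# so this is False even though "abandon" is in the column:
print("abandon" in df.word)         # False

# testing against the underlying values works as intended:
print("abandon" in df.word.values)  # True
```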

If you want to avoid looping through the Counter, you can build a new data frame, such as

ndf = pd.merge(pd.DataFrame(dict(d), index=[0]).T.rename(columns={0: "count"}),
               df.set_index("word"), left_index=True, right_index=True)

which is

>>> ndf   
            count   pleasantness    activation  imagery
abandon     2       1.0             2.375       2.4
abandonment 1       1.0             2.000       1.4

and multiply each row's count through the remaining columns

ndf.apply(lambda k: k[0]*k[1:], 1)
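The same weighting can also be done without `apply`, by multiplying the value columns by the count column with `mul` along the index axis. A sketch that rebuilds the merged frame by hand:

```python
import pandas as pd

# rebuild the merged frame from above by hand
ndf = pd.DataFrame(
    {"count": [2, 1],
     "pleasantness": [1.0, 1.0],
     "activation": [2.375, 2.0],
     "imagery": [2.4, 1.4]},
    index=["abandon", "abandonment"],
)

# multiply every column except the first by the count column, row-wise
weighted = ndf.iloc[:, 1:].mul(ndf.iloc[:, 0], axis=0)
print(weighted)
```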

to get

                pleasantness    activation  imagery
abandon         2.0             4.75        4.8
abandonment     1.0             2.00        1.4

Now you can just use pandas built-in functions, such as .sum() on the weighted frame:

pleasantness    3.00
activation      6.75
imagery         6.20
dtype: float64
    Works brilliantly! Had to muck about to make it work with the rest of my code (the text file is read in using input from the user), but it did the job once I managed that. One point of comment, though, for others who may use this: "The StringIO and cStringIO modules are gone. Instead, import the io module and use io.StringIO or io.BytesIO for text and data respectively." as found here: https://stackoverflow.com/questions/11914472/stringio-in-python3 And 1 question: can I somehow shorten the df input? Because I have 9000 lines in that csv; would crowd the file quite a bit. – Fillask May 17 '18 at 20:29
  • @Fillask you should keep to your `df.read_csv(filename)`. I just used `StringIO` and a small `df` to make your problem small and reproducible :) Take a look at [mcve]. But you definitely should **not** copy/paste all your 9000 rows into your code! Just keep that in a text file and use pd.read_csv directly on the file. – rafaelc May 17 '18 at 20:34
  • Figured as much :p The reason I asked is because I got this error "C:\Users\Fillask\Anaconda3\lib\site-packages\numpy\lib\arraysetops.py:466: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison" when running it with the df that I had, whereas I got no error with the StringIO example you gave. Many thanks for the help, however! – Fillask May 17 '18 at 20:42
  • A second issue: this doesn't appear to either register or count words that occur twice; it only checks for unique words, it seems. And indeed, I've noticed; it's a right pain :p I thoroughly appreciate the effort and time you've put into this, but technically the less comprehensive answer is the more immediately useful one, which irks me to no end. – Fillask May 17 '18 at 21:07
  • Also, I managed to get it to the point where it does still give the aforementioned error, but it also provides the counts, which is less of an issue. – Fillask May 17 '18 at 21:12
  • @Fillask the `Counter` does exactly that. It counts how many times that word appears. When you multiply by `*v` in the for loop, you are weighting your values according to the counts :) – rafaelc May 17 '18 at 21:19
  • There's just one possible flaw with the way you read your file. If there is a comma next to a word, it would process it as a different word because of `.split()`. i.e. if you have `Word1 Word2 Word1,`, this would yield `['Word1', 'Word2', 'Word1,']` – rafaelc May 17 '18 at 21:22
  • 1
    You sneaky you, that *v changes things tremendously! :p I've got it working exactly as I wanted it to, now. All credits go to you, my friend. Many thanks for the help! – Fillask May 17 '18 at 21:26
  • Ah yes, I am aware. I'll be removing (or adding spaces to) punctuation and making everything lower case and such later on. I just want to have the core program working, which it is, right now. Again, thanks! :) – Fillask May 17 '18 at 21:27