How to calculate the standard deviation from a text (.txt) file?

Question

Category;currency;sellerRating;Duration;endDay;ClosePrice;OpenPrice;Competitive?
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Automotive;US;3115;7;Tue;0,01;0,01;No
Automotive;US;3115;7;Tue;0,01;0,01;No
Automotive;US;3115;7;Tue;0,01;0,01;Yes

The actual file contains no blank lines between the rows; they are only added here because the data would otherwise display incorrectly. I want to calculate the standard deviation for each category.

I tried to use statistics.stdev(), but that does not work. Can anyone help me? And when you have the answer, can you explain it so I can learn?

from csv import DictReader

from collections import defaultdict
from statistics import median

from locale import setlocale
from locale import LC_ALL
from locale import atof

setlocale(LC_ALL, 'Dutch_Netherlands.1252')

median_names = 'sellerRating', 'Duration', 'ClosePrice', 'OpenPrice'
print ("Mediaan : ")
data = defaultdict(list)
with open('bijlage.txt') as f:
    csvreader = DictReader(f, delimiter=';')
    for dic in csvreader:
        for header, value in dic.items():
            data[header].append(value)

for median_name in median_names:
    med = median(map(atof, data[median_name]))
    print('{:<13} {:>10}'.format(median_name, med))

from collections import defaultdict
import csv
import locale
import statistics
from pprint import pprint, pformat

locale.setlocale(locale.LC_ALL, 'Dutch_Netherlands.1252')

avg_names = 'sellerRating', 'Duration', 'ClosePrice', 'OpenPrice'
averages = {avg_name: 0 for avg_name in avg_names}

seller_ratings = defaultdict(list)

num_values = 0
with open('bijlage.txt', newline='') as bestand:
    csvreader = csv.DictReader(bestand, delimiter=';')
    for row in csvreader:
        num_values += 1
        for avg_name in avg_names:
            averages[avg_name] += locale.atof(row[avg_name])

        # Collect the seller rating per category while the row is in scope.
        seller_ratings[row['Category']].append(locale.atof(row['sellerRating']))

for avg_name, total in averages.items():
    averages[avg_name] = total / num_values

print()
print('Averages:')
for avg_name in avg_names:
    rounded = locale.format_string('%.2f', round(averages[avg_name], 2),
                               grouping=True)
    print('  {:<13} {:>10}'.format(avg_name, rounded))

modes = {}
for category, values in seller_ratings.items():
    try:
        modes[category] = statistics.mode(values)
    except statistics.StatisticsError:
        modes[category] = None  # No unique mode.

print()
print('Modes:')
for category, mode in modes.items():
    if mode is None:
        print('  {:<20} {:>10}'.format(category, '-'))
    else:
        rounded = locale.format_string('%.2f', round(mode, 2), grouping=True)
        print('  {:<20} {:>10}'.format(category, rounded))

When you say that your code doesn't work, can you explain what's wrong? In what way it fails? — Yakov Dan, Jan 07 '19 at 12:51
What do you mean by "does not work"? How about the reader - does it properly read all your attributes? The easiest way to calculate the std is using `numpy`. — offeltoffel, Jan 07 '19 at 12:51
The primary problem is the formatting of this data. If the data were in CSV format, it would be trivial to calculate mean, std dev, and other stats in python (pandas), R, or other tools. Basically, you have an ETL (Extract-Transform-Load) problem, not a standard deviation problem. — Paul, Jan 07 '19 at 12:52
The newly updated data is a CSV format with `;` separator and non-standard decimal (`,`). The options to [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) might be able to handle it. You could also try changing `,` to `.` and then `;` to `,`. — Paul, Jan 07 '19 at 12:56

score 2 · Answer 1 · answered Jan 07 '19 at 12:58

One of your previous questions already described how to get the average, median and similar statistics: https://stackoverflow.com/a/54021108/8181134
Using the same approach, but with the .std() function, you can get the standard deviation:

import pandas as pd
df = pd.read_csv('bijlage.csv', delimiter=';', decimal=',')  # 'bijlage.txt' in your case
sellerRating_std = df['sellerRating'].std()
print('Seller rating standard deviation: {}'.format(sellerRating_std))
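
As a side note on what `.std()` reports: by default pandas computes the sample standard deviation (dividing by n - 1), which corresponds to `statistics.stdev` in the standard library, while `statistics.pstdev` gives the population version (dividing by n). A minimal sketch with the standard library only, using the sellerRating values from the ten sample rows in the question; the variable names are illustrative:

```python
from statistics import pstdev, stdev

# sellerRating values from the ten sample rows shown in the question:
# seven rows with 3249 and three rows with 3115.
ratings = [3249] * 7 + [3115] * 3

sample_sd = stdev(ratings)       # sample standard deviation, divides by n - 1
population_sd = pstdev(ratings)  # population standard deviation, divides by n

print('sample:     {:.2f}'.format(sample_sd))
print('population: {:.2f}'.format(population_sd))
```

Which of the two is appropriate depends on whether the file is treated as a sample or as the complete population.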

score 0 · Answer 2 · answered Jan 07 '19 at 13:07

First of all, note that median_names = 'sellerRating', 'Duration', 'ClosePrice', 'OpenPrice' already creates a tuple; Python does not require the parentheses.

Writing them explicitly only makes the intent clearer, like this: median_names = ('sellerRating', 'Duration', 'ClosePrice', 'OpenPrice')

Either way, you can compute the standard deviation just like you computed the median:

from csv import DictReader

from collections import defaultdict
from statistics import stdev

from locale import setlocale
from locale import LC_ALL
from locale import atof

setlocale(LC_ALL, 'Dutch_Netherlands.1252')

stddev_names = ('sellerRating', 'Duration', 'ClosePrice', 'OpenPrice')
print ("std dev : ")
data = defaultdict(list)
with open('bijlage.txt') as f:
    csvreader = DictReader(f, delimiter=';')
    for dic in csvreader:
        for header, value in dic.items():
            data[header].append(value)

for name in stddev_names:
    stddev_val = stdev(map(atof, data[name]))
    print('{:<13} {:>10}'.format(name, stddev_val))

score 0 · Answer 3 · answered Jan 07 '19 at 13:14

Your first way (for the median) is the way to go if you want to use the statistics module:

setlocale(LC_ALL, 'Dutch_Netherlands.1252')

median_names = 'sellerRating', 'Duration', 'ClosePrice', 'OpenPrice'
print ("Mediaan : ")
data = defaultdict(list)
with open('bijlage.txt') as f:
    csvreader = DictReader(f, delimiter=';')
    for dic in csvreader:
        for header, value in dic.items():
            data[header].append(value)

for median_name in median_names:
    med = median(map(atof, data[median_name]))
    print('{:<13} {:>10}'.format(median_name, med))

This part is unchanged; you just have to compute the stdev immediately after it, because you can reuse the same data dictionary of lists:

from statistics import stdev
print("\nStd Dev (sample)")
for median_name in median_names:
    std = stdev(map(atof, data[median_name]))
    print('{:<13} {:>10}'.format(median_name, std))
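
Since the question asks for the standard deviation per category, the same idea can be extended by grouping the values by the Category column first. The following is a minimal sketch, not a definitive implementation: `to_float` and `stdev_per_category` are hypothetical helper names, the rows in `SAMPLE` are made-up values in the question's format (inlined so the sketch runs without `bijlage.txt`), and the decimal comma is handled with a plain string replace instead of `setlocale`, on the assumption that the numbers contain no thousands separators.

```python
import csv
import io
from collections import defaultdict
from statistics import stdev

# Made-up rows in the same layout as the question's file (illustration only).
SAMPLE = """Category;currency;sellerRating;Duration;endDay;ClosePrice;OpenPrice;Competitive?
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3251;5;Mon;0,50;0,01;No
Automotive;US;3115;7;Tue;0,01;0,01;No
Automotive;US;3115;7;Tue;0,01;0,01;Yes
Books;US;100;3;Wed;1,50;1,00;Yes
"""


def to_float(text):
    """Convert a decimal-comma string such as '0,01' to a float."""
    return float(text.replace(',', '.'))


def stdev_per_category(f, column):
    """Return {category: sample standard deviation of `column`}."""
    groups = defaultdict(list)
    for row in csv.DictReader(f, delimiter=';'):
        groups[row['Category']].append(to_float(row[column]))
    # stdev() raises StatisticsError for fewer than two values,
    # so categories with a single row are reported as None.
    return {category: stdev(values) if len(values) > 1 else None
            for category, values in groups.items()}


result = stdev_per_category(io.StringIO(SAMPLE), 'sellerRating')
for category, value in result.items():
    shown = '-' if value is None else '{:.2f}'.format(value)
    print('{:<20} {:>10}'.format(category, shown))
```

To run this on the real data, pass an open file object (for example from `open('bijlage.txt', newline='')`) instead of the `io.StringIO` wrapper.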
