Generating scatter plot for a .txt file

Question

I'm trying to draw a scatter plot from data extracted from a txt file, where the x-axis is the elapsed time of my test run and the Y-axis is the time taken by each successful ping command. I am able to do this using the following block of code:

import matplotlib.pyplot as plt

#Read data from file
data = open("EDS4008_2023_04_12.txt",'r')
list = data.readlines()

#Extract time and ping values from the data
time = []
ping_time = []
index = 0

for string in list:
    #print(string)
    index += 1
    if (" time" in string) and (" bytes" in string):
        var1 = string.split(' ')[8]
        #print(var1)
        if "=" in var1:
            t = var1.split("=")[1].split("m")[0]
            ping_time.append(t)
            time.append(index / 60)
        elif "<" in var1:
            t = var1.split("<")[1].split("m")[0]
            ping_time.append(t)
            time.append(index / 60)
        else:
            print("error")
            ping_time.append('0')
            time.append(index / 60)
    else:
        print("skip this line")
        ping_time.append('0')
        time.append(index / 60)


print(ping_time)
print(time)

plt.scatter(time, ping_time)

plt.xlabel('Time (minutes)')
plt.ylabel('Ping Time (ms)')
plt.title('Pinging Duration')

plt.show()

Here is the scatter plot:

I would like to know if there is a way to remove all the points plotted at 0 on the Y-axis, and how I can plot horizontal lines that represent the max and min of the Y-axis readings found in the txt file, so that the final graph will look something like this:

Here is an example of what the txt file looks like:

Your question actually is about two things (not related to how to generate a scatter plot from data in a txt file): (1) how to select data from a list based on a condition (2) how to add horizontal lines to a plot. The answer to (1) can be found e.g. in this post: (https://stackoverflow.com/questions/3030480/how-do-i-select-elements-of-an-array-given-condition) In your case it should be something like: `plt.scatter(time[ping_time>0], ping_time[ping_time>0])` The answer to (2) is using `axhline`, see (https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axhline.html). — spyros, Apr 18 '23 at 23:55
For the first point your method didn't work as ping_time is a list and I get this error when I try it: TypeError: '>' not supported between instances of 'list' and 'str' — H A, Apr 19 '23 at 00:08
① You can use Numpy, please consider that Matplotlib imports Numpy, so importing Numpy yourself is not particularly wasteful ② If you absolutely desire to avoid Numpy, you can collect the "good" data as follows: `x,y=zip(*((x,y)for x,y in zip(t,ping_time)if y>0))` — gboffi, Apr 19 '23 at 20:49
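A minimal sketch of the NumPy approach suggested in the comments above. It assumes the ping values are converted to numbers first (the lists in the question hold strings, which is why the comparison fails). The sample values and the names `raw_ping`, `raw_minutes`, `ping_ms`, `elapsed_minutes` and `mask` are made up for illustration.

```python
import numpy as np

# Hypothetical sample data: ping values stored as strings, as in the question.
raw_ping = ['12', '0', '7', '0', '30']
raw_minutes = [1 / 60, 2 / 60, 3 / 60, 4 / 60, 5 / 60]

# Convert to numeric arrays so that element-wise comparison is possible.
ping_ms = np.array([float(value) for value in raw_ping])
elapsed_minutes = np.array(raw_minutes)

# Boolean mask that is True only where a real (non-zero) reading exists.
mask = ping_ms > 0

kept_minutes = elapsed_minutes[mask]
kept_ping = ping_ms[mask]
print(kept_ping)
```

The filtered arrays can then be passed to `plt.scatter(kept_minutes, kept_ping)`.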

score 0 · Answer 1 · answered Apr 21 '23 at 21:26

Since you are constructing your data points after reading the data from the input file, if you don't want the data with a 0 value on the Y-axis, why not simply avoid adding them to your data points in the first place?

Without changing your code too much, it could look like this:

import matplotlib.pyplot as plt

#Read data from file
#data = open("EDS4008_2023_04_12.txt",'r')
input_file = open("input_data.txt")
input_lines = input_file.readlines()

#Extract time and ping values from the data
time = []
ping_time = []
index = 0

for current_line in input_lines:
    #print(current_line)
    index += 1
    if (" time" in current_line) and (" bytes" in current_line):
        var1 = current_line.split(' ')[8]
        #print(var1)
        if "=" in var1:
            t = var1.split("=")[1].split("m")[0]
            ping_time.append(int(t))
            time.append(index / 60)
        elif "<" in var1:
            t = var1.split("<")[1].split("m")[0]
            ping_time.append(int(t))
            time.append(index / 60)
        else:
            print("error")
            # Don't add this point
            # ping_time.append(0)
            # time.append(index / 60)
    else:
        print("skip this line")
        # Don't add this point
        # ping_time.append(0)
        # time.append(index / 60)


print(ping_time)
print(time)

max_value = max(ping_time)
min_value = min(ping_time)

plt.scatter(time, ping_time)

plt.xlabel('Time (minutes)')
plt.ylabel('Ping Time (ms)')
plt.title('Pinging Duration')
plt.axhline(max_value)
plt.axhline(min_value)

plt.show()

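To draw the lines for the max and min of the readings, we use these functions:

max() and min() to find the largest and smallest reading
plt.axhline() to draw a horizontal line at each of those values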
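A minimal sketch of those calls on their own, assuming a handful of made-up readings. It uses the `Axes` method `ax.axhline()`, which behaves like `plt.axhline()`; the colours, line style and labels are optional choices for illustration, not part of the code above.

```python
import matplotlib.pyplot as plt

# Hypothetical readings, only for illustration.
timestamps = [1 / 60, 2 / 60, 3 / 60]
ping_time = [12, 7, 30]

fig, ax = plt.subplots()
ax.scatter(timestamps, ping_time)

# One horizontal line per extreme value; colour, style and label are optional.
max_line = ax.axhline(max(ping_time), color='red', linestyle='--', label='max')
min_line = ax.axhline(min(ping_time), color='green', linestyle='--', label='min')
ax.legend()
```

Call `plt.show()` afterwards to display the figure.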

I would also stress that you should not create variables with generic names, because some of these names are already used by Python itself or by its standard library. For example, these names already exist:

list is a fundamental built-in class in Python
time is a module of the standard library
string is a module of the standard library
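A small sketch of what goes wrong when such a name is reused, assuming a variable called `list` as in the question. The sample lines and the variable `message` are invented for the example.

```python
# Reusing the name hides the built-in class for the rest of the module.
list = ["first line\n", "second line\n"]

try:
    # This would normally build ['a', 'b', 'c'], but `list` is now a plain list object.
    list("abc")
except TypeError as error:
    message = str(error)

print(message)

# Deleting the module-level variable makes the built-in visible again.
del list
print(list("abc"))
```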

But we can do better.

Whenever you have to extract data from text, regular expressions are often a useful tool.

import re

import matplotlib.pyplot as plt

matcher = re.compile(r'bytes=\d+ time[=<](?P<duration>\d+)ms')

with open("input_data.txt") as input_file:
    ping_time = []
    timestamps = []
    # start=1 keeps the x values identical to the index-based loop used earlier
    for (line_number, current_line) in enumerate(input_file, start=1):
        if line_matched := matcher.search(current_line):
            timestamps.append(line_number / 60)
            ping_time.append(int(line_matched.group('duration')))
        else:
            print('Skip this line')

print(ping_time)
print(timestamps)

max_value = max(ping_time)
min_value = min(ping_time)

plt.scatter(timestamps, ping_time)

plt.xlabel('Time (minutes)')
plt.ylabel('Ping Time (ms)')
plt.title('Pinging Duration')
plt.axhline(max_value)
plt.axhline(min_value)

plt.show()

To understand this code you will need to read up on:

Regular expressions using the re module
enumerate()
Using context managers to make sure a file is closed when you don't need it anymore as described in the second example of this section of the official tutorial
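To see what the pattern captures, here is a minimal sketch that runs it against a few sample lines. The lines are an assumption, written in the style of Windows ping output; the real file may carry extra fields such as a timestamp prefix, which `search()` simply skips over.

```python
import re

matcher = re.compile(r'bytes=\d+ time[=<](?P<duration>\d+)ms')

# Assumed sample lines, not taken from the real input file.
sample_lines = [
    'Reply from 192.168.0.1: bytes=32 time=5ms TTL=64',
    'Reply from 192.168.0.1: bytes=32 time<1ms TTL=64',
    'Request timed out.',
]

durations = []
for line_number, current_line in enumerate(sample_lines, start=1):
    # search() returns None for lines that do not contain the pattern.
    if line_matched := matcher.search(current_line):
        durations.append((line_number, int(line_matched.group('duration'))))

print(durations)
```

Both the `time=5ms` and the `time<1ms` forms are captured by the `[=<]` character class, and the timed-out line is skipped.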

A more advanced way of doing it is to use generator expressions and zip() in addition to regular expressions. The code becomes a little more compact, and the lines of the file are read one at a time instead of being loaded into a list all at once.

import re

import matplotlib.pyplot as plt

matcher = re.compile(r'bytes=\d+ time[=<](?P<duration>\d+)ms')

with open("input_data.txt") as input_file:
    lines_matched = (matcher.search(current_line) for current_line in input_file)
    data_points = (
        (x / 60, int(current_match.group('duration')))
        for (x, current_match) in enumerate(lines_matched, start=1)
        if current_match
    )
    timestamps, ping_time = zip(*data_points)

print(ping_time)
print(timestamps)

max_value = max(ping_time)
min_value = min(ping_time)

plt.scatter(timestamps, ping_time)

plt.xlabel('Time (minutes)')
plt.ylabel('Ping Time (ms)')
plt.title('Pinging Duration')
plt.axhline(max_value)
plt.axhline(min_value)

plt.show()
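The `zip(*data_points)` call used above turns a sequence of (x, y) pairs into one tuple of x values and one tuple of y values. A minimal sketch with made-up pairs, which also shows the case to watch out for: if no line matched, there is nothing to unpack. The names `pairs` and `empty_failed` are hypothetical.

```python
# Hypothetical (timestamp, duration) pairs.
pairs = [(1 / 60, 12), (2 / 60, 7), (4 / 60, 30)]

# "Unzip" the pairs into two parallel tuples.
timestamps, ping_time = zip(*pairs)
print(ping_time)

# With no pairs at all, zip() yields nothing and the unpacking fails.
empty_failed = False
try:
    no_x, no_y = zip(*[])
except ValueError:
    empty_failed = True
print(empty_failed)
```

If the input file might contain no matching lines, it may be worth checking for that case before unpacking and before calling max() and min().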

Is this what you were looking for?
