Automating the plotting of more than 100 .txt files using pandas, NaN problems

Question

Good afternoon

I am trying to import more than 100 separate .txt files containing data I want to plot. I would like to automate this process, since repeating the same steps for every individual file is most tedious.

I have read up on how to read multiple .txt files and found a nice explanation. However, following the example, all my data gets imported as NaNs. I read up some more and found a more reliable way of importing .txt files, namely pd.read_fwf(), as can be seen here.

Although I can at least see my data now, I have no clue how to plot it, since the data is in one column separated by \t, e.g.

0 Extension (mm)\tLoad (kN)\tMachine extension (mm)\tPreload extension

1 0.000000\t\t\t

2 0.152645\t0.000059312\t.....

... etc.

I have tried using different separators in both pd.read_csv() and pd.read_fwf(), including ' ', '\t' and '-s+', but to no avail.

Of course this causes a problem, because now I can not plot my data. Speaking of, I am also not sure how to plot the data in the dataframe. I want to plot each .txt file's data separately on the same scatter plot.

I am very new to Stack Overflow, so pardon the format of the question if it does not conform to the normal standard. I attach my code below, but unfortunately I cannot attach my .txt files. Each .txt file contains about a thousand rows of data. I attach a picture of the general format of all the files. [Image: General format of the .txt files.]

import numpy as np
import pandas as pd
from matplotlib import pyplot as pp
import os
import glob

# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")

# get the file names
leggername = [i for i in glob.glob("*.txt")]

# put everything in a dataframe
df = [pd.read_fwf(legger) for legger in leggername]
df

EDIT: the output I get now for the DataFrame is:

[ Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.152645\t0.000059312\t-...
4
... ...
997 76.0173\t0.037706\t0.005...
998
999 76.1699\t0.037709\t\t
1000
1001

   from  Preload  (mm)

0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN

[1002 rows x 4 columns], Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.128151\t0.000043125\t-...
4
... ...
997 63.8191\t0.034977\t-0.00...
998
999 63.9473\t0.034974\t\t
1000
1001

   from  Preload  (mm)

0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN

[1002 rows x 4 columns], Time (s)\tLoad (kN)\tMachine Extension (mm)\tExtension
0
1 0.000000\t\t\t
2
3 0.174403\t0.000061553\t0...
4
... ...
997 86.8529\t0.036093\t-0.00...
998
999 87.0273\t\t-0.0059160\t-...
1000
1001

   from  Preload  (mm)

0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
... ... ... ...
997 NaN NaN NaN
998 NaN NaN NaN
999 NaN NaN NaN
1000 NaN NaN NaN
1001 NaN NaN NaN

... etc

Upload a sample of the dataframe with messed up formatting please, I will try to help you split it [how to upload a sample df](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — Patryk Kowalski, Jan 20 '22 at 11:06
What is the question here? How to read multiple text files or that you have NaN values after importing the text files? — Mr. T, Jan 20 '22 at 11:07
`df = [pd.read_fwf(legger) for legger in leggername]` does not result in a dataframe, but a list of dataframes. You'll need to concatenate those: `df = pd.concat([pd.read_fwf(legger) for legger in leggername])` may work (if all individual dataframes have the same structure). — 9769953, Jan 20 '22 at 11:10
For convenience, just use `leggername = glob.glob("*.txt")` instead of `leggername = [i for i in glob.glob("*.txt")]`. The list comprehension is basically redundant. — 9769953, Jan 20 '22 at 11:11
Can you ignore the first row with the `0.00000` value? That would make things a lot easier. — 9769953, Jan 20 '22 at 11:14
"I want to plot each .txt file's data separately on the same scatter plot.": *how* separately? Different symbol, color, or just all the same (in which case there isn't much "separately")? — 9769953, Jan 20 '22 at 11:15
@Mr.T My question is how to get rid of the \t in the .txt files that are read correctly, as well as the NaNs in the .txt files that are read incorrectly, and then how to plot the data once it is read correctly. — Philip de Bruin, Jan 20 '22 at 11:15
You've just made your question unreadable, with all the output. — 9769953, Jan 20 '22 at 11:18
@9769953 Thank you for the suggestions, your first suggestion puts everything in one dataframe, which I believe will assist a lot in plotting. As for deleting a few rows: no problem, there is enough data so that all the necessary information is still available. — Philip de Bruin, Jan 20 '22 at 11:19
Ok, then your problem might be easily solved. I would think the very first data row (with the single 0.0000) would be important to keep, but if it can be ignored, all the easier. — 9769953, Jan 20 '22 at 11:20
@9769953 I want to plot them separately in different colours as well as symbols. The graphs will be used for a thesis, so the different samples should be distinguishable when printed in black and white. I do realise that this will probably limit the amount of samples I have on the same graph, but once the code is sorted I can play around with that. — Philip de Bruin, Jan 20 '22 at 11:21
If you have 100 files, there will not be enough distinct colors or symbols to see the separate samples in a single plot. — 9769953, Jan 20 '22 at 11:25
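
One way around the limit on distinguishable symbols is to split the files into small batches and draw one figure per batch, reusing the same few markers in each. This is only a sketch and was not proposed in the thread. The helper names `batches` and `marker_maps`, the batch size, and the choice of markers are all assumptions made for illustration.

```python
from itertools import islice

# Six filled matplotlib marker codes; filled shapes stay distinguishable
# when printed in black and white. The selection itself is an assumption.
MARKERS = ["o", "s", "^", "D", "v", "P"]


def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def marker_maps(filenames, markers=MARKERS):
    """Yield one {filename: marker} dict per figure.

    Each dict holds at most len(markers) files, so no marker is
    used twice within the same figure.
    """
    for chunk in batches(sorted(filenames), len(markers)):
        yield dict(zip(chunk, markers))
```

Each mapping could then be handed to a plotting call, for example as the `markers` argument of `sns.scatterplot` together with a `style` column that holds the file name, producing one figure per batch.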

9769953 · Accepted Answer · 2022-01-23T15:34:03.537

The basic gist is to skip the first data row (that has a single value in it), then read the individual files with pd.read_csv, using tab as the separator, and stack them together.

There is, however, a more problematic issue: the data files turn out to be UTF-16 encoded (the binary data show a NUL character at the even positions), but there is no byte-order-mark (BOM) to indicate this. As a result, you can't specify the encoding in read_csv, but have to manually read each file as binary, then decode it with UTF-16 to a string, then feed that string to read_csv. Since the latter requires a filename or IO-stream, the text data needs to be put into a StringIO object first (or save the corrected data to disk first, then read the corrected file; might not be a bad idea).

import pandas as pd
import os
import glob
import io

# change the working directory
os.chdir(r"C:\Users\Philip de Bruin\Desktop\Universiteit van Pretoria\Nagraads\sterktetoetse_basislyn\trektoetse\speel")

dfs = []
for filename in glob.glob("*.txt"):
    with open(filename, 'rb') as fp:
        data = fp.read()  # a single file should fit in memory just fine
    # Decode the UTF-16 data that is missing a BOM
    string = data.decode('UTF-16')
    # And put it into a stream, for ease-of-use with `read_csv`
    stream = io.StringIO(string) 

    # Read the data from the, now properly decoded, stream
    # Skip the single-value row, and use tabs as separators
    df = pd.read_csv(stream, sep='\t', skiprows=[1])

    # To keep track of the individual files, add an "origin" column
    # with its value set to the corresponding filename
    df['origin'] = filename
    dfs.append(df)

# Concatenate all dataframes (default is to stack the rows)
df = pd.concat(dfs)


# For a quick and dirty plot, you can enjoy the power of Seaborn
import seaborn as sns
# Use appropriate (full) column names, and use the 'origin' 
# column for the hue and symbol
sns.scatterplot(data=df, x='Time (s)', y='Machine Extension (mm)', hue='origin', style='origin')

Seaborn's scatterplot documentation.
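
The decoding step can be made more explicit. When no BOM is present, Python's generic "utf-16" codec falls back to the platform's native byte order, so naming the byte order ("utf-16-le") avoids relying on that. The sketch below uses only the standard library. The function names `looks_like_bomless_utf16_le` and `decode_text` are hypothetical, and the NUL-byte check is a heuristic that assumes the start of the file holds only ASCII-range characters, as in the `b'T\x00i\x00m\x00e\x00'` sample from the comments.

```python
import codecs


def looks_like_bomless_utf16_le(raw: bytes, sample: int = 64) -> bool:
    """Heuristic: no BOM, and every second byte near the start is NUL."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False
    high_bytes = raw[:sample][1::2]  # bytes at offsets 1, 3, 5, ...
    return len(high_bytes) > 0 and all(b == 0 for b in high_bytes)


def decode_text(raw: bytes) -> str:
    """Decode file contents, handling BOM-less little-endian UTF-16."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # A BOM is present, so the generic codec can pick the byte order.
        return raw.decode("utf-16")
    if looks_like_bomless_utf16_le(raw):
        # No BOM: state the byte order explicitly.
        return raw.decode("utf-16-le")
    # Assumption: anything else is plain UTF-8.
    return raw.decode("utf-8")
```

The string returned by `decode_text` can be wrapped in `io.StringIO` and passed to `pd.read_csv` in the same way as in the loop above.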

edited Jan 23 '22 at 15:34

answered Jan 20 '22 at 11:24

9769953


I get everything NaN now, although the dataframe looks much more ordered. Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 0 NaN NaN NaN NaN 1 NaN NaN NaN NaN 2 NaN NaN NaN NaN 3 NaN NaN NaN NaN 4 NaN NaN NaN NaN ... ... ... ... ... 995 NaN NaN NaN NaN 996 NaN NaN NaN NaN 997 NaN NaN NaN NaN 998 NaN NaN NaN NaN 999 NaN NaN NaN NaN 8800 rows × 4 columns – Philip de Bruin Jan 20 '22 at 11:32
Yes, sorry, the '\t' comment was a copying error – Philip de Bruin Jan 20 '22 at 11:33
I looked at your P1T01 example file, and with the above code, you shouldn't get NaNs. Try with `glob.glob("P1T01.txt")` (or whatever the file is called exactly) first; the error might be caused by another, incompatible file. – 9769953 Jan 20 '22 at 11:35
I still get the same, only now the rows are down to 1000, which is the same as the amount of rows in the .txt file. – Philip de Bruin Jan 20 '22 at 11:51
Is there any way you can (and are allowed to) make a few of those input text files available to play around with? There's probably something about the files that we can't see that's causing the NaNs. – 9769953 Jan 20 '22 at 11:53
@PhilipdeBruin If you have a GitHub or GitLab account, that is one possible place. Dropbox might be another one. Depending on the size, you could copy-paste an entire file into a [pastebin](https://pastebin.com/) anonymously. There are probably a few others like that. – 9769953 Jan 21 '22 at 08:35
Yes, of course. I can not find a way to upload it to Stack Overflow, so here is a link to the folder on my Google Drive. https://drive.google.com/drive/folders/19C-kyW5Ei8PBQ6ozw3IRuWz8D7kxnwgB?usp=sharing – Philip de Bruin Jan 21 '22 at 08:46
@PhilipdeBruin I think I've found the issue (and solved it), but to verify that Google didn't mess up the files: could you copy-paste the output of `print(open('P1T01.txt', 'rb').read()[:8])` in a comment below? (If you're running Python from a command line, you could just do `python -c "print(open('P1T01.txt', 'rb').read()[:8])"`.) – 9769953 Jan 21 '22 at 10:56
Sorry for the delayed response. If I print the first command in Jupyter Notebook, I get the following output: b'T\x00i\x00m\x00e\x00' – Philip de Bruin Jan 23 '22 at 13:53
Thanks, that confirms what I'm getting, and what I think is the issue (mainly) behind your problem with the NaNs. – 9769953 Jan 23 '22 at 15:32
@PhilipdeBruin See my updated answer. – 9769953 Jan 23 '22 at 15:34
Thank you so much, I would never have figured that out on my own. It works like a charm! – Philip de Bruin Jan 24 '22 at 06:29
It is a weird and unexpected thing, so it's no wonder you got stumped. You should really check how you get your input data files; it would appear some software that writes or processes these files is faulty. Perhaps there is a configuration option in a piece of software that ensures the output is UTF-8 instead of (malformed) UTF-16 (UTF-8 is generally considered better), which may save headaches and debugging time in the future. (Unless, of course, this is just a one-off, caused in some weird, unreproducible, way.) – 9769953 Jan 24 '22 at 06:44
Thank you, I will have a look. I suspect that the computer connected to the machine doing the tests is at fault; both the machine and the computer are old. – Philip de Bruin Jan 25 '22 at 10:54
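
If the software that writes the files cannot be configured to produce UTF-8, the files could be converted once on disk, as the answer suggests in passing. The following is a minimal sketch of that idea using only the standard library. The function name `reencode_to_utf8`, the separate output directory, and the default source encoding "utf-16-le" are assumptions, the last one based on the byte sample posted above.

```python
from pathlib import Path


def reencode_to_utf8(src, dst_dir, source_encoding="utf-16-le"):
    """Write a UTF-8 copy of `src` into `dst_dir` and return the new path.

    The original file is left untouched. Line endings are preserved
    because the text is written back as bytes.
    """
    src = Path(src)
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)

    text = src.read_bytes().decode(source_encoding)
    dst = dst_dir / src.name
    dst.write_bytes(text.encode("utf-8"))
    return dst
```

After converting every file this way, the copies are ordinary UTF-8 text, so they can be read by file name without the manual decoding step.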
