0

I saved a Pandas Dataframe to a csv file. If import it in Azure AutoML it looks like this:


Screenshot from Azure AutoML - Data is messed up (Updated)


It looks fine if I open it with Excel:


Screenshot from Excel - Data looks correct


I export the dataframe with this line:

df.to_csv(r'*static_path*/output/measurements.csv')

Attempted Workarounds:

  • Open in Excel and resave as csv
  • Open in Excel and resave as tsv
  • Switch around Encoding options in AzureML
  • Created the csv and uploaded it to blob storage with the guide from the Microsoft Docs Create a dataset from pandas dataframe
Johannes Schweer
  • 229
  • 1
  • 2
  • 19
  • 1
    Please share the csv file to check, If the data can't be shared please share sample. – Ram Dec 07 '20 at 04:33
  • @Johannes Schweer what was the dataset type used for this file during upload? Could you please try to use Tabular with a custom delimiter if required? – RohitMungi Dec 07 '20 at 05:43
  • Please follow the below to create Tabular Datasets: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-connect-data-ui#create-datasets – Ram Dec 07 '20 at 06:06
  • @RohitMungi-MSFT I tried it tabulator separators as well. Same bad result again. Here is a little sample of my csv and tsv. Valid for one week: https://wetransfer.com/downloads/694a26ca86754a97282081c27dc8238720201207221917/42a4c9 – Johannes Schweer Dec 07 '20 at 22:21

1 Answers1

1

Please, be aware that, as per your screenshot, you are importing your csv in AzureML with semicolon as delimiter, while df.to_csv will output your information with a comma as separator.

Change the delimiter setting in AzureML to comma in your import settings or your Python code to provide the right separator, as indicated below.

Analyzing your files, also be aware that your first column seems to be the dataframe index, included by default by Pandas when exporting to csv.

Please, try instead:

df.to_csv(r'*static_path*/output/measurements.csv', sep=';', index=False)

In any case, it seems that your data contains carriage returns across your text fields. Consider, for example, the chroma_stft field. It contains a carriage return in the exact position where your screenshot shows the value [0.33353573:

enter image description here

As you can see in the image, the pattern presented in the AzureML screenshot match exactly the different carriage returns in your text fields.

This will be very likely the reason of your problem. Probably AzureML is interpreting these carriage returns as actual line endings, and it is splitting your data accordingly, independently of the fact that the text field value is enclosed between quotes.

You need to properly get rid of these intermediate carriage returns, probably replacing them before exporting the information to csv, applying something like this to the different fields with the problem:

df.chroma_stft = df.chroma_stft.str.replace('\r', '')

Please, also review the ... characters that your text fields also include: as indicated by @Ferris in his/her comment, it is probably related with the fact that this field contains a numpy array and this array is being truncated. In addition to the solutions he/she suggested, please, consider playing with the different numpy print options, especially, threshold and linewidth. I think that tweaking them could be of help.

jccampanero
  • 50,989
  • 3
  • 20
  • 49
  • I followed your suggestion. We eliminated "Coolumn1" as the Dataframe index isn't provided anymore. But the rest of the columns is still messed up. – Johannes Schweer Dec 09 '20 at 14:04
  • 1
    I see. Please, can you upload the modified csv file generated by Pandas after removing the index column again? You do not need to remove `Column1`, now this column will correspond to the `song_id` field – jccampanero Dec 09 '20 at 14:16
  • @jcccampanero Here is a little sample: https://wetransfer.com/downloads/dcebc7c9950bc33aa710f25b6603773b20201209142345/dad0a5 – Johannes Schweer Dec 09 '20 at 14:24
  • Thank you Johannes, I will review it. Please, can you update your AutoML screenshot to see how it looks like? – jccampanero Dec 09 '20 at 14:29
  • Thank you very much @JohannesSchweer. Your data, for example, the `chroma_stft` field, contains a carriage return in the exact position where the screenshot shows the value `[0.33353573`. It will be very likely the reason of your problem: AzureML is interpreting this carriage returns as actual line endings and it is splitting your data accordingly. It seems irrelevant the fact that the field value is enclosed between quotes. I updated the answer with that information. – jccampanero Dec 09 '20 at 16:35
  • 2
    @JohannesSchweer your sample file looks like you insert a 2D ndarray in a single cell, so the sample file has a lot of "...", which means it misses some data. One solution: import the raw data(origin data) into Azure AutoML, handle the data(data cleaning, transformation, etc) in Azure, then do the rest job. – Ferris Dec 10 '20 at 02:44
  • Thank you very much for your comment @Ferris, it makes perfect sense. – jccampanero Dec 10 '20 at 10:01
  • I don't know why I lose data. I'm extracting the data by myself from mp3 files with librosa and dump it as NumPy arrays unformatted in the data frame cells. How can I prevent that the arrays get shortened and the '...' are inserted? As far as I see it, this is the main problem? – Johannes Schweer Dec 10 '20 at 13:29
  • Hi Johannes. Yes, I think so. As indicated in the answer, one think you can try is to set the print options for numpy to a greater `threshold` and or `linewidth`. For instance, something like `numpy.set_printoptions(threshold=sys.maxsize)`. You can restore the setting after the csv export by setting `threshold` to `1000`, the default value. Please, see other options [here](https://stackoverflow.com/questions/1987694/how-to-print-the-full-numpy-array-without-truncation). Please, can you try? It should work properly on its own, but I never tested it in the context of a dataframe `to_csv` method. – jccampanero Dec 10 '20 at 14:01
  • Thank you. I'll test around with the numpy settings and share the results :) – Johannes Schweer Dec 11 '20 at 09:19
  • Thanks Johannes. I hope it works properly. Please, do not hesitate to contact me if I can be of any help. – jccampanero Dec 11 '20 at 10:04
  • I followed your suggestions. Maxing the threshold was a great tip. The problem is that for example chroma_stft is a numpy.ndarray, so I can't just str.replace all "new lines". But it is necessary to replace them somehow because that is the main reason why other programs can not read the csv. – Johannes Schweer Dec 14 '20 at 10:50
  • Hi Johannes. And, did you try to adjust also the value of `linewidth`? It is `75` by default, but you can increase that value as necessary. – jccampanero Dec 14 '20 at 11:07
  • Yes, I increased it but it didn't change anything in the csv... – Johannes Schweer Dec 14 '20 at 14:21
  • I am sorry to hear that. I am running out of ideas... Maybe, one last try: if you know the size of your array, you can reshape, or better said, [`ravel`](https://numpy.org/doc/stable/reference/generated/numpy.ravel.html) your array, transform your 2D array in a 1D array. Maybe the line breaks are being introduced per every 2D array row, and this can solve the problem. You can always calculate the size of the array if necessary prior to array conversion using `shape`. If required, you can also store the shape of the original array in a new column of your dataset so you can reshape it back later – jccampanero Dec 14 '20 at 16:22
  • 1
    @JohannesSchweer Were you able to try the proposed solution with numpy `ravel`? – jccampanero Dec 15 '20 at 11:33
  • I don't think azureML could handle the combination of 2d arrays and the shape. Seems like ndarrays are not meant to be exported in a single dataframe cell to csv. Thanks for your effort.. My Solution is to exclude the ndarray features and handle them separately. Thank your effort. – Johannes Schweer Dec 16 '20 at 13:31
  • You are welcome @JohanessSchweer. Yes, I sometimes asked myself which was the point of having the information in that way because, as you realized, it is not a natural Pandas or Azure ML use case. You can import separately both datasets and join then later, or process the information in some other different ways, or take a different approach and use the Azure ML Python SDK directly. I hope you find a solution. – jccampanero Dec 16 '20 at 13:44