0

I have a dataset in following format.

row_num;locale;day_of_week;hour_of_day;agent_id;entry_page;path_id_set;traffic_type;session_durantion;hits
"988681;L6;Monday;17;1;2111;""31672;0"";6;7037;\N" "988680;L2;Thursday;22;10;2113;""31965;0"";2;49;14" "988679;L4;Saturday;21;2;2100;""0;78464"";1;1892;14" "988678;L3;Saturday;19;8;2113;51462;6;0;1;\N"

I want it to be in following format :

row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N
988680 L2 Thursday 22 10 2113 31965 0 2 49 14
988679 L4 Saturday 21 2 2100 0 78464 1 1892 14
988678 L3 Saturday 19 8 2113 51462 6 0 1 N

I tried with the following code :

import pandas as pd

df = pd.read_csv("C:\Users\Rahhy\Desktop\trivago.csv", delimiter = ";")

But I am getting a error as :

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
  • The path looks fishy. – DirtyBit Apr 11 '19 at 13:05
  • @DirtyBit Meaning ?? – Rahul Kumar Apr 11 '19 at 13:06
  • Okay, you edited it now. – DirtyBit Apr 11 '19 at 13:07
  • @pistol2myhead I tried those solutions earlier given in the link you have shared. None of them worked so I posted my query. – Rahul Kumar Apr 11 '19 at 13:09
  • 1
    @RahulKumar The error arises because you didn't escape the backslashes in the path. The solutions in the linked question fix this. Please use one of them, run the code again and post the new error message. – LcdDrm Apr 11 '19 at 13:12
  • Your real problem is not the `delimiter` but the multiple chars that you need to get rid of. – DirtyBit Apr 11 '19 at 13:15
  • the link for the original datasets https://zealpath-storage.s3.amazonaws.com/cases/Build%20a%20predictive%20%27hits%27%20model%20%28Copy%29-428/ML_Data_Scientist_Case_Study_Data.zip?Signature=wFl1S51c7auTogQvrwRDMuAW9dk%3D&Expires=1554992389&AWSAccessKeyId=AKIAJTXGMWYZOKXO4LZA – Rahul Kumar Apr 11 '19 at 13:21
  • Possible duplicate of ["Unicode Error "unicodeescape" codec can't decode bytes... Cannot open text files in Python 3](https://stackoverflow.com/questions/1347791/unicode-error-unicodeescape-codec-cant-decode-bytes-cannot-open-text-file) – legoscia Apr 11 '19 at 13:26

2 Answers2

0

Using replace():

with open("data_test.csv", "r") as fileObj:
    contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
print(contents)

OUTPUT:

row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N 988680 L2 Thursday 22 10 2113 31965 0 2 49 14 988679 L4 Saturday 21 2 2100 0 78464 1 1892 14 988678 L3 Saturday 19 8 2113 51462 6 0 1 N

EDIT:

You can open a file, read its content, replace the unwanted chars. write the new contents to the file and then read it through pd.read_csv:

with open("data_test.csv", "r") as fileObj:
    contents = fileObj.read().replace(';',' ').replace('\\', '').replace('"', '')
# print(contents)

with open("data_test.csv", "w+") as fileObj2:
    fileObj2.write(contents)

import pandas as pd
df = pd.read_csv(r"data_test.csv", index_col=False)
print(df)

OUTPUT:

row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits
988681 L6 Monday 17 1 2111 31672 0 6 7037 N 988680 L2 Thursday 22 10 2113 31965 0 2 49 14 988679 L4 Saturday 21 2 2100 0 78464 1 1892 14 988678 L3 Saturday 19 8 2113 51462 6 0 1 N
DirtyBit
  • 16,613
  • 4
  • 34
  • 55
  • With pandas , Cant it be imported ?? I want to do some data analysis on it. – Rahul Kumar Apr 11 '19 at 13:12
  • @RahulKumar sure, adding it in a moment. – DirtyBit Apr 11 '19 at 13:18
  • Yes, this works but the no. of variables are 10 and corresponding values are 11... how to deal with it ?? – Rahul Kumar Apr 11 '19 at 13:56
  • @RahulKumar Indeed, works fine. Do you get any errors? – DirtyBit Apr 11 '19 at 14:00
  • NO Errors in execution of above codes but when I try to retrieve a column say `df["hits"]` it is throwing error `KeyError: 'hits'` – Rahul Kumar Apr 11 '19 at 14:09
  • 1
    Use `index_col=False` to not use the first data field for indexing (see my answer or https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) – nitzel Apr 11 '19 at 14:10
  • @DirtyBit since you've replaced all semicolons with whitespaces `read_csv` need's the `delimiter=' '` to parse the csv correctly. (See rejected edit). – nitzel Apr 12 '19 at 07:37
0
import pandas as pd
from io import StringIO

# Load the file to a string (prefix r (raw) to not use \ for escaping)
filename = r'c:\temp\x.csv'
with open(filename, 'r') as file:
    raw_file_content = file.read()

# Remove the quotes which break the CSV file
file_content_without_quotes = raw_file_content.replace('"','')

# Simulate a file with the corrected CSV content
simulated_file = StringIO(file_content_without_quotes)

# Get the CSV as a table with pandas
# Since the first field in each data row shall not be used for indexing we need to set index_col=False
csv_data = pd.read_csv(simulated_file, delimiter = ';', index_col=False)
print(csv_data['hits']) # print some column
csv_data

Since there are 11 data fields and 10 headers only the first 10 fields are used. You'll have to figure out what you want to do with the last one (Values: \N, 14)

Output:

0    7037
1      49
2    1892
3       1

enter image description here

See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

nitzel
  • 1,565
  • 14
  • 14
  • `df = pd.read_csv(simulated_file, delimiter = ';', index_col=False)` works fine but when i type `df["hits"]` it gives this error `KeyError: 'hits'` – Rahul Kumar Apr 11 '19 at 14:15
  • I'm getting the correct output with `df['hits']`. What development environment are you using? Have you made sure to overwrite `df` with the result of my script after running DirtyBit's answer? – nitzel Apr 11 '19 at 14:17
  • Maybe time to check the data you actually read: `print(csv_data)` or `print(csv_data.columns.values)` (these are the column names and it should contain 'hits') – nitzel Apr 11 '19 at 14:28
  • I m on python3.7 and window 10 with spyder ide. Whenever trying to print a column `hits` throwing error `keyerror : 'hits'` – Rahul Kumar Apr 11 '19 at 14:31
  • 1
    `print(csv_data.columns.values)` retrives `['row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits']` I think python here is retriving all separate columns as single column – Rahul Kumar Apr 11 '19 at 14:34
  • Yes result for `csv['row_num locale day_of_week hour_of_day agent_id entry_page path_id_set traffic_type session_durantion hits']` is whole table while csv_data['hits'] is error `Keyerror : hits `.... How to proceed now ? – Rahul Kumar Apr 11 '19 at 14:43
  • I'm pretty sure your file is no longer the original one since you ran DirtyBit's script and replaced it's contents. So the `;` is no longer the separator. Try changing the delimiter to ` `(space) or reset the file to it's original content. – nitzel Apr 11 '19 at 14:59