0

I'm trying to clear the csv data for my project which contains news and unnecessary things(such as javascript code). It's the dataset for our project and my job it to filter it and delete unnecessary characters.

The thing I want to do is to find the index of the character inside the row/column and if it's there delete the characters after it(including the character itself).

I have wrote the code to check the index and can replace the exact character, but the problem is that I want to delete all the characters after that character.

I have tried implementing Pandas library to get the data and replace the exact row. But, as seen from the code, it just replaces the exact char with empty. I want to find the index of char(let say "window") and delete the characters that come after "window" char inside the row.

import pandas as pd
import numpy as np
import csv


pathtofile = "t1.csv"
data = pd.read_csv(pathtofile, encoding='utf-8' ,index_col=0)

print(type(data)) #which gives dataframe as output
print(data.head()) #prints out [id, contetn, date]

sub = 'window._ttzi' #its char array that i'm searching using find()
data["Indexes"]= data["contetn"].str.find(sub)
print(data) #prints the csv file with additional index

data = data.replace('window._ttzi', '')

#data.to_csv("t1edited.csv", encoding = 'utf-8')
print(data)   
Nijat Mursali
  • 930
  • 1
  • 8
  • 20
  • 1
    Do you want to delete the whole row after your string or just up to the next column? – apilat Feb 12 '19 at 15:12
  • "it just replaces the exact char with empty." it's because you're using the `replace()` function to replace what it find with empty look at the second argument (`''`) in your code: `.replace('window._ttzi', '')` – mrbTT Feb 12 '19 at 15:14
  • Hello, @apilat. I want to delete everything that comes after that character inside the column. – Nijat Mursali Feb 12 '19 at 15:14
  • @mrbTT, I know that. The reason I posted this question was because I couldn't figure out how to use find() and replace() functions inside the function to achieve the result. – Nijat Mursali Feb 12 '19 at 15:15
  • if you have the index then it's `data=data[:index]` – anishtain4 Feb 12 '19 at 15:15
  • @NijatMursali, by everything that comes after you mean in the specific position (column + row)? Imagine the dataframe is a excel file. It searches and finds the `'window._ttzi'` at cell B22 as `And then the rabbit came window._ttzi: something something`. You need to keep the cell B22 with only `And then the rabbit came window._ttzi` and remove the rest? If so, did you try this?: https://stackoverflow.com/questions/1178335/in-python-2-4-how-can-i-strip-out-characters-after – mrbTT Feb 12 '19 at 15:16
  • @mrbTT, let say you have the data-set with more than 70k columns. I want to check if the "window" character is in the row[content] and then search in every column if the "window" character is there. If yes, then delete the characters that come after this character. The reason why I want to do is because of filtering. My data contains Javacript file which is unnecessary. – Nijat Mursali Feb 12 '19 at 15:20
  • I would use rstrip, but pandas doesn't have it as far as I know. If you have any recommendations, please share. – Nijat Mursali Feb 12 '19 at 15:26
  • I am not experienced with pandas, but could you use a regex to do `replace(r'window._ttzi.*$', '', regex=True)`? – apilat Feb 12 '19 at 15:33
  • Your way didn't, actually doesn't make sense. It can delete just the exact character. The JavaScript code inside CSV file isn't one or two. There are bunch of different kinds of functions. I have found they all start with "window", so I asked this question. – Nijat Mursali Feb 12 '19 at 15:40
  • 10 minutes - 50 views, but no one knows the problem? That's weird. – Nijat Mursali Feb 12 '19 at 15:41
  • I believe I'm comming with a suboptimal way of doing it. Just a sec – mrbTT Feb 12 '19 at 15:52

2 Answers2

0

As you've said in comments, you want to remove the character from all columns, so you could "simply" loop trough every column and get everything that appears after you character.

So a really not optmized way could be:

# Get a list of all df's columns
columns = df.columns
# dummy array
strings = []

# here is your character, if it is a list, you'll need to adjust the loop bellow
character = 'window._ttzi'

# looping trought each column
for column in columns:
    try:
        # appends everything that will come AFTER the character. Couldn't find a way to keep the character + what's before
        # so will fix it trough another loop later
        strings.append(df[column].str.split(character).str[1].values) # the 1 means after the character
    except AttributeError:
        # column is not string / object so ignore it
        pass

Adjusting the list

# flatten the array of arrays
flat_list = [item for sublist in strings for item in sublist]

# removing nan values
cleaned_list = [x for x in flat_list if str(x) != 'nan']

# Remove duplicates (set())
unique_list = list(set(cleaned_list))

Finally, replacing the original columns with new values, in other words, this will delete unnecessary data

# since we got everything we don't want, will go trough a loop once again but
# this time we will keep everything before the string. 
# instead of the split() you could also use the .replace(string, '') here
for column in columns:
    for string in unique_list:
        try:
            df[column] = df[column].str.split(string).str[0] # the zero means before the character
        except AttributeError:
        # column is not string / object
            pass
mrbTT
  • 1,399
  • 1
  • 18
  • 31
  • Thank you for your solution and comments, however as you have mentioned it's not optimal solution. I have searched about rstrip and managed to filter the data. Have a nice day. – Nijat Mursali Feb 12 '19 at 16:22
  • Thanks for your reply @NijatMursali. Could you elaborate more on how your `rstrip()` solution is accomplishing all the steps I've detailed? It seems you finished your own solution with `I will still search for other ways of deleting the unnecessary data`, ***which my solution did***. – mrbTT Feb 12 '19 at 16:49
  • The problem is that your code is not optimized as you said and only using rstrip could solve the problem. – Nijat Mursali Feb 13 '19 at 11:24
0

I searched a lot more in the internet and actually find the answer myself.

The rstip() function of pandas solve what I needed.

Firstly: we open the file with pathtofile = "t1.csv" data = pd.read_csv(pathtofile, encoding='utf-8' ,index_col=0), and then for the data file we divide it into columns and then rstripping with specific character like sub = 'window._ttzi'. So the code will be like data['contetn'].str.rstrip(sub).

I will still search for other ways of deleting the unnecessary data. Have a nice day.

Nijat Mursali
  • 930
  • 1
  • 8
  • 20