1

I'm reading a lot of log files and parsing each log into a dictionary. I want to add these dictionaries to a DataFrame that I later use for analysis. The information I need in the DataFrame may differ each time, based on user input, so I don't want every key in the dictionary added to the DataFrame. I want only the columns I define to be added.

Right now I append the dictionaries one by one to a list, then load that list into a DataFrame.

import pandas as pd

my_dict_list = []
for log in log_lines:
    # logic to parse the log and generate the dictionary d goes here
    my_dict_list.append(d)
df = pd.DataFrame(my_dict_list)

This adds all the keys and their values to the DataFrame. What I want instead is to define some columns. For example, if the user asks for columns ['a','b','c'], the DataFrame should load only those keys and their values and ignore the rest.

my_dict_list =[ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
                {'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
                {'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]

Note: I don't want to drop these keys while extracting the logs, because I will be extracting a lot of logs and that would be time-consuming.

Is there a faster way to achieve this using pandas?

Murali
  • I suggest creating a DataFrame with all the values and then removing the columns the user doesn't require, because removing key-value pairs from a list of dicts is slower than removing columns in pandas. To remove columns, see https://stackoverflow.com/questions/51167612/what-is-the-best-way-to-remove-column-in-pandas – Mohamed Thasin ah Jul 25 '19 at 06:17
  • @MohamedThasinah Yes, I can do that, but is there any way to define the DataFrame so that it takes only the columns we specify? That way we could read the data faster and avoid redoing the work. – Murali Jul 25 '19 at 06:21
  • You can't skip columns while calling pd.DataFrame(). You can pass `usecols` to `read_csv`, but for that you need your data as a file. If you need to use the same data in multiple places, I suggest writing your data to a file and using `usecols`. It will save you time. – Mohamed Thasin ah Jul 25 '19 at 06:24
  • Use `reindex` along `axis=1`? i.e. `pd.DataFrame(my_dict_list).reindex(['a', 'b', 'c'], axis=1)` – Chris Adams Jul 25 '19 at 06:39
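
The `reindex` suggestion from the last comment can be sketched as follows, using the sample data from the question. The names `requested_columns` and `with_missing`, and the column `'zzz'`, are made up for illustration. One point worth noting is that `reindex` does not raise an error for a requested column that is absent from the data; it creates that column and fills it with NaN.

```python
import pandas as pd

my_dict_list = [
    {'a': 'abc', 'b': '123', 'c': 'hello', 'date': '20-5-2019'},
    {'a': 'dfc', 'b': '453', 'c': 'user', 'date': '23-5-2019'},
    {'a': 'bla', 'b': '2313', 'c': 'anything', 'date': '25-5-2019'},
]

# Hypothetical user selection
requested_columns = ['a', 'b', 'c']

# Build the full frame, then keep only the requested columns
df = pd.DataFrame(my_dict_list).reindex(requested_columns, axis=1)

# A column that does not exist in the data is created and filled with NaN
with_missing = pd.DataFrame(my_dict_list).reindex(['a', 'zzz'], axis=1)

print(df)
print(with_missing)
```

This still builds the full DataFrame first, so it does not avoid loading the unwanted keys; it only drops them afterwards.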

2 Answers

0

Here is some rough logic for your question. I may be wrong in places, but I hope you find it helpful. You can also mail me with future questions; I will be happy to help.

import pandas as pd

columns = []
x = int(input('Enter the number of columns you need: '))

# collect the column names one at a time
for i in range(x):
    columns.append(input('Please specify a column: '))

my_dict_list = [ {'a':'abc','b':'123','c':'hello', 'date':'20-5-2019'},
                 {'a':'dfc','b':'453','c':'user', 'date':'23-5-2019'},
                 {'a':'bla','b':'2313','c':'anything', 'date':'25-5-2019'} ]

# load everything, then keep only the requested columns
value = pd.DataFrame(my_dict_list)
print(value[columns])
marc_s
0

On the tmp_Dict line you can filter the parsed dictionary so that only the requested columns are saved.

def log_dataframe(log_lines, requested_columns):
    my_dict_list = []
    for log in log_lines:
        # here logic to parse the log and generate the dictionary d

        # keep only the keys the user asked for
        tmp_Dict = {requested_key : d[requested_key] for requested_key in requested_columns}
        my_dict_list.append(tmp_Dict)
    return pd.DataFrame(my_dict_list)
Ilker Kurtulus
  • As I noted in the question, I don't want the filtering to happen at the extraction level because I have to do this for millions of logs. I want it to happen in the DataFrame, if there is a way. – Murali Jul 25 '19 at 07:57
  • Then you can drop tmp_Dict and simply do return pd.DataFrame(my_dict_list)[requested_columns]. However, the first solution is more efficient in terms of memory. If you want a solution for the log parsing, you should share some logs so we can write code for them. – Ilker Kurtulus Jul 25 '19 at 10:06
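
As a minimal sketch of the DataFrame-level selection discussed in these comments: as far as I know, the `pd.DataFrame` constructor also accepts a `columns` argument, and when the data is a list of dicts it selects only those keys. The example below uses the sample data from the question and compares that with slicing after construction. The names `requested_columns`, `df_selected` and `df_sliced` are illustrative, and whether either form is faster on millions of logs is not measured here.

```python
import pandas as pd

my_dict_list = [
    {'a': 'abc', 'b': '123', 'c': 'hello', 'date': '20-5-2019'},
    {'a': 'dfc', 'b': '453', 'c': 'user', 'date': '23-5-2019'},
    {'a': 'bla', 'b': '2313', 'c': 'anything', 'date': '25-5-2019'},
]

# Hypothetical user selection
requested_columns = ['a', 'b', 'c']

# Option 1: let the constructor pick the keys via `columns`
df_selected = pd.DataFrame(my_dict_list, columns=requested_columns)

# Option 2: build the full frame, then slice out the requested columns
df_sliced = pd.DataFrame(my_dict_list)[requested_columns]

print(df_selected)
print(df_selected.equals(df_sliced))
```

The two options differ when a requested key is missing from the data: slicing with `[requested_columns]` raises a `KeyError`, while the `columns` argument produces a column of NaN.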