
I am trying to categorize columns and values (column=value) meaningfully from an input string using Python dictionaries.

input_string = "the status is processing and product subtypes are HL year 30 ARM and applicant name is Ryan"

I have created dictionaries of key-value pairs. In the first dictionary, the key is the column name and the value is the lowest index at which that key is found in input_string.

Here is the dictionary of column names:

dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}

In the above dictionary, 'status' has the lowest index: it occurs at position 4 in input_string.


Similarly, here is the dictionary of values:

dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}
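
For reference, these index dictionaries can be produced with str.find on a lowercased copy of the input, assuming the column names and values are already known; the known_columns and known_values lists below are only illustrative:

known_columns = ['status', 'product subtypes', 'applicant name']
known_values = ['processing', 'hl', 'year', '30', 'arm', 'ryan']
lowered = input_string.lower()
# Map each term to the lowest index at which it occurs in the input
dict_columns = {c: lowered.find(c) for c in known_columns}
dict_values = {v: lowered.find(v) for v in known_values}
# dict_columns -> {'status': 4, 'product subtypes': 29, 'applicant name': 69}
# dict_values  -> {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}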

The question is:
How do I get the expected output:

list_parsed_values = ['processing', 'hl year 30 arm', 'ryan']

and the (optional) corresponding list of columns as:

list_parsed_columns = ['status', 'product subtypes', 'applicant name']

How can I cleanly separate the values into a list like this?

  • Working with raw (unstructured) data, I suggest you use `regex` here – akash karothiya Apr 25 '17 at 10:48
  • Please add more examples of input and desired output – Azat Ibrakov Apr 25 '17 at 10:55
  • An idea: `re.split` with `r'\b(?:status|product subtypes|applicant name)\b'`, and [remove all stopwords from the items received](http://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python). Discard empty elements. To know which type of information it is, you might split with the same pattern as above but remove `?:`. Then you could check each odd column for value and even column for key. – Wiktor Stribiżew Apr 25 '17 at 11:12
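
A minimal sketch of the split-and-slice idea from the last comment, assuming the three column names are fixed (the answer below develops this approach in full):

import re

s = "the status is processing and product subtypes are HL year 30 ARM and applicant name is Ryan"
# Splitting with a capturing group keeps the matched column names in the result
parts = re.split(r'\b(status|product subtypes|applicant name)\b', s)
keys = parts[1::2]    # the captured column names sit at the odd positions
values = parts[2::2]  # the chunk following each column name holds its raw value
# keys   -> ['status', 'product subtypes', 'applicant name']
# values -> [' is processing and ', ' are HL year 30 ARM and ', ' is Ryan']
# (stopword removal and trimming, per the linked post, still have to be applied to `values`)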

1 Answer


Check the following approach:

  • Build the regex that removes irrelevant words from the results, based on the NLTK English stopword list
  • Build the regex that splits the text, using the dict_columns keys
  • After splitting, zip the resulting list into a list of (column, value) tuples
  • Remove the irrelevant words from the values and strip the whitespace

Here is the code I have come up with so far:

import nltk, re
# NB: the NLTK stopword corpus must be available, e.g. after running nltk.download('stopwords') once
s = "the status is processing and product subtypes are HL year 30 ARM and applicant name is Ryan"
dict_columns = {'status': 4, 'product subtypes': 29, 'applicant name': 69}
dict_values = {'processing': 14, 'hl': 50, 'year': 53, '30': 58, 'arm': 61, 'ryan': 87}
# Build the regex to remove irrelevant words from the results
rx_stopwords = r"\b(?:{})\b".format("|".join(nltk.corpus.stopwords.words("english")))
# Build the regex to split the text with, using the dict_columns keys
rx_split = r"\b({})\b".format("|".join(dict_columns))
chunks = re.split(rx_split, s)
# After splitting, zip the resulting list into a list of (column, value) tuples
it = iter(chunks[1:])
lst = list(zip(it, it))
# Remove the irrelevant words from the values and trim the whitespace (this can be further enhanced)
res = [(x, re.sub(rx_stopwords, "", y).strip()) for x, y in lst]
# =>
#   [('status', 'processing'), ('product subtypes', 'HL year 30 ARM'), ('applicant name', 'Ryan')]
# It can be cast to a dictionary
dict(res)
# =>
#   {'product subtypes': 'HL year 30 ARM', 'status': 'processing', 'applicant name': 'Ryan'}
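
If you need the two flat lists from the question rather than tuples, they can be read off res; a small follow-up sketch (the values are lowercased here only to match the expected output literally):

list_parsed_columns = [col for col, _ in res]
list_parsed_values = [val.lower() for _, val in res]
# list_parsed_columns -> ['status', 'product subtypes', 'applicant name']
# list_parsed_values  -> ['processing', 'hl year 30 arm', 'ryan']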
Wiktor Stribiżew