1

I have a list which holds data that was scraped from a website. The list looks something like this:

list1 = ['\nJob Description\n\nDESCRIPTION: Interacts with users and technical team members to analyze requirements and develop
technical design specifications.  Troubleshoot complex issues and make recommendations to improve efficiency and accurac
y. Interpret complex data, analyze results using statistical techniques and provide ongoing reports. Identify, analyze,
and interpret trends or patterns in complex data sets. Filter and "clean data, review reports, and performance indicator
s to locate and correct code problems. Work closely with management to prioritize business and information needs. Locate
 and define new process improvement opportunities. Employ excellent interpersonal and verbal communication skills necess
ary to effectively coordinate interrelated activities with coworkers, end-users, and management. Works autonomously with
 minimal supervision. Provides technical guidance and mentoring to other team members. Multi tasks and balances multiple
 assignments and priorities. Provides timely status updates.\nQUALIFICATIONS: Proven 5 years working experience as a dat
a analyst Technical expertise regarding data models, database design development, data mining and segmentation technique
s Knowledge of and experience with reporting packages (preferably Microsoft BI Stack), databases (SQL, DB2 etc.), and qu
ery language (SQL) Knowledge of statistics and experience using statistical packages for analyzing large datasets Strong
 analytical skills with the ability to collect, organize, analyze, and disseminate significant amounts of information wi
th attention to detail and accuracy Adept at queries, report writing and presenting findings\nNTT DATA is a leading IT s.............]

How do I remove "\n"?

Keep in mind that this has to be done in a loop while scraping, so that the data is scraped, the "\n" and the unwanted spaces are removed, and the data is pushed into the CSV.

phuclv
  • Needs to be done in Python – Mohammed Yusuf Khan Apr 09 '17 at 00:40
  • I am pretty sure python has a replace() method. If it does, it would be something like list.replaceAll("\\n", ""); – SedJ601 Apr 09 '17 at 00:41
  • see http://stackoverflow.com/a/4791169/3696510 – muttonUp Apr 09 '17 at 00:43
  • Yup, there is string replace; alternatively you can try split and join: `list2 = map(lambda x: ' '.join(filter(None, x.split('\n'))), list1)`. I'm not sure what you mean by what should be done while scraping, which spaces are wanted and which are not, and what to push into the CSV, which columns and so on. Too much guessing... Provide your code – Serge Apr 09 '17 at 01:04
  • @Serge that alternative isn't a recommended method; if it can be done without map and filter, don't use map and filter, since they create some unnecessary function calls that will slow your code down. – Taku Apr 09 '17 at 01:07
  • Removing map: `for i in range(len(list1)): list1[i] = " ".join(list1[i].split())` You can also create a new list, which might be better for your needs. This hack is not only more fun than find and replace, it also removes duplicate spaces (or use a regex). @TigerhawkT3 It is not an exact duplicate: removing unwanted spacing is not just a simple find and replace (a regex at the least is needed) – Serge Apr 09 '17 at 01:55
  • @abccd Not everybody would agree with your performance comment http://leadsift.com/loop-map-list-comprehension/ – Serge Apr 09 '17 at 02:05
  • @Serge - I've added an original for that as well. Let me know if you want me to add any more. – TigerhawkT3 Apr 09 '17 at 02:19
  • @Serge try it yourself: use the timeit module, it shows a major advantage to using a list comprehension. map + filter = 1.19301, list comprehension = 0.62941. https://repl.it/HBxc/0 – Taku Apr 09 '17 at 04:10
  • @Serge even your own source doesn't agree with you: a direct quote from your source's conclusion *If you require a list of results almost always use a list comprehension. If no results are required, using a simple loop is simpler to read and faster to run. Never use the builtin map, unless its more aesthetically appealing for that piece of code and your application does not need the speed improvement.* emphasis on the ***Never use the builtin map*** – Taku Apr 09 '17 at 04:16

2 Answers

7

Try this:

list2 = [x.replace('\n', '') for x in list1]

It uses a list comprehension to iterate through list1 and creates a new list from the original members, with str.replace called on each item to replace the \n with an empty string.

More on Python list comprehensions here.

To remove spaces as well (note that this removes every space, including the ones between words), change the code above to

list2 = [x.replace('\n', '').replace(' ', '') for x in list1]
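A quick way to see what the two versions above do is to run them on a small sample. This is only a sketch: the `sample` list is a made-up stand-in for the scraped data, and the names `no_newlines` and `no_spaces` are hypothetical.

```python
# Hypothetical sample; the real scraped strings are much longer.
sample = ['\nJob Description\n\nDESCRIPTION: Works  autonomously\n']

# Removes newlines only; spaces (including doubled ones) are kept.
no_newlines = [x.replace('\n', '') for x in sample]

# Removes newlines and every space, so the words run together.
no_spaces = [x.replace('\n', '').replace(' ', '') for x in sample]

print(no_newlines)
print(no_spaces)
```

If the goal is to keep single spaces between words, replacing '\n' with ' ' rather than '' is probably closer to what is wanted.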
jath03
  • As stated in [answer], please avoid answering unclear, broad, SW rec, typo, opinion-based, unreproducible, or duplicate questions. Write-my-code requests and low-effort homework questions are off-topic for [so] and more suited to professional coding/tutoring services. Good questions adhere to [ask], include a [mcve], have research effort, and have the potential to be useful to future visitors. Answering inappropriate questions harms the site by making it more difficult to navigate and encouraging further such questions, which can drive away other users who volunteer their time and expertise. – TigerhawkT3 Apr 09 '17 at 01:17
  • @TigerhawkT3 Downvoting an answer just because the question isn't ideal is poor form. The answer does serve to answer the question. Downvote the question not the answer. – Labrys Knossos Apr 15 '17 at 02:55
1

Removing the \n from the individual strings is fairly simple.

line = '\nJob Description\n\nDESCRIPTION:'
line.replace('\n', ' ')

You aren't very specific about what constitutes 'unwanted spaces', but assuming it means two spaces in a row, a simple approach would be .replace('  ', ' ') to remove doubled spaces. Chain the two together and you end up with:

line.replace('\n', ' ').replace('  ', ' ')

This is both simple and fast. However, it doesn't remove all excess spaces: a sequence of 3 or 4 spaces would become 2 spaces. Instead, you can use a combination of split and join to remove all excess whitespace.

' '.join(line.split())

This splits the string at all whitespace (including newlines, tabs and other whitespace) and rejoins the pieces using a single space. If that does not meet your needs, a regular expression can be used; regex matching is generally less efficient, but it is much more flexible.

import re
re.sub(r'\s{2,}', ' ', line)

This replaces any run of 2 or more whitespace characters (spaces, tabs, newlines) with a single space; a lone newline is left untouched.
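To compare the three approaches side by side, here is a small sketch on a made-up string (the sample `line` and the variable names are assumptions). Note that the regex variant here uses `\s+` plus `strip()`, which is a slightly different pattern from the `\s{2,}` shown above, so that single newlines and tabs are collapsed as well.

```python
import re

# Hypothetical sample with newlines, a tab and runs of spaces.
line = '\nJob  Description\n\nDESCRIPTION:\tWorks    autonomously '

# Chained replace: one pass only, so longer runs of spaces are not fully collapsed.
chained = line.replace('\n', ' ').replace('  ', ' ')

# split/join: splits on any whitespace run and rejoins with single spaces.
split_join = ' '.join(line.split())

# Regex: collapse every whitespace run, then trim both ends.
regex = re.sub(r'\s+', ' ', line).strip()

print(repr(chained))
print(repr(split_join))
print(repr(regex))
```

On this sample the chained replace still leaves a doubled space and the tab, while split/join and the regex variant give the same cleaned string.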

Whichever method you use to clean up a single string, you still need to apply it to each element in the list. If the method you choose is more complex, you should turn it into a function:

def process(line):
    return line.replace('\n', ' ').replace('  ',  ' ')

A naive approach would be to rebuild the list with each element processed, for example using a list comprehension:

processed_results = [process(line) for line in list1]

With a really large list this can be inefficient, because the whole processed list is held in memory at once. A better approach in that case is to use a generator expression, which processes a single element at a time without building a second list.

generated_results = (process(line) for line in list1)

Notice how it looks almost identical to the list comprehension. You can iterate through it just like a list:

for result in generated_results:
    print(result)  # do something with each cleaned string

Keep in mind that generators are consumed upon use, so if you need to iterate through the results more than once, you may need to use a list instead. A generator can be turned into a list simply by doing:

processed_results = list(generated_results)
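The "consumed upon use" behaviour can be seen in a short sketch. The `list1` data below is made up, and this `process` uses split and join rather than the chained replace shown earlier; both are assumptions for illustration.

```python
def process(line):
    # Collapse every run of whitespace to a single space.
    return ' '.join(line.split())

# Hypothetical stand-in for the scraped list.
list1 = ['\nJob Description\n', 'QUALIFICATIONS:  Proven 5 years\n']

generated_results = (process(line) for line in list1)

first_pass = list(generated_results)   # consumes the generator
second_pass = list(generated_results)  # the generator is already exhausted

print(first_pass)
print(second_pass)
```

The second pass comes back empty, which is why a list is the safer choice when the results are needed more than once.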

TL;DR

The simplest and most efficient method is to use split and join to remove excess whitespace, and a generator expression to avoid rebuilding the entire list:

generated_results = (' '.join(line.split()) for line in list1)
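Since the question mentions pushing the cleaned data into a CSV, here is a minimal sketch of how that could fit together using the standard csv module. The helper names `clean` and `write_cleaned`, the 'description' header and the sample `scraped` list are all hypothetical; an in-memory buffer stands in for the real file.

```python
import csv
import io

def clean(text):
    # Collapse newlines, tabs and repeated spaces into single spaces.
    return ' '.join(text.split())

def write_cleaned(rows, out_file):
    # Write one cleaned string per CSV row, processing items one at a time.
    writer = csv.writer(out_file)
    writer.writerow(['description'])  # hypothetical column header
    for text in rows:
        writer.writerow([clean(text)])

# Hypothetical scraped strings.
scraped = [
    '\nJob Description\n\nDESCRIPTION: Interacts  with users\n',
    'QUALIFICATIONS:\tProven 5 years\n',
]

buffer = io.StringIO()  # stands in for a real file object
write_cleaned(scraped, buffer)
print(buffer.getvalue())
```

For a real file, something like `open('jobs.csv', 'w', newline='')` (the file name is an assumption) could be passed in place of the buffer, and `rows` could be whatever iterable the scraping loop produces.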
Labrys Knossos