
Split a string by spaces in Python, preserving quoted substrings

I have a string like the following:

this is "a test"

I'm trying to write something in Python that splits it by spaces while ignoring spaces inside quotes. The result I'm looking for is:

['this','is','a test']

PS. I know you are going to ask "what happens if there are quotes within the quotes?" Well, in my application, that will never happen.

230
Adam Pierce

You want split, from the shlex module.

>>> import shlex
>>> shlex.split('this is "a test"')
['this', 'is', 'a test']

This should do exactly what you want.
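
As a small illustration of what shlex.split does with the quotes, the sketch below compares the default POSIX mode with posix=False. The variable names are only for this example.

```python
import shlex

text = 'this is "a test"'

# Default (POSIX) mode: quotes group words and are removed from the tokens.
posix_tokens = shlex.split(text)

# Non-POSIX mode: the quoted token keeps its quote characters.
raw_tokens = shlex.split(text, posix=False)

print(posix_tokens)
print(raw_tokens)
```

If the quotes themselves matter downstream, posix=False may be the simpler choice; otherwise the default already matches the result asked for in the question.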

349
Jerub

Have a look at the shlex module, particularly shlex.split.

>>> import shlex
>>> shlex.split('This is "a test"')
['This', 'is', 'a test']
54
Allen

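I see regex approaches here that look complex and/or wrong. This surprises me, because regex syntax can easily describe "whitespace or thing-surrounded-by-quotes", and most regex engines (including Python's) can split on a regex. So if you're going to use regexes, why not just say exactly what you mean?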

import re

test = 'this is "a test"'  # or "this is 'a test'"
# pieces = [p for p in re.split("( |[\\\"'].*[\\\"'])", test) if p.strip()]
# From comments, use this:
pieces = [p for p in re.split("( |\\\".*?\\\"|'.*?')", test) if p.strip()]

Explanation:

[\\\"'] = double-quote or single-quote
.* = anything
( |X) = space or X
.strip() = remove space and empty-string separators

shlex probably provides more features, though.
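
Note that the split above keeps the quote characters on the quoted piece. A minimal sketch of one way to drop them afterwards, assuming one pair of matching outer quotes per piece (SPLIT_RE and split_outside_quotes are names made up for this example):

```python
import re

# Same alternatives as above, written as a triple-quoted raw string.
SPLIT_RE = re.compile(r"""( |".*?"|'.*?')""")

def split_outside_quotes(text):
    # The capturing group makes re.split return the separators too;
    # single spaces and empty strings are filtered out by p.strip().
    pieces = [p for p in SPLIT_RE.split(text) if p.strip()]
    # Drop one pair of matching outer quotes, if present.
    return [p[1:-1] if len(p) >= 2 and p[0] == p[-1] and p[0] in "\"'" else p
            for p in pieces]

print(split_outside_quotes('this is "a test"'))
```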

32
Kate

Depending on your use case, you may also want to check out the csv module:

import csv
lines = ['this is "a string"', 'and more "stuff"']
for row in csv.reader(lines, delimiter=" "):
    print(row)  # the parenthesized form works on Python 2 and 3

Output:

['this', 'is', 'a string']
['and', 'more', 'stuff']
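
csv.reader only treats the double quote as a quote character by default. If the input uses single quotes instead, a sketch like the following, using the reader's quotechar parameter, may help (the sample lines are made up):

```python
import csv

lines = ["this is 'a string'", "and more 'stuff'"]

# quotechar defaults to '"'; here it is switched to the single quote.
rows = list(csv.reader(lines, delimiter=" ", quotechar="'"))

for row in rows:
    print(row)
```

Only one quote character can be active at a time, so this does not cover input that mixes both styles.
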
25
Ryan Ginstrom

I used shlex.split to process 70,000,000 lines of squid log, and it was far too slow. So I switched to re.

If you have performance problems with shlex, try this:

import re

def line_split(line):
    return re.findall(r'[^"\s]\S*|".+?"', line)
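
The findall version returns quoted tokens with their quotes still attached. A sketch of a variant that compiles the pattern once and removes the outer quotes, assuming that is the desired result (TOKEN_RE and line_split_unquoted are hypothetical names):

```python
import re

# Compiled once so that a loop over many log lines reuses the same object.
TOKEN_RE = re.compile(r'[^"\s]\S*|".+?"')

def line_split_unquoted(line):
    # Tokens that start with '"' come from the quoted alternative,
    # so they also end with '"' and can be sliced safely.
    return [t[1:-1] if t.startswith('"') else t for t in TOKEN_RE.findall(line)]

print(line_split_unquoted('this is "a test"'))
```
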
12
Daniel Dai

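Since this question is tagged with regex, I decided to try a regex approach. I first replace all the spaces inside the quoted parts with \x00, then split by spaces, then replace the \x00 back to spaces in each part.

Both versions do the same thing, but splitter is a bit more readable than splitter2.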

import re

s = 'this is "a test" some text "another test"'

def splitter(s):
    def replacer(m):
        return m.group(0).replace(" ", "\x00")
    parts = re.sub('".+?"', replacer, s).split()
    parts = [p.replace("\x00", " ") for p in parts]
    return parts

def splitter2(s):
    return [p.replace("\x00", " ") for p in re.sub('".+?"', lambda m: m.group(0).replace(" ", "\x00"), s).split()]

print(splitter2(s))
8
gooli

It seems that re is faster for performance reasons. Here is my solution, which uses a non-greedy operator and preserves the outer quotes:

re.findall(r'(?:".*?"|\S)+', s)

Result:

['this', 'is', '"a test"']

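It keeps constructs like aaa"bla blub"bbb together, because these tokens are not separated by spaces. If the string contains escaped characters, you can match like this: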

>>> a = "She said \"He said, \\\"My name is Mark.\\\"\""
>>> a
'She said "He said, \\"My name is Mark.\\""'
>>> for i in re.findall(r'(?:".*?[^\\]"|\S)+', a): print(i)
...
She
said
"He said, \"My name is Mark.\""

Please note that this also matches the empty string "" by means of the \S part of the pattern.
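
To make the "kept together" behaviour concrete, here is a short sketch using the first pattern from this answer; the sample input is made up.

```python
import re

TOKEN_RE = re.compile(r'(?:".*?"|\S)+')

# A quoted part glued to surrounding text stays in one token.
tokens = TOKEN_RE.findall('x aaa"bla blub"bbb y')
print(tokens)

# The outer quotes are preserved on ordinary quoted tokens.
quoted = TOKEN_RE.findall('this is "a test"')
print(quoted)
```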

3
hochl

To preserve the quotes, use this function:

def getArgs(s):
    args = []
    cur = ''
    inQuotes = 0
    for char in s.strip():
        if char == ' ' and not inQuotes:
            args.append(cur)
            cur = ''
        elif char == '"' and not inQuotes:
            inQuotes = 1
            cur += char
        elif char == '"' and inQuotes:
            inQuotes = 0
            cur += char
        else:
            cur += char
    args.append(cur)
    return args
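
One thing to watch with the loop above is that consecutive spaces outside quotes produce empty strings in the result. A sketch of the same character scan that skips empty tokens, assuming that is preferable (split_keep_quotes is a hypothetical name):

```python
def split_keep_quotes(s):
    """Split on spaces outside double quotes, keeping the quote characters."""
    args = []
    cur = []
    in_quotes = False
    for char in s.strip():
        if char == '"':
            # Toggle the state and keep the quote as part of the token.
            in_quotes = not in_quotes
            cur.append(char)
        elif char == ' ' and not in_quotes:
            # End of a token; ignore runs of spaces.
            if cur:
                args.append(''.join(cur))
                cur = []
        else:
            cur.append(char)
    if cur:
        args.append(''.join(cur))
    return args

print(split_keep_quotes('this  is "a  test"'))
```
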
3
THE_MAD_KING

Speed test of the different answers:

import re
import shlex
import csv

line = 'this is "a test"'

%timeit [p for p in re.split("( |\\\".*?\\\"|'.*?')", line) if p.strip()]
100000 loops, best of 3: 5.17 µs per loop

%timeit re.findall(r'[^"\s]\S*|".+?"', line)
100000 loops, best of 3: 2.88 µs per loop

%timeit list(csv.reader([line], delimiter=" "))
The slowest run took 9.62 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.4 µs per loop

%timeit shlex.split(line)
10000 loops, best of 3: 50.2 µs per loop
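
The %timeit lines above need IPython. A rough equivalent with the standard timeit module could look like this sketch; the loop counts are arbitrary and the numbers will differ by machine and Python version.

```python
import csv
import re
import shlex
import timeit

line = 'this is "a test"'

# Each candidate is a zero-argument callable so timeit can call it directly.
candidates = {
    're.findall': lambda: re.findall(r'[^"\s]\S*|".+?"', line),
    'csv.reader': lambda: list(csv.reader([line], delimiter=' ')),
    'shlex.split': lambda: shlex.split(line),
}

number = 1000
timings = {
    name: min(timeit.repeat(fn, number=number, repeat=3)) / number
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print('%-12s %.2f us per call' % (name, seconds * 1e6))
```
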
2
har777

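The main problem with the accepted shlex approach is that it does not ignore escape characters outside quoted substrings, and it gives slightly unexpected results in some corner cases.

I have the following use case: I need a split function that splits input strings such that single-quoted or double-quoted substrings are preserved, with the ability to escape quotes within such a substring. Quotes within an unquoted string should not be treated differently from any other character. Some example test cases with the expected output: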

 input string        | expected output
===============================================
 'abc def'           | ['abc', 'def']
 "abc \\s def"       | ['abc', '\\s', 'def']
 '"abc def" ghi'     | ['abc def', 'ghi']
 "'abc def' ghi"     | ['abc def', 'ghi']
 '"abc \\" def" ghi' | ['abc " def', 'ghi']
 "'abc \\' def' ghi" | ["abc ' def", 'ghi']
 "'abc \\s def' ghi" | ['abc \\s def', 'ghi']
 '"abc \\s def" ghi' | ['abc \\s def', 'ghi']
 '"" test'           | ['', 'test']
 "'' test"           | ['', 'test']
 "abc'def"           | ["abc'def"]
 "abc'def'"          | ["abc'def'"]
 "abc'def' ghi"      | ["abc'def'", 'ghi']
 "abc'def'ghi"       | ["abc'def'ghi"]
 'abc"def'           | ['abc"def']
 'abc"def"'          | ['abc"def"']
 'abc"def" ghi'      | ['abc"def"', 'ghi']
 'abc"def"ghi'       | ['abc"def"ghi']
 "r'AA' r'.*_xyz$'"  | ["r'AA'", "r'.*_xyz$'"]

I ended up with the following function, which splits a string so that all of the input strings above produce the expected output:

import re

def quoted_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") \
            for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

The following test application checks the results of the other approaches (shlex and csv for now) and of the custom split implementation:

#!/bin/python2.7

import csv
import re
import shlex

from timeit import timeit

def test_case(fn, s, expected):
    try:
        if fn(s) == expected:
            print '[ OK ] %s -> %s' % (s, fn(s))
        else:
            print '[FAIL] %s -> %s' % (s, fn(s))
    except Exception as e:
        print '[FAIL] %s -> exception: %s' % (s, e)

def test_case_no_output(fn, s, expected):
    try:
        fn(s)
    except:
        pass

def test_split(fn, test_case_fn=test_case):
    test_case_fn(fn, 'abc def', ['abc', 'def'])
    test_case_fn(fn, "abc \\s def", ['abc', '\\s', 'def'])
    test_case_fn(fn, '"abc def" ghi', ['abc def', 'ghi'])
    test_case_fn(fn, "'abc def' ghi", ['abc def', 'ghi'])
    test_case_fn(fn, '"abc \\" def" ghi', ['abc " def', 'ghi'])
    test_case_fn(fn, "'abc \\' def' ghi", ["abc ' def", 'ghi'])
    test_case_fn(fn, "'abc \\s def' ghi", ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"abc \\s def" ghi', ['abc \\s def', 'ghi'])
    test_case_fn(fn, '"" test', ['', 'test'])
    test_case_fn(fn, "'' test", ['', 'test'])
    test_case_fn(fn, "abc'def", ["abc'def"])
    test_case_fn(fn, "abc'def'", ["abc'def'"])
    test_case_fn(fn, "abc'def' ghi", ["abc'def'", 'ghi'])
    test_case_fn(fn, "abc'def'ghi", ["abc'def'ghi"])
    test_case_fn(fn, 'abc"def', ['abc"def'])
    test_case_fn(fn, 'abc"def"', ['abc"def"'])
    test_case_fn(fn, 'abc"def" ghi', ['abc"def"', 'ghi'])
    test_case_fn(fn, 'abc"def"ghi', ['abc"def"ghi'])
    test_case_fn(fn, "r'AA' r'.*_xyz$'", ["r'AA'", "r'.*_xyz$'"])

def csv_split(s):
    return list(csv.reader([s], delimiter=' '))[0]

def re_split(s):
    def strip_quotes(s):
        if s and (s[0] == '"' or s[0] == "'") and s[0] == s[-1]:
            return s[1:-1]
        return s
    return [strip_quotes(p).replace('\\"', '"').replace("\\'", "'") for p in re.findall(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+', s)]

if __name__ == '__main__':
    print 'shlex\n'
    test_split(shlex.split)
    print

    print 'csv\n'
    test_split(csv_split)
    print

    print 're\n'
    test_split(re_split)
    print

    iterations = 100
    setup = 'from __main__ import test_split, test_case_no_output, csv_split, re_split\nimport shlex, re'
    def benchmark(method, code):
        print '%s: %.3fms per iteration' % (method, (1000 * timeit(code, setup=setup, number=iterations) / iterations))
    benchmark('shlex', 'test_split(shlex.split, test_case_no_output)')
    benchmark('csv', 'test_split(csv_split, test_case_no_output)')
    benchmark('re', 'test_split(re_split, test_case_no_output)')

Output:

shlex

[ OK ] abc def -> ['abc', 'def']
[FAIL] abc \s def -> ['abc', 's', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[FAIL] 'abc \' def' ghi -> exception: No closing quotation
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[FAIL] abc'def -> exception: No closing quotation
[FAIL] abc'def' -> ['abcdef']
[FAIL] abc'def' ghi -> ['abcdef', 'ghi']
[FAIL] abc'def'ghi -> ['abcdefghi']
[FAIL] abc"def -> exception: No closing quotation
[FAIL] abc"def" -> ['abcdef']
[FAIL] abc"def" ghi -> ['abcdef', 'ghi']
[FAIL] abc"def"ghi -> ['abcdefghi']
[FAIL] r'AA' r'.*_xyz$' -> ['rAA', 'r.*_xyz$']

csv

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[FAIL] 'abc def' ghi -> ["'abc", "def'", 'ghi']
[FAIL] "abc \" def" ghi -> ['abc \\', 'def"', 'ghi']
[FAIL] 'abc \' def' ghi -> ["'abc", "\\'", "def'", 'ghi']
[FAIL] 'abc \s def' ghi -> ["'abc", '\\s', "def'", 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[FAIL] '' test -> ["''", 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

re

[ OK ] abc def -> ['abc', 'def']
[ OK ] abc \s def -> ['abc', '\\s', 'def']
[ OK ] "abc def" ghi -> ['abc def', 'ghi']
[ OK ] 'abc def' ghi -> ['abc def', 'ghi']
[ OK ] "abc \" def" ghi -> ['abc " def', 'ghi']
[ OK ] 'abc \' def' ghi -> ["abc ' def", 'ghi']
[ OK ] 'abc \s def' ghi -> ['abc \\s def', 'ghi']
[ OK ] "abc \s def" ghi -> ['abc \\s def', 'ghi']
[ OK ] "" test -> ['', 'test']
[ OK ] '' test -> ['', 'test']
[ OK ] abc'def -> ["abc'def"]
[ OK ] abc'def' -> ["abc'def'"]
[ OK ] abc'def' ghi -> ["abc'def'", 'ghi']
[ OK ] abc'def'ghi -> ["abc'def'ghi"]
[ OK ] abc"def -> ['abc"def']
[ OK ] abc"def" -> ['abc"def"']
[ OK ] abc"def" ghi -> ['abc"def"', 'ghi']
[ OK ] abc"def"ghi -> ['abc"def"ghi']
[ OK ] r'AA' r'.*_xyz$' -> ["r'AA'", "r'.*_xyz$'"]

shlex: 0.281ms per iteration
csv: 0.030ms per iteration
re: 0.049ms per iteration

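So the performance is much better than that of shlex, and it can be improved further by precompiling the regular expression, in which case it will outperform the csv approach.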
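
A minimal sketch of that precompiling idea, using the same pattern and quote-stripping logic as the function in this answer. TOKEN_RE and quoted_split_compiled are names chosen for this example, and no timing claim is made for it here.

```python
import re

# Compiled once at import time instead of on every call.
TOKEN_RE = re.compile(r'"(?:\\.|[^"])*"|\'(?:\\.|[^\'])*\'|[^\s]+')

def _strip_quotes(token):
    # Remove the outer quotes only when the token starts and ends with the same one.
    if token and token[0] in '"\'' and token[0] == token[-1]:
        return token[1:-1]
    return token

def quoted_split_compiled(s):
    # Unescape \" and \' after the outer quotes are removed.
    return [_strip_quotes(p).replace('\\"', '"').replace("\\'", "'")
            for p in TOKEN_RE.findall(s)]

print(quoted_split_compiled('"abc \\" def" ghi'))
```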

2
Ton van den Heuvel

To get around the unicode issues in some Python 2 versions, I suggest:

from shlex import split as _split
split = lambda a: [b.decode('utf-8') for b in _split(a.encode('utf-8'))]
2
moschlar

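The unicode issues with shlex discussed above (top answer) seem to have been resolved (indirectly) in 2.7.2+, as per http://bugs.python.org/issue6988#msg146200

(Posted as a separate answer because I can't comment.)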
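
For readers on Python 3, where str is already Unicode, the encode/decode workaround should not be needed. A small sketch with a made-up sample string:

```python
import shlex

# Python 3 str is Unicode, so shlex.split can be called on it directly.
tokens = shlex.split('héllo "wörld ünïcode"')
print(tokens)
```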

1
Tyris

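Hmm, I can't seem to find the "Reply" button... Anyway, this answer is based on Kate's approach, but it correctly splits strings whose substrings contain escaped quotes, and it also removes the start and end quotes of the substrings: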

  [i.strip('"').strip("'") for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]

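This works on strings like 'This is " a \\\"test\\\"\\\'s substring"' (the insane markup is unfortunately necessary to keep Python from removing the escapes).

If you don't want the escapes in the strings in the returned list, you can use this slightly altered version (Python 2 only, since it relies on the string_escape codec):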

[i.strip('"').strip("'").decode('string_escape') for i in re.split(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')', string) if i.strip()]
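
For reference, here is the first expression from this answer wrapped in a function and run on the sample string, as a sketch that also works on Python 3. SPLIT_RE and split_with_escapes are names made up for this example; the escapes are left in the output.

```python
import re

# The lookbehinds (?<!\\) keep a backslash-escaped quote from ending a substring.
SPLIT_RE = re.compile(r'(\s+|(?<!\\)".*?(?<!\\)"|(?<!\\)\'.*?(?<!\\)\')')

def split_with_escapes(string):
    return [i.strip('"').strip("'") for i in SPLIT_RE.split(string) if i.strip()]

sample = 'This is " a \\\"test\\\"\\\'s substring"'
result = split_with_escapes(sample)
print(result)
```

Because str.strip removes every leading and trailing quote character, a substring that ends in an escaped quote would lose that quote as well.
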
1
user261478

I suggest:

Test string:

s = 'abc "ad" \'fg\' "kk\'rdt\'" zzz"34"zzz "" \'\''

To also capture "" and '':

import re
re.findall(r'"[^"]*"|\'[^\']*\'|[^"\'\s]+',s)

Result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz', '""', "''"]

To ignore empty "" and '':

import re
re.findall(r'"[^"]+"|\'[^\']+\'|[^"\'\s]+',s)

Result:

['abc', '"ad"', "'fg'", '"kk\'rdt\'"', 'zzz', '"34"', 'zzz']
0
hussic